Microphone Array

We create, design, and implement algorithms and devices for better sound capture, spatial filtering, and noise suppression. Applications include sound capture for personal computers and meeting rooms. Sound source localization algorithms are used to aim the beam toward the current speaker.

Publications about the Microphone Array project

Why do we need good sound capture, and what harms capture quality?

PCs and other computing devices usually can play sounds well, but they do a poor job of recording them. With today's processing power, storage capacity, broadband connections, and speech recognition engines, there is an opportunity for computing devices to use sound to deliver more value to customers. They can provide better live communication than phones, much better recording/playback or note-taking than tape recorders, and better command UIs than remote controls. However, most machines still use the old paradigm of a single microphone, and that doesn't work well: the microphone picks up too much ambient noise and adds too much electronic noise. So, today people have to use tethered headsets if they need good sound quality.

The solution for better sound: microphone arrays

A system of several closely positioned microphones is called a microphone array. Capturing the sound signal at several points allows, with proper processing, spatial filtering, also called beamforming. That means the sensors and the associated processing amplify the signal coming from a specific direction (the beam) while attenuating signals from other directions. The main benefits of using this technique are:

  • Reduction of ambient noises.
  • Partial de-reverberation, because most indirect paths are attenuated.
  • Reducing the effects of electronic noise.

As a result we get a better signal-to-noise ratio and a drier sound, leading to a much better user experience and a lower error rate in speech recognition.

How does it work?

Microphone array processing, in direct or indirect form, consists of two main procedures: a sound source localizer and a beamformer. The first finds where the sound source is, and should work reliably under noisy and reverberant conditions; it tells the beamformer where to focus the microphone array "beam".

Sound Source Localization

There are several approaches to determine the direction to the sound source.

    Time delay estimation (TDE) based methods use the fact that sound reaches the microphones at slightly different times. The delays are easily computed using the cross-correlation function between the signals from different microphones. Variations of this approach use different weightings (maximum likelihood, PHAT, etc.) to improve the reliability and stability of the results under noise and reverberation.
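As a rough illustration of PHAT-weighted TDE between one pair of microphones, the sketch below estimates the delay from the peak of the whitened cross-correlation computed via the FFT. The function name and interface are illustrative, not the project's actual code.

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """Estimate the delay of x2 relative to x1 (in seconds) using the
    PHAT-weighted generalized cross-correlation."""
    n = len(x1) + len(x2)                 # zero-pad to avoid circular wrap
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    R = X2 * np.conj(X1)                  # cross-power spectrum
    R /= np.abs(R) + 1e-12                # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_lag = n // 2
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag]))  # negative lags first
    lag = np.argmax(np.abs(cc)) - max_lag # peak position = delay in samples
    return lag / fs
```

The PHAT weighting discards magnitude and keeps only phase information, which sharpens the correlation peak and makes the estimate more robust to reverberation than the plain cross-correlation.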

    As most microphone arrays today have more than two microphones, there are several ways to compute the overall direction. Finding the direction from all possible pairs and averaging doesn't work well under reverberation. The most common method is to test hypotheses for the direction of arrival using the sum of all cross-correlation functions with the proper delays.

    Another approach is to steer the beam and compute the direction from the maximum output signal. This method gives results similar to time delay estimation with ML weighting.
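The steered-beam approach can be sketched as a steered response power scan: steer a delay-and-sum beam over a grid of candidate directions and pick the angle with the maximum output power. This is a minimal far-field sketch for a linear array; the function name and parameters are illustrative assumptions.

```python
import numpy as np

def srp_scan(frames, mic_x, fs, c=343.0):
    """Steer a delay-and-sum beam over candidate directions and return the
    angle (degrees) with the maximum output power.
    frames: (num_mics, num_samples) signals; mic_x: mic positions in meters."""
    num_mics, n = frames.shape
    spectra = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    angles = np.arange(-90, 91)
    powers = []
    for theta in angles:
        # Far-field plane-wave delays for a source at angle theta
        delays = mic_x * np.sin(np.deg2rad(theta)) / c
        # Phase-align each microphone, then sum (delay-and-sum in frequency)
        steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        output = (spectra * steering).sum(axis=0)
        powers.append(np.sum(np.abs(output) ** 2))
    return angles[int(np.argmax(powers))]
```

When the steering delays match the true propagation delays, the microphone signals add coherently and the output power peaks, so the scan maximum indicates the direction of arrival.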

In all cases post-processing of the sound source localization results is critical. Various methods are used, ranging from particle filtering to real-time clustering. The goal of the post-processor is to remove accidental reflections and reverberations, leaving the results from one or more sound sources.
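A clustering post-processor of this kind can be sketched as grouping the stream of raw per-frame direction estimates and keeping only the well-supported clusters; isolated estimates, typically caused by reflections, are discarded. This toy version (names and thresholds are illustrative assumptions, not the actual algorithm) uses running cluster means:

```python
def cluster_directions(estimates, tol=8.0, min_count=5):
    """Group raw per-frame direction estimates (degrees) into clusters and
    keep only clusters with enough support, discarding spurious reflections."""
    clusters = []  # each cluster: [sum_of_angles, count]
    for angle in estimates:
        for c in clusters:
            if abs(angle - c[0] / c[1]) <= tol:   # close to cluster mean
                c[0] += angle
                c[1] += 1
                break
        else:
            clusters.append([angle, 1])           # start a new cluster
    # Report the mean direction of each sufficiently supported cluster
    return sorted(c[0] / c[1] for c in clusters if c[1] >= min_count)
```

A real-time variant would additionally age out old measurements so that clusters track moving talkers.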

Our implementation uses a novel approach that is not based on computing cross-correlation functions. The post-processor is a real-time clustering-based algorithm. The chart below shows actual results from the sound source localizer: two people in a conference room talking at 5 and 37 degrees, at a distance of 6 feet, under normal noise and reverberation conditions. The horizontal axis is time in seconds, the vertical axis is the angle in degrees. The green dots are the results from the sound source localizer, the blue stars are the post-processor output, and the red crosses show where the capture beam actually points.



The canonical form of the time-invariant beamformer in the frequency domain is just a weighted sum:

    Y(f) = Σ Wm(f)·Xm(f),  summed over the microphones m = 1 … M

where Xm(f) is the signal captured from the m-th microphone, Y(f) is the beamformer output, and Wm(f) are the time-invariant, frequency-dependent weights. With properly designed weights we can aim the beam in a given direction, reducing the ambient noise and reverberation. A sample directivity pattern for a beam pointing at 0 degrees is shown in the next figure.
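The weighted sum Y(f) = Σ Wm(f)·Xm(f) can be sketched with the simplest possible weight design, delay-and-sum, where Wm(f) is a pure phase shift that time-aligns the microphones for the desired direction. This is only an illustration of the canonical form, not the project's weight design.

```python
import numpy as np

def beamform(frames, fs, mic_x, theta_deg, c=343.0):
    """Apply Y(f) = sum_m Wm(f) * Xm(f) with delay-and-sum weights steering
    a linear array toward theta_deg; returns the time-domain output."""
    num_mics, n = frames.shape
    spectra = np.fft.rfft(frames, axis=1)            # Xm(f)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    delays = mic_x * np.sin(np.deg2rad(theta_deg)) / c
    # Wm(f): phase-align each microphone, normalize by the array size
    W = np.exp(2j * np.pi * freqs[None, :] * delays[:, None]) / num_mics
    Y = (W * spectra).sum(axis=0)                    # the weighted sum
    return np.fft.irfft(Y, n=n)
```

More sophisticated designs replace the delay-and-sum phases with weights optimized for noise suppression and directivity index, but they plug into exactly the same weighted-sum form.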

While fast and reliable, the time-invariant beamformer has limited performance (noise suppression and directivity index). More sophisticated adaptive beamforming algorithms are known in the research community. We have designed ours as an adaptive spatial filter running after the beamformer.

Experimental hardware

Our experimental work was done on two USB microphone arrays: one four-element linear array and one eight-element circular array. Both microphone arrays use unidirectional cardioid microphones.

The four-element linear array is designed to sit on the top bezel of the monitor, works in a ±50° range, and covers one office. It is 195 mm long.

The circular array is intended for the center of a conference room table, works in a full 360° range, and is designed to capture meetings. It has a diameter of 160 mm.

Microphone Array support in Windows Vista

And, yes, Windows Vista shipped with integrated microphone array support. For more details, see our presentations during WinHEC 2005 in Seattle and Taipei. You can find more information in our white papers:

"Microphone Array Support in Windows Vista"
"How to Build and Use Microphone Arrays for Windows Vista"
"Microphone Array Verification Tool"

Questions about adoption and usage of this technology can be sent to micarrex--at--microsoft--dot--com.