Create, design and implement algorithms and devices for better sound capturing, spatial filtering and noise suppression. Applications include sound capturing for personal computers and meeting rooms. Sound source localization algorithms are used to aim the beam towards the current speaker.
Publications about the Microphone array project
- Microphone array on Microsoft Insider - interview with Rico Malvar, April 2004 (10 Mbytes!).
- "Defeating ambient noise" - interview with Rico Malvar and Ivan Tashev, July 2004.
- "Microphone array support coming with Windows Vista" - blog, April 2006.
- "Microphone Array for Audience Capture in Lecture Rooms" - presentation by Rong Hu, intern at Microsoft Research, August 2006.
- "Robust Constrained GSC Algorithm for Microphone Array Processing" - presentation by Byung-Jun Yoon, intern at Microsoft Research, September 2006.
- "The future, and why I'm at Microsoft: Array Microphones" - blog by Richard Sprague, March 2007.
- "Insight on Vista's Microphone Array Technology" - Softpedia, September 2007.
- "Windows Vista and Digital Microphone Arrays" - Softpedia, November 2007.
Why do we need good sound capture, and what harms sound capture quality?
PCs and other computing devices usually can play sounds well, but they do a poor job at recording. With today’s processing power, storage capacities, broadband connections, and speech recognition engines of the computing world, there’s an opportunity for computing devices to use sounds to deliver more value to customers. They can provide better live communication than phones, much better recording/playback or note-taking devices than tape recorders, and better command UIs than remote controls. However, most machines still use the old paradigm of a single microphone, and that doesn’t work because the microphone picks up too much ambient noise and adds too much electronic noise. So, today people have to use tethered headsets if they need good sound quality.
The solution for better sound: microphone arrays
A system of several closely positioned microphones is called a microphone array. Capturing the sound signal at several points allows, with proper processing, for spatial filtering, also called beamforming. That means the sensors and associated processing amplify the signal coming from specific directions (the beam) more than signals from other directions, which are attenuated. The main benefits of using this technique are:
- Reduction of ambient noises.
- Partial de-reverberation, because most indirect paths are attenuated.
- Reduction of the effects of electronic noise.
As a result we have a better signal-to-noise ratio and a drier sound, leading to a much better user experience and a lower error rate in speech recognition.
How does it work?
Microphone array processing, in direct or indirect form, consists of two main procedures: sound source localization and beamforming. The sound source localizer finds where the sound source is and should work reliably under noisy and reverberant conditions. It tells the beamformer where to focus the microphone array “beam”.
Sound Source Localization
There are several approaches to determine the direction to the sound source.
Methods based on time delay estimates (TDE) use the fact that the sound reaches the microphones at slightly different times. The delays are easily computed using the cross-correlation function between the signals from different microphones. Variations of this approach use different weightings (maximum likelihood, PHAT, etc.) to improve the reliability and stability of the results under noise and reverberation.
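As an illustration of the time delay idea, here is a minimal sketch in Python using only the standard library. It uses plain, unweighted cross-correlation (no ML or PHAT weighting) and is not the implementation described on this page. The function names, the 0.1 m microphone spacing, the 16 kHz sample rate and the 343 m/s speed of sound are all assumptions chosen for the example.

```python
import math
import random

def cross_correlation_delay(x, y, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation of x and y.

    A positive result means y lags x, i.e. the sound reached microphone x first.
    """
    n = min(len(x), len(y))
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Correlate x[i] with y[i + lag] over the overlapping samples only.
        start, stop = max(0, -lag), min(n, n - lag)
        score = sum(x[i] * y[i + lag] for i in range(start, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_to_angle(lag, sample_rate, mic_spacing, speed_of_sound=343.0):
    """Convert an inter-microphone delay into a far-field arrival angle in degrees."""
    sin_theta = lag / sample_rate * speed_of_sound / mic_spacing
    sin_theta = max(-1.0, min(1.0, sin_theta))  # guard against estimation noise
    return math.degrees(math.asin(sin_theta))

# Synthetic check: white noise that reaches the second microphone 3 samples later.
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(512)]
true_lag = 3
mic_a = source
mic_b = [0.0] * true_lag + source[:-true_lag]

estimated_lag = cross_correlation_delay(mic_a, mic_b, max_lag=8)
# Hypothetical geometry: two microphones 0.1 m apart, sampled at 16 kHz.
angle = delay_to_angle(estimated_lag, sample_rate=16000, mic_spacing=0.1)
```

With clean synthetic noise the correlation peak is unambiguous. In a real room, reflections add secondary peaks, which is why the weighted variants mentioned above are used.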
As most microphone arrays today have more than two microphones, there are several ways to compute the overall direction. Finding the direction from every possible pair and averaging the results doesn't work well when there is reverberation. The most common method is to test hypotheses for the direction of arrival using the sum of all cross-correlation functions with the proper delays.
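The hypothesis test over candidate directions can be sketched as follows, again in standard-library Python. For each candidate angle, the sketch looks up every pairwise cross-correlation at the delay that angle would produce and sums the values; the angle with the largest sum wins. The linear geometry, far-field model, 5-degree candidate grid and all names are assumptions for illustration, not the localizer used in this project.

```python
import math
import random
from itertools import combinations

SPEED_OF_SOUND = 343.0  # m/s, assumed

def correlation_at_lag(x, y, lag):
    """Cross-correlation of x[i] with y[i + lag] over the overlapping samples."""
    n = min(len(x), len(y))
    return sum(x[i] * y[i + lag] for i in range(max(0, -lag), min(n, n - lag)))

def expected_lag(pos_a, pos_b, angle_deg, sample_rate):
    """Far-field delay, in samples, of mic b relative to mic a on a linear array."""
    tau = (pos_a - pos_b) * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return round(tau * sample_rate)

def localize(signals, positions, sample_rate, candidates):
    """Return the candidate angle with the largest summed pairwise correlation."""
    def score(angle):
        return sum(
            correlation_at_lag(
                signals[a], signals[b],
                expected_lag(positions[a], positions[b], angle, sample_rate))
            for a, b in combinations(range(len(signals)), 2)
        )
    return max(candidates, key=score)

# Synthetic scene: white noise arriving from 30 degrees at four microphones.
random.seed(1)
sample_rate = 48000
positions = [0.0, 0.05, 0.10, 0.15]  # metres, hypothetical geometry
source = [random.gauss(0.0, 1.0) for _ in range(1024)]
lags = [expected_lag(positions[0], p, 30, sample_rate) for p in positions]
offset = -min(lags)  # shift so every simulated delay is non-negative
signals = [[0.0] * (lag + offset) + source[:len(source) - (lag + offset)]
           for lag in lags]

estimated_angle = localize(signals, positions, sample_rate, range(-90, 91, 5))
```

Because all pairs vote for one shared direction hypothesis, a spurious peak in a single pair (for example, from a reflection) has less influence than it would if per-pair directions were averaged.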
Another approach is to steer the beam and to compute the direction based on the maximum output signal. This method gives similar results to time delay estimates with ML weighting.
In all cases post-processing of the sound source localization results is critical. Various methods are used, ranging from particle filtering to real-time clustering. The goal of the post-processor is to remove accidental reflections and reverberations, leaving the results from one or more sound sources.
Our implementation uses a novel approach that is not based on computing cross-correlation functions. The post-processor is a real-time clustering-based algorithm. The chart below shows actual results from the sound source localizer: two people in a conference room talking at 5 and 37 degrees, at a distance of 6 feet, under normal noise and reverberation conditions. The horizontal axis is time in seconds; the vertical axis is angle in degrees. The green dots are the results from the sound source localizer, the blue stars are the post-processor output, and the red crosses show where the capturing beam actually points.
The canonical form of the time-invariant beamformer in the frequency domain is just a weighted sum:
$$Y(f) = \sum_{m=1}^{M} W_m(f)\, X_m(f)$$
where $X_m(f)$ is the signal captured from the m-th microphone, $Y(f)$ is the beamformer output, $W_m(f)$ are the time-invariant, frequency-dependent weights, and $M$ is the number of microphones. With properly designed weights we can aim the beam in a given direction, reducing the ambient noise and reverberation. A sample directivity pattern for a beam pointing at 0 degrees is shown in the next figure.
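To make the weighted sum concrete, the sketch below builds one simple set of weights, the classic delay-and-sum design, and evaluates the resulting directivity pattern at a single frequency. This is only one possible choice of weights and is not the weight design used in this project. The array geometry, the 2 kHz test frequency, the speed of sound and all function names are assumptions.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def steering_vector(positions, angle_deg, freq):
    """Relative phase of a far-field plane wave at each mic of a linear array."""
    s = math.sin(math.radians(angle_deg))
    return [cmath.exp(2j * math.pi * freq * p * s / SPEED_OF_SOUND)
            for p in positions]

def delay_and_sum_weights(positions, look_angle_deg, freq):
    """Weights W_m(f): conjugate steering vector scaled by 1/M."""
    d = steering_vector(positions, look_angle_deg, freq)
    return [v.conjugate() / len(d) for v in d]

def beamform(weights, spectra):
    """Y(f) = sum over m of W_m(f) * X_m(f), for a single frequency bin."""
    return sum(w * x for w, x in zip(weights, spectra))

def directivity(positions, weights, freq, angles_deg):
    """Magnitude response of the weighted sum to a unit plane wave per angle."""
    return [abs(beamform(weights, steering_vector(positions, a, freq)))
            for a in angles_deg]

positions = [-0.075, -0.025, 0.025, 0.075]  # metres, hypothetical geometry
freq = 2000.0                               # Hz, arbitrary test frequency
weights = delay_and_sum_weights(positions, 0.0, freq)
angles = list(range(-90, 91, 30))
pattern = directivity(positions, weights, freq, angles)
```

The response is 1 in the look direction (0 degrees here) and smaller elsewhere. Repeating the computation over a grid of frequencies and angles would give a full directivity pattern of the kind referred to above.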
While fast and reliable, the time-invariant beamformer has limited performance (noise suppression and directivity index). More sophisticated adaptive beamforming algorithms are known in the research community. We have designed ours as an adaptive spatial filter running after the beamformer.
Our experimental work was done on two USB microphone arrays: a four-element linear array and an eight-element circular array. Both microphone arrays use unidirectional cardioid microphones.
The four-element linear array is designed to sit on the upper bezel of the monitor, works in a ±50° range, and covers one office. It is 195 mm long.
The circular array is designed for the center of a conference room table, covers the full 360°, and is intended for capturing meetings. It has a diameter of 160 mm.
Microphone Array support in Windows Vista
And yes, Windows Vista shipped with integrated microphone array support. For more details, see our presentation at WinHEC 2005 in Seattle and Taipei. More information can be found in our white papers.
Questions about adoption and usage of this technology can be sent to micarrex--at--microsoft--dot--com.
- Mark R. P. Thomas, Hannes Gamper, and Ivan J. Tashev, BFGUI: An Interactive Tool for the Synthesis and Analysis of Microphone Array Beamformers, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE – Institute of Electrical and Electronics Engineers, Shanghai, China, March 2016.
- Ivan Tashev, Kinect Development Kit: A Toolkit for Gesture- and Speech-Based Human-Machine Interaction, in Signal Processing Magazine, IEEE, September 2013.
- Larry Heck, Dilek Hakkani-Tur, Madhu Chinthakunta, Gokhan Tur, Rukmini Iyer, Partha Parthasarathy, Lisa Stifelman, Elizabeth Shriberg, and Ashley Fidler, Multimodal Conversational Search and Browse, IEEE Workshop on Speech, Language and Audio in Multimedia, August 2013.
- Kenichi Kumatani, Takayuki Arakawa, Kazumasa Yamamoto, John McDonough, Bhiksha Raj, Rita Singh, and Ivan Tashev, Microphone Array Processing for Distant Speech Recognition: Towards Real-World Deployment, in APSIPA Annual Summit and Conference, Hollywood, CA, USA, 5 December 2012.
- Flavio Ribeiro, Dinei Florencio, Demba Ba, and Cha Zhang, Geometrically Constrained Room Modeling with Compact Microphone Arrays, in IEEE Transactions on Audio, Speech, and Language Processing, IEEE, July 2012.
- Ivan J. Tashev, Audio for Kinect: pushing it to the limit (invited talk), in CREST Symposium on Human-Harmonized Information Technology, University of Kyoto, 2 April 2012.
- Ivan J. Tashev, Optimizing Kinect: Audio and Acoustics, in Information Technologies and Applications Workshop, University of California - San Diego, 8 February 2012.
- Ivan J. Tashev, Audio for Kinect: Nearly Impossible (invited talk), in IEEE International Conference on Emerging Signal Processing Applications, IEEE SPS, 14 January 2012.
- Sven Nordholm, Thushara Abhayapala, Simon Doclo, Sharon Gannot, Patrick Naylor, and Ivan Tashev, Microphone Array Speech Processing, in EURASIP Journal on Advances in Signal Processing, HINDAWI, 16 September 2010.
- Lae-Hoon Kim, Ivan Tashev, and Alex Acero, Reverberated Speech Signal Separation Based on Regularized Subband Feedforward ICA and Instantaneous Direction of Arrival, in International Conference on Acoustics, Speech and Signal Processing, IEEE, 16 March 2010.