This project covers audio processing techniques and algorithms. On the capture side it includes speech enhancement, acoustic echo cancellation, de-reverberation, and microphone array processing. On the playback side it covers loudspeaker arrays, spatial sound, and sound picture formation.
This topic includes denoising algorithms. In most cases these apply a time-varying real gain to the short-term frequency transform of an audio frame extracted from the input signal. This class of algorithms is known as noise suppression. Various criteria for estimating this suppression gain (the suppression rule) have been derived over the years: magnitude minimum mean square error (Wiener, 1949), spectral subtraction (Boll, 1979), maximum likelihood (McAulay and Malpass, 1980), short-term MMSE (Ephraim and Malah, 1984), log-MMSE (Ephraim and Malah, 1985), etc.
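As a minimal sketch of this class of algorithms, a single frame processed with the Wiener suppression rule might look like the following. The function name is illustrative, and the crude instantaneous SNR estimate stands in for the more elaborate estimators the rules above actually use:

```python
import numpy as np

def wiener_suppress(frame, noise_psd, floor=0.1):
    """Apply a time-varying real gain (Wiener rule) to one audio frame.

    frame     : time-domain samples of the current frame
    noise_psd : estimated noise power per frequency bin
    floor     : minimum gain, limits musical-noise artifacts
    """
    spec = np.fft.rfft(frame)
    power = np.abs(spec) ** 2
    # Crude instantaneous SNR estimate: a-posteriori SNR minus one, clipped at zero
    snr = np.maximum(power / noise_psd - 1.0, 0.0)
    # Wiener suppression rule, limited from below by the gain floor
    gain = np.maximum(snr / (1.0 + snr), floor)
    return np.fft.irfft(gain * spec, n=len(frame))
```

Bins dominated by speech get a gain near one; bins dominated by noise are pushed down toward the floor.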
In other cases the unwanted signal can be reduced using prior knowledge (a harmonic structure, for example, allows predicting the signal in the next frame), or even knowledge of a filtered version of the noise itself, in which case we talk about noise cancellation algorithms.
Our point is that what matters is how the output sounds to humans (if the target is human ears) or how much the word error rate is reduced (if the target is a speech recognizer). We use psychoacoustic criteria and a large signal corpus to tune the algorithms to maximize perceptual sound quality.
Microphone array processing
A single microphone captures too much noise and reverberation, especially in noisy environments. One solution for capturing better sound is to use more microphones. The signals from these microphones, combined in a certain way, increase the directivity of the device and reduce the captured noise and reverberation. Using multiple microphones also allows localizing the sound source and pointing the maximum of the directivity pattern toward the desired sound source.
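The simplest such combination is a delay-and-sum beamformer: each microphone signal is delayed so that the look direction adds coherently, then the signals are averaged. A minimal sketch for a linear array (the function and its parameters are illustrative, not our production design):

```python
import numpy as np

def delay_and_sum(signals, mic_positions, angle_deg, fs, c=343.0):
    """Steer a linear microphone array toward angle_deg by delay-and-sum.

    signals       : (num_mics, num_samples) array of microphone signals
    mic_positions : microphone x-coordinates in meters (linear array)
    angle_deg     : look direction, 90 degrees = broadside
    fs            : sampling rate in Hz
    c             : speed of sound in m/s
    """
    num_mics, n = signals.shape
    # Far-field propagation delay of each microphone for the look direction
    delays = np.asarray(mic_positions) * np.cos(np.radians(angle_deg)) / c
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    out = np.zeros(n)
    for sig, d in zip(signals, delays):
        # Compensate the delay with a linear phase shift in the frequency domain
        shifted = np.fft.irfft(np.fft.rfft(sig) * np.exp(2j * np.pi * freqs * d), n=n)
        out += shifted
    return out / num_mics
```

Sound arriving from the look direction sums coherently, while noise and reflections from other directions add with random phases and are attenuated.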
Initially we designed this technology for office and conference room scenarios. Our technology is part of Windows Vista. More details can be found on the Microphone array project web page.
Using multiple microphones can also improve the sound capture quality of headsets and small form factor devices such as cell phones and portable media players. Classic beamforming is less effective due to the smaller size of the array, which requires employing additional spatial suppression techniques. In the following picture a head-and-torso simulator is wearing a headset with a three-element microphone array.
What is reverberation and why does it hurt?
When a sound source is placed in a closed room or near sound-reflecting surfaces, the listener receives not only the direct wave but also multiple reflected waves. This smears the speech features, makes the speech less intelligible for humans, and reduces the recognition rate of speech recognition engines. Therefore, for best speech recognition results, users are forced to use headsets with close-talk microphones.
Dereverberation as deconvolution
Nearly every approach assumes a convolutional model for the effects of reverberation. It is then logical to try to undo those effects by deconvolution (inverse filtering). This can be done in a mathematically exact way only if the room response is minimum phase, i.e. it is causal and invertible and its inverse is causal. As this is not true in most cases, the deconvolution function is usually an approximation. Estimating the room response is even more difficult in the presence of noise.
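To illustrate why the minimum-phase condition matters, here is a small sketch (the helper names are hypothetical) that inverts an FIR response by polynomial long division. The truncated inverse converges only when all zeros of the response lie inside the unit circle; otherwise it diverges:

```python
import numpy as np

def inverse_filter(h, length):
    """Truncated impulse response of 1/H(z) via polynomial long division.

    The result is a stable, causal inverse only when h is minimum phase,
    i.e. all zeros of H(z) lie strictly inside the unit circle.
    """
    g = np.zeros(length)
    g[0] = 1.0 / h[0]
    for n in range(1, length):
        acc = 0.0
        # Recursion: sum_{k>=1} h[k] * g[n-k] must cancel against h[0] * g[n]
        for k in range(1, min(n, len(h) - 1) + 1):
            acc += h[k] * g[n - k]
        g[n] = -acc / h[0]
    return g

def is_minimum_phase(h):
    """Check whether all zeros of the response lie inside the unit circle."""
    return bool(np.all(np.abs(np.roots(h)) < 1.0))
```

For h = [1, 0.5] (minimum phase) the inverse decays geometrically and convolving h with it yields a near-perfect impulse; for h = [0.5, 1] (a zero at z = -2) the same recursion blows up.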
Blind dereverberation methods seek to estimate the input signal without explicitly computing a deconvolution or inverse filter. Some methods use probabilistic speech models or even Independent Component Analysis (ICA).
Dereverberation via suppression and enhancement
This approach tries to remove the reverberation effects using methods borrowed from noise suppression and speech enhancement. These algorithms try to suppress the reverberation, to enhance the speech, or both. In contrast to blind algorithms, however, there is no explicit source signal estimation. Rather, the waveform is processed to reduce the negative effects of reverberation and to enhance the qualities of the captured waveform.
Our approach and initial results for ASR
The initial goal of our dereverberation project is to improve the speech recognition results from our microphone array at distances up to 3 feet and to make them as close as possible to those of a close-talk microphone. Most modern ASR systems have Cepstral Mean Normalization (CMN) in the front end. The purpose of this processing is to compensate the frequency response of the capturing channel, but, due to its relatively fast adaptation time, it also successfully compensates the early reverberation, up to about 50 ms. Beyond that point the rate of arriving reflections exceeds the sampling rate, turning the reverberation into a stochastic process. Estimating the room response would not give good results under these conditions, so we chose to do dereverberation via suppression.
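The channel-compensation property of CMN is easy to sketch: a fixed channel multiplies the spectrum, hence adds a constant to the log-spectrum and cepstrum of every frame, so subtracting the per-utterance mean of each cepstral coefficient removes the channel term. Illustrative code only, not an actual ASR front end:

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean from each cepstral coefficient.

    cepstra : (num_frames, num_coeffs) array of cepstral features.
    A constant channel adds the same offset vector to every frame, so
    removing the time average of each coefficient removes the channel.
    """
    return cepstra - np.mean(cepstra, axis=0, keepdims=True)
```

Applying CMN to features distorted by any constant channel offset gives exactly the same normalized features as applying it to the clean ones, which is why a slowly varying early-reverberation tail is absorbed as well.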
Initial results are shown in the next charts. A test set of ~3000 utterances was recorded using a close-talk microphone, a regular analog PC microphone, and the four-element microphone array in a conference room at distances of 1.0 and 2.5 meters. The sound was played through a B&K mouth simulator.
Loudspeaker arrays
This is a pure research project for now. We decided to see what could be reused from our experience in beamforming design for microphone arrays. The loudspeaker array consists of sixteen inexpensive speakers in a linear geometry. The project was demonstrated during Microsoft Research TechFest 2007 as "Personal Audio Space" and definitely had a "Wow!" effect on the visitors in our booth. We demonstrated focusing the sound in a given area, as well as a dual-beam mode in which you hear one music channel in one place and a second music channel in another. The attending journalists liked the demo and it was widely covered in the press: WIRED Blog Network, Seattle PI, MIT Technology Review, the MSR web site, and many others, in different languages and from different countries. Currently we are exploring various scenarios and potential applications for this technology.
Multichannel Acoustic Echo Cancellation
Acoustic echo cancellation (AEC) is part of every communication device working in speakerphone mode. It estimates the transfer path between the loudspeaker and the microphone and then subtracts the signal sent to the loudspeaker from the signal captured by the microphone. This prevents annoying echoes and feedback in the closed-loop communication system. The AEC employs an adaptive filter, typically using the NLMS algorithm, but in some cases RLS or affine projection (AP, or its fast version FAP) algorithms.
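A minimal sketch of such an echo canceller using NLMS (the function name, step size, and filter length are illustrative assumptions, not our implementation):

```python
import numpy as np

def nlms_aec(far_end, mic, num_taps=64, mu=0.5, eps=1e-8):
    """NLMS adaptive filter: estimate the loudspeaker-to-microphone path
    and subtract the predicted echo from the microphone signal.

    far_end : signal sent to the loudspeaker (reference)
    mic     : signal captured by the microphone (echo + near-end speech)
    Returns the echo-cancelled (error) signal and the filter estimate.
    """
    w = np.zeros(num_taps)          # current estimate of the echo path
    x = np.zeros(num_taps)          # sliding window of far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = far_end[n]
        echo_hat = w @ x            # predicted echo
        e = mic[n] - echo_hat       # residual after subtraction
        out[n] = e
        # Normalized step keeps adaptation stable regardless of input level
        w += mu * e * x / (x @ x + eps)
    return out, w
```

With a white far-end signal and no near-end speech, the filter converges to the true echo path and the residual drops well below the echo level.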
Stereo, and later multichannel (for surround sound systems), acoustic echo cancellation is not a trivial problem and has sparked a lot of interest in the research community. The problem is that the stereo channels are highly correlated, which leads to an infinite number of solutions. Only one of them is the real echo path; for all the others the AEC has to readapt whenever something changes in the stereo signal.
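The ambiguity is easy to demonstrate: if the right channel is a filtered copy of the left one, two different pairs of echo-path filters predict exactly the same microphone signal. All signals and filters below are made-up illustrations:

```python
import numpy as np

# Two stereo loudspeaker signals that are fully correlated:
# the right channel is a filtered copy of the left one.
rng = np.random.default_rng(0)
x_left = rng.standard_normal(1000)
g = np.array([0.8, 0.2])                      # inter-channel filter (assumed)
x_right = np.convolve(x_left, g)[:1000]

# One pair of echo paths from the loudspeakers to the microphone (assumed)
h_left = np.array([0.5, 0.1, 0.0])
h_right = np.array([0.3, 0.0, 0.05])

def echo(hl, hr):
    """Microphone signal predicted by the filter pair (hl, hr)."""
    return np.convolve(x_left, hl)[:1000] + np.convolve(x_right, hr)[:1000]

# A second, different pair: move a component d from the right path into
# the left path through the inter-channel filter g. Because
# x_right = x_left * g, the predicted echo is unchanged.
d = np.array([0.2, -0.1, 0.0])
h_left_alt = h_left + np.convolve(g, d)[:3]
h_right_alt = h_right - d
```

Both pairs cancel the echo equally well, so the adaptive filter has no way to prefer the true one until the inter-channel relationship changes, which is exactly when it must readapt.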
We demonstrated multichannel acoustic echo cancellation and sound capture with an eight-element circular array during Microsoft Research TechFest 2008. Our approach is based on starting the adaptation process at the right solution and reducing the degrees of freedom of the adaptive system so that it stays there. Our setup is shown in the next picture.
Potential applications of multichannel acoustic echo cancellation include voice control of multimedia systems, advanced communication systems, and gaming scenarios. A video of this demo can be found here.
The anechoic chamber
In our audio projects we actively use the anechoic chamber for measuring the directivity patterns of microphones and microphone arrays, generating reference recordings, measuring the sound field around loudspeaker arrays, etc. The chamber provides a very low noise and low reverberation environment. It is part of the new building Microsoft Research moved into at the end of 2007 and became one of the places that attract visitors and journalists. MIT's Technology Review, Computerworld, Wired.com, and some bloggers posted articles or notes about our audio facility.
Publications
- Jinkyu Lee and Ivan Tashev, High-level Feature Representation using Recurrent Neural Network for Speech Emotion Recognition, in Interspeech 2015, ISCA - International Speech Communication Association, 8 September 2015.
- Ivan J. Tashev, Offline Voice Activity Detector Using Speech Supergaussianity, in Information Theory and Applications Workshop, University of California - San Diego, 3 February 2015.
- Kun Han, Dong Yu, and Ivan Tashev, Speech Emotion Recognition Using Deep Neural Network and Extreme Learning Machine, in Interspeech 2014, September 2014.
- Ivan J. Tashev, Technological Trends in Natural User Interfaces, in Computing Now, vol. 7, no. 9, pp. [online], IEEE – Institute of Electrical and Electronics Engineers, September 2014.
- Ivan Dokmanic and Ivan Tashev, Hardware and Algorithms for Ultrasonic Depth Imaging, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 9 May 2014.
- Ivan Tashev, HRTF Phase Synthesis via Sparse Representation of Anthropometric Features, in Information Theory and Applications Workshop, University of California - San Diego, 13 February 2014.
- Ivan Tashev, Kinect Development Kit: A Toolkit for Gesture- and Speech-Based Human-Machine Interaction, in Signal Processing Magazine, IEEE, September 2013.
- Edmund Lalor, Nima Mesgarani, Siddharth Rajaram, Adam O'Donovan, James Wright, Inyong Choi, Jonathan Brumberg, Nai Ding, Adrian KC Lee, Nils Peters, Sudarshan Ramenahalli, Jeffrey Pompe, Barbara Shinn-Cunningham, Malcolm Slaney, and Shihab Shamma, Decoding Auditory Attention (in Real Time) with EEG, in Proceedings of the 37th ARO MidWinter Meeting, Association for Research in Otolaryngology (ARO), 17 February 2013.
- Ivan Tashev and Malcolm Slaney, Data Driven Suppression Rule for Speech Enhancement, in Information Theory and Applications Workshop, University of California - San Diego, 14 February 2013.
- Kenichi Kumatani, Takayuki Arakawa, Kazumasa Yamamoto, John McDonough, Bhiksha Raj, Rita Singh, and Ivan Tashev, Microphone Array Processing for Distant Speech Recognition: Towards Real-World Deployment, in APSIPA Annual Summit and Conference, Hollywood, CA, USA, 5 December 2012.