I hope you've found this page because we are interested in the same things. This is a picture of me when I lived in a tiny condo during graduate school. That's where I learned to enjoy linear algebra, information theory, and machine learning, which led directly to a career in digital signal processing and automatic speech recognition. More information about my professional life appears below. For those interested in my personal life, navigate to my Droppo family web site.
- Feature transformation
- Speech and acoustic digital signal processing
- Noise-robust speech recognition
- Low-bandwidth robust cepstral transport
- Time-frequency representations
- Nonstationary signal modeling and classification
I have been with Microsoft Research since July, 2000. My primary task has been to explore different techniques to make ASR more robust to additive and channel noise. Other projects I've worked on include general speech signal enhancement, pitch tracking, multiple stream ASR, novel speech recognition features, MiPad multimodal interface, cepstral compression and transport, and the WITTY microphone.
The SPLICE project was successful in building a more robust speech recognition system. Working jointly with Alex Acero and Li Deng, we were able to get amazing results on the noisy Aurora2 corpus. But, there were fundamental problems with the approach.
The model-based feature enhancement project was meant to address the stereo-data requirement for SPLICE. The model describes how speech and noise (and noisy channels) interact to corrupt the speech features. It can be used to either enhance the features before recognition, or to adapt the recognizer's model at run-time.
Lately, I've moved to more general non-parametric warpings of the feature space. We're trying to learn transformations that improve recognition performance in both noisy and clean conditions.
I earned my Ph.D. in Electrical Engineering at the University of Washington's Interactive Systems Design Laboratory in June of 2000. Early in my studies, I helped to develop a discrete theory for time-frequency representations of non-stationary audio signals. The application of this theory to speech recognition was the core of my thesis, "Time-Frequency Representations for Speech Recognition." Other projects I worked on during this time included a GMM-based speaker verification system, subliminal audio message encoding, and non-linear signal morphing.
My MSEE was also earned at the University of Washington, in 1996. During this time, I worked on a project to develop and build an acoustic pyrometer. The device probes the fireball within a coal-fired electrical plant with several sound pressure waves, and determines a temperature profile based on acoustic time of flight measurements. My thesis described the algorithms and techniques developed to make such a device feasable.
I earned my BSEE from Gonzaga University in Spokane, in 1994. My final project consisted of building a control system for a high speed dot-matrix printer. Fuzzy logic was popular at the time, so we used it as the basis of our system. I learned two important lessons from the project, both of which were probably unintentional. First lesson: stay away from fads. Whereas a conventional design would have been well understood and easy to implement with a guaranteed minimum level of performance, our fuzzy controller needed a lot of background work and experimentation to get working correctly. Second lesson: embrace fads. I wrote a paper comparing and contrasting the behavior of fuzzy controlers to linear controllers, and recieved first prize in the region's IEEE paper contest.
This code enables training and evaluation of a switching linear dynamic model for enhancing cepstral streams for automatic speech recognition, as described in our ICASSP 2004 paper, "Noise Robust Speech Recognition with a Switching Linear Dynamic Model."
This archive consists of a set of pitch period and voicing estimates for utterances found in the Aurora 2 corpus using the algorithm described in . Currently, pitch estimates are available for test sets A and B, as well as the clean training data.  H. G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy condidions", in ISCA ITRW ASR2000 "Automatic Speech Recognition: Challenges for the Next Millennium", Paris, France, September 2000.  J. Droppo and A. Acero. Maximum a Posteriori Pitch Tracking, in Proc. of the Int. Conf. on Spoken Language Processing. Sydney, Australia. Dec 1998.
- Abdel-rahman Mohamed, Frank Seide, Dong Yu, Jasha Droppo, Andreas Stolcke, Geoffrey Zweig, and Gerald Penn, Deep Bi-directional Recurrent Networks Over Spectral Windows, in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, IEEE – Institute of Electrical and Electronics Engineers, December 2015.
Long short-term memory (LSTM) acoustic models have recently achieved state-of-the-art results on speech recognition tasks. As a type of recurrent neural network, LSTMs potentially have the ability to model long-span phenomena relating the spectral input to linguistic units. However, it has not been clear whether their observed performance is actually due to this capability, or instead if it is due to a better modeling of short term dynamics through the recurrence. In this paper. we answer this question by applying a windowed (truncated) LSTM to conversational speech transcription, and find that a limited context is adequate, and that it is not necessaary to scan the entire utterance.
The sliding window approach allows not only incremental (online) recognition with a bidirectional model, but also frame-wise randomization (as opposed to utterance randomization), which results in faster convergence.
On the SWBD/Fisher corpus, applying bidirectional LSTM RNNs to spectral windows of about 0.5s improves WER on the Hub5’00 benchmark set by 16% relative compared to our best sequence-trained DNN. On an extended 3850h training set that that also includes lectures, the relative gain becomes 28% (Hub5’00 WER 9.2%). In-house conversational data improves by 12 to 17% relative.
- Chao Weng, Dong Yu, Michael L. Seltzer, and Jasha Droppo, Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition, in IEEE/ACM Transactions on Audio, Speech, and Language Processing, IEEE – Institute of Electrical and Electronics Engineers, October 2015.
We investigate techniques based on deep neural
networks (DNNs) for attacking the single-channel multi-talker
speech recognition problem. Our proposed approach contains
ﬁve key ingredients: a multi-style training strategy on artiﬁcially
mixed speech data, a separate DNN to estimate senone posterior
probabilities of the louder and softer speakers at each frame, a
WFST-based two-talker decoder to jointly estimate and correlate
the speaker and speech, a speaker switching penalty estimated
from the energy pattern change in the mixed-speech, and a
conﬁdence based system combination strategy. Experiments on
the 2006 speech separation and recognition challenge task demon-
strate that our proposed DNN-based system has remarkable noise
robustness to the interference of a competing speaker. The best
setup of our proposed systems achieves an average word error
rate (WER) of 18.8% across different SNRs and outperforms the
state-of-the-art IBM superhuman system by 2.8% absolute with
- Ritwik Giri, Michael L. Seltzer, Jasha Droppo, and Dong Yu, IMPROVING SPEECH RECOGNITION IN REVERBERATION USING A ROOM-AWARE DEEP NEURAL NETWORK AND MULTI-TASK LEARNING, IEEE – Institute of Electrical and Electronics Engineers, April 2015.
In this paper, we propose two approaches to improve deep neural network (DNN) acoustic models for speech recognition in reverberant environments. Both methods utilize auxiliary information in training the DNN but differ in the type of information and the manner in which it is used. The first method uses parallel training data for multi-task learning, in which the network is trained to perform both a primary senone classification task and a secondary feature enhancement task using a shared representation. The second method uses a parameterization of the reverberant environment extracted from the observed signal to train a room-aware DNN. Experiments were performed on the single microphone task of the REVERB Challenge corpus. The proposed approach obtained a word error rate of 7.8% on the SimData test set, which is lower than all reported systems using the same training data and evaluation conditions, and 27.5% on the mismatched RealData test set, which is lower than all but two systems.
- Yu Zhang, Dong Yu, Michael L. Seltzer, and Jasha Droppo, SPEECH RECOGNITION WITH PREDICTION-ADAPTATION-CORRECTION RECURRENT NEURAL NETWORKS, IEEE – Institute of Electrical and Electronics Engineers, April 2015.
We propose the prediction-adaptation-correction RNN (PAC-RNN), in which a correction DNN estimates the state posterior probability based on both the current frame and the prediction made on the past frames by a prediction DNN. The result from the main DNN is fed back to the prediction DNN to make better predictions for the future frames. In the PAC-RNN, we can consider that, given the new, current frame information, the main DNN makes a correction on the prediction made by the prediction DNN. Alternatively, it can be viewed as adapting the main DNN’s behavior based on the prediction DNN’s prediction. Experiments on the TIMIT phone recognition task indicate that the PAC-RNN outperforms DNN, RNN, and LSTM with 2.4%, 2.1%, and 1.9% absolute phone accuracy improvement, respectively. We found that incorporating the prediction objective and including the recurrent loop are both important to boost the performance of the PAC-RNN.
- Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu, 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs, in Interspeech 2014, September 2014.
We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs.
We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain.
For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k frames per second (kfps) when using 2880 samples per minibatch, and 51kfps with 16k, on a server with 8 K20X GPUs. This corresponds to speed-ups over a single GPU of 3.6 and 6.3, respectively. 7 training passes over 309h of data complete in under 7h. A 160M-parameter model training processes 3300h of data in under 16h on 20 dual-GPU servers—a 10 times speed-up—albeit at a small accuracy loss.
- Amit Agarwal, Eldar Akchurin, Chris Basoglu, Guoguo Chen, Scott Cyphers, Jasha Droppo, Adam Eversole, Brian Guenter, Mark Hillebrand, Ryan Hoens, Xuedong Huang, Zhiheng Huang, Vladimir Ivanov, Alexey Kamenev, Philipp Kranen, Oleksii Kuchaiev, Wolfgang Manousek, Avner May, Bhaskar Mitra, Olivier Nano, Gaizka Navarro, Alexey Orlov, Marko Padmilac, Hari Parthasarathi, Baolin Peng, Alexey Reznichenko, Frank Seide, Michael L. Seltzer, Malcolm Slaney, Andreas Stolcke, Yongqiang Wang, Huaming Wang, Kaisheng Yao, Dong Yu, Yu Zhang, and Geoffrey Zweig, An Introduction to Computational Networks and the Computational Network Toolkit, no. MSR-TR-2014-112, August 2014.
We introduce computational network (CN), a unified framework for describing arbitrary learning machines, such as deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short term memory (LSTM), logistic regression, and maximum entropy model, that can be illustrated as a series of computational steps. A CN is a directed graph in which each leaf node represents an input value or a parameter and each non-leaf node represents a matrix operation upon its children. We describe algorithms to carry out forward computation and gradient calculation in CN and introduce most popular computation node types used in a typical CN. We further introduce the computational network toolkit (CNTK), an implementation of CN that supports both GPU and CPU. We describe the architecture and the key components of the CNTK, the command line options to use CNTK, and the network definition and model editing language, and provide sample setups for acoustic model, language model, and spoken language understanding. We also describe the Argon speech recognition decoder as an example to integrate with CNTK.
- Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu, On Parallelizability of Stochastic Gradient Descent for Speech DNNs, in ICASSP, IEEE SPS, May 2014.
This paper compares the theoretical ef?ciency of model-par- allel and data-parallel distributed stochastic gradient descent training of DNNs. For a typical Switchboard DNN with 46M parameters, the results are not pretty: With modern GPUs and interconnects, model parallelism is optimal with only 3 GPUs in a single server, while data parallelism with a minibatch size of 1024 does not even scale to 2 GPUs.
We further show that data-parallel training ef?ciency can be improved by increasing the minibatch size (through a com- bination of AdaGrad and automatic adjustments of learning rate and minibatch size) and data compression. We arrive at an estimated possible end-to-end speed-up of 5 times or more.
We do not address issues of robustness to process failure or other issues that might occur during training, nor of speed of convergence differences between ASGD and SGD param- eter update patterns.
- Nicolas Boulanger-Lewandowski, Jasha Droppo, Mike Seltzer, and Dong Yu, Phone sequence modeling with recurrent neural networks, in ICASSP, IEEE SPS, May 2014.
In this paper, we investigate phone sequence modeling with recurrent neural networks in the context of speech recognition. We introduce a hybrid architecture that combines a phonetic model with an arbi- trary frame-level acoustic model and we propose ef?cient algorithms for training, decoding and sequence alignment. We evaluate the ad- vantage of our phonetic model on the TIMIT and Switchboard-mini datasets in complementarity to a powerful context-dependent deep neural network (DNN) acoustic classi?er and a higher-level 3-gram language model. Consistent improvements of 2–10% in phone accu- racy and 3% in word error rate suggest that our approach can readily replace HMMs in current state-of-the-art systems.
- Chao Weng, Dong Yu, Mike Seltzer, and Jasha Droppo, Single-channel Mixed Speech Recognition Using Deep Neural Networks, in ICASSP, IEEE SPS, May 2014.
In this work, we study the problem of single-channel mixed speech recognition using deep neural networks (DNNs). Using a multi-style training strategy on arti?cially mixed speech data, we investigate several different training setups that enable the DNN to generalize to corresponding similar patterns in the test data. We also introduce a WFST-based two-talker decoder to work with the trained DNNs. Experiments on the 2006 speech separation and recogni- tion challenge task demonstrate that the proposed DNN-based sys- tem has remarkable noise robustness to the interference of a com- peting speaker. The best setup of our proposed systems achieves an overall WER of 19.7% which improves upon the results obtained by the state-of-the-art IBM superhuman system by 1.9% absolute, with fewer assumptions and lower computational complexity.
- R. Prabhavalkar and Jasha Droppo, A Chunk-Based Phonetic Score for Mobile Voice Search, IEEE International Confrence on Acoustics, Speech, and Signal Processing (ICASSP), March 2012.
We propose a chunk-based phonetic score for re-scoring word hypotheses for the mobile voice search task. The score is based on a novel technique for aligning decoded phone sequences with forced-alignments of hypothesized word sequences and exploits phone-boundary timing information. In experimental results, we find that the proposed approach results in relative a word error rate reduction of 4.4% and a relative sentence error rate reduction of 2.3% for the Windows live search for mobile task.
- Yun-Cheng Ju and Jasha Droppo, Automatically Optimizing Utterance Classification Performance without Human in the Loop, in Interspeech, International Speech Communication Association, 28 August 2011.
The Utterance Classification (UC) method has become a developer’s choice over traditional Context Free Grammars (CFGs) for voice menus in telephony applications. This data driven method achieves higher accuracy and has great potential to utilize a huge amount of labeled training data. But, having a human manually label the training data can be expensive. This paper provides a robust recipe for training a UC system using inexpensive acoustic data with limited transcriptions or semantic labels. It also describes two new algorithms that use caller confirmation, which naturally occurred within a dialog, to generate pseudo semantic labels. Experimental results show that, after having sufficient labeled data to achieve a reasonable accuracy, both of our algorithms can use unlabeled data to achieve the same performance as a system trained with labeled data, while completely eliminating the need for human supervision.
- Brian Hutchinson and Jasha Droppo, Learning Non-Parametric Models of Pronunciation, in Proceedings of ICASSP, IEEE SPS, 23 May 2011.
As more data becomes available for a given speech recognition task, the natural way to improve recognition accuracy is to train larger models. But, while this strategy yields modest improvements to small systems, the relative gains diminish as the data and models grow. In this paper, we demonstrate that abundant data allows us to model patterns and structure that are unaccounted for in standard systems. In particular, we model the systematic mismatch between the canonical pronunciations of words and the actual pronunciations found in casual or accented speech. Using a combination of two simple data-driven pronunciation models, we can correct 5.2% of the errors in our mobile voice search application.
- Xing Fan, Michael Seltzer, Jasha Droppo, Henrique Malvar, and Alex Acero, Joint Encoding of the Waveform and Speech Recognition Features Using a Transform Codec, in International Conference on Acoustics, Speech and Signal Processing, Institute of Electrical and Electronics Engineers, Inc., May 2011.
We propose a new transform speech codec that jointly encodes a wideband waveform and its corresponding wideband and narrowband speech recognition features. For distributed speech recognition, wideband features are compressed and transmitted as side information. The waveform is then encoded in a manner that exploits the information already captured by the speech features. Narrowband speech acoustic features can be synthesized at the server by applying a transformation to the decoded wideband features. An evaluation conducted on an in-car speech recognition task show that at 16 kbps our new system typically shows essentially no impact in word error rate compared to uncompressed audio, whereas the standard transform codec produces up to a 20% increase in word error rate. In addition, good quality speech is obtained for playback and transcription, with PESQ scores ranging from 3.2 to 3.4.
- y. c. ju and jasha droppo, Spontaneous Mandarin speech understanding using utterance classification: a case study, in International Symposium on Chinese Spoken Language Processing, International Speech Communication Association, December 2010.
As speech recognition matures and becomes more practical in commercial English applications, localization has quickly become the bottleneck for having more speech features. Not only are some technologies highly language dependent, there are simply not enough speech experts in the large number of target languages to develop the data modules and investigate potential performance related issues. This paper shows how data driven methods like Utterance Classification (UC) successfully address these major issues. Our experiments demonstrate that UC performs as well as or better than hand crafted Context Free Grammars (CFGs) for spontaneous Mandarin speech understanding, even when applied without linguistic knowledge. We also discuss two pragmatic modifications of the UC algorithm adopted to handle multiple choice answers and to be more robust to feature selections.
- Geoffrey Zweig, Patrick Nguyen, Jasha Droppo, and Alex Acero, Continuous Speech Recognition with a TF-IDF Acoustic Model, International Speech Communication Association, September 2010.
Information retrieval methods are frequently used for indexing and retrieving spoken documents, and more recently have been proposed for voice-search amongst a pre-defined set of business entries. In this paper, we show that these methods can be used in an even more fundamental way, as the core component in a continuous speech recognizer. Speech is initially processed and represented as a sequence of discrete symbols, specifically phoneme or multi-phone units. Recognition then operates on this sequence. The recognizer is segment-based, and the acoustic score for labeling a segment with a word is based on the TF-IDF similarity between the subword units detected in the segment, and those typically seen in association with the word. We present promising results on both a voice search task and the Wall Street Journal task. The development of this method brings us one step closer to being able to do speech recognition based on the detection of sub-word audio attributes.
- Jasha Droppo and Alex Acero, Context Dependent Phonetic String Edit Distance for Automatic Speech Recognition, in ICASSP, IEEE, March 2010.
An automatic speech recognition system searches for the word transcription with the highest overall score for a given acoustic observation sequence. This overall score is typically a weighted combination of a language model score and an acoustic model score. We propose including a third score, which measures the similarity of the word transcription's pronunciation to the output of a less constrained phonetic recognizer. We show how this phonetic string edit distance can be learned from data, and that including context in the model is essential for good performance. We demonstrate improved accuracy on a business search task.
- Xiaoqiang Xiao, Jasha Droppo, and Alex Acero, Information Retrieval Methods for Automatic Speech Recognition, in ICASSP, IEEE, March 2010.
In this paper, we use information retrieval (IR) techniques to improve a speech recognition (ASR) system. The potential benefits include improved speed, accuracy, and scalability. Where conventional HMM-based speech recognition systems decode words directly, our IR-based system first decodes subword units. These are then mapped to a target word by the IR system. In this decoupled system, the IR serves as a lightweight, data-driven pronunciation model. Our proposed method is evaluated in the Windows Live Search for Mobile (WLS4M) task, and our best system has 12% fewer errors than a comparable HMM classifier. We show that even using an inexpensive IR weighting scheme (TF-IDF) yields a 3% relative error rate reduction while maintaining all of the advantages of the IR approach.
- Jasha Droppo and Alex Acero, Experimenting with a Global Decision Tree for State Clustering in Automatic Speech Recognition Systems, in ICASSP 2009, IEEE, April 2009.
In modern automatic speech recognition systems, it is standard practice to cluster several logical hidden Markov model states into one physical, clustered state. Typically, the clustering is done such that logical states from different phones or different states can not share the same clustered state. In this paper, we present a collection of experiments that lift this restriction. The results show that, for Aurora 2 and Aurora 3, much smaller models perform as least as well as the standard baseline. On a TIMIT phone recognition task, we analyze the tying structures introduced, and discuss the implications for building better acoustic models.
- Hui Lin, Li Deng, Jasha Droppo, Dong Yu, and Alex Acero, Learning Methods in Multilingual Speech Recognition, in NIPS Workshop, Whistler, BC, Canada, Microsoft, December 2008.
One key issue in developing learning methods for multilingual acoustic modeling in large vocabulary automatic speech recognition (ASR) applications is to maximize the benefit of boosting the acoustic training data from multiple source languages while minimizing the negative effects of data impurity arising from language “mismatch”. In this paper, we introduce two learning methods, semiautomatic unit selection and global phonetic decision tree, to address this issue via effective utilization of acoustic data from multiple languages. The semi-automatic unit selection is aimed to combine the merits of both data-driven and knowledgedriven approaches to identifying the basic units in multilingual acoustic modeling. The global decision-tree method allows clustering of cross-center phones and cross-center states in the HMMs, offering the potential to discover a better sharing structure beneath the mixed acoustic dynamics and context mismatch caused by the use of multiple languages’ acoustic data. Our preliminary experiment results show that both of these learning methods improve the performance of multilingual speech recognition.
- Jasha Droppo, Michael L. Seltzer, Alex Acero, and Y.-H. Chiu, Towards a non-parametric acoustic model: an acoustic decision tree for observation probability calculation, in Proceedings of Interspeech, International Speech Communication Association, Brisbane, Australia, September 2008.
- Dong Yu, Li Deng, Jasha Droppo, Jian Wu, Yifan Gong, and Alex Acero, Robust speech recognition using cepstral minimum-mean-square-error noise suppressor, in IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 5, Institute of Electrical and Electronics Engineers, Inc., July 2008.
We present an efficient and effective nonlinear feature-domain noise suppression algorithm, motivated by the minimum mean-square-error (MMSE) optimization criterion, for noiserobust speech recognition. Distinguishing from the log-MMSE spectral amplitude noise suppressor proposed by Ephraim and Malah (E&M), our new algorithm is aimed to minimize the error expressed explicitly for the Mel-frequency cepstra instead of discrete Fourier transform (DFT) spectra, and it operates on the Mel-frequency filter bank’s output. As a consequence, the statistics used to estimate the suppression factor become vastly different from those used in the E&M log-MMSE suppressor. Our algorithm is significantly more efficient than the E&M’s log-MMSE suppressor since the number of the channels in the Mel-frequency filter bank is much smaller (23 in our case) than the number of bins (256) in DFT.We have conducted extensive speech recognition experiments on the standard Aurora-3 task. The experimental results demonstrate a reduction of the recognition word error rate by 48% over the standard ICSLP02 baseline, 26% over the cepstral mean normalization baseline, and 13% over the popular E&M’s log-MMSE noise suppressor. The experiments also show that our new algorithm performs slightly better than the ETSI advanced front end (AFE) on the well-matched and mid-mismatched settings, and has 8% and 10% fewer errors than our earlier SPLICE (stereo-based piecewise linear compensation for environments) system on these settings, respectively.
- Luis Buera, Jasha Droppo, and Alex Acero, Speech Enhancement using a Pitch Predictive Model, in Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing, Institute of Electrical and Electronics Engineers, Inc., April 2008.
In this paper we present two new methods for speech enhancement based on the previously publised fine pitch model (FPM) for voiced speech. The first method (FPM-NE) uses the FPM to produce a nonstationary noise estimate that can be used in any standard speech enhancement system. In this method, the FPM is used indirectly to perform speech enhancement. The second method we describe (FPM-SE) uses the FPM directly to perform speech enhancement. We present a study of the behavior of the two models on the standard Aurora 2 task, and demonstrate improvements of over 45% average word error rate reduction over the multi-style baseline.
- Ivan Tashev, Jasha Droppo, Michael Seltzer, and Alex Acero, Robust Design of Wideband Loudspeaker Arrays, in Proc. of International Conference on Audio, Speech and Signal Processing, Institute of Electrical and Electronics Engineers, Inc., Las Vegas, USA, April 2008.
Loudspeaker arrays usually are used in professional sound reinforcement systems to provide uniform sound coverage of the listening area. They can also be used for focusing the sound to the user for semi-private communication and reducing the overall noise pollution. In this paper we describe a procedure for designing broadband beamformers for loudspeaker arrays that is robust to the manufacturing tolerances of the loudspeakers – the limiting factor for achieving high directivity. The designed beamformer is evaluated using simulations and measurements of actual loudspeaker array.
- Dong Yu, Li Deng, Jasha Droppo, Jian Wu, Yifan Gong, and Alex Acero, A Minimum Mean-Square-Error Noise Reduction Algorithm on Mel-Frequency Cepstra for Robust Speech Recognition, in Proc. ICASSP, Institute of Electrical and Electronics Engineers, Inc., April 2008.
We present a non-linear feature-domain noise reduction algorithm based on the minimum mean square error (MMSE) criterion on Mel-frequency cepstra (MFCC) for environment-robust speech recognition. Distinguishing from the MMSE enhancement in log spectral amplitude proposed by Ephraim and Malah (E&M) , the new algorithm presented in this paper develops the suppression rule that applies to power spectral magnitude of the filter-banks’ outputs and to MFCC directly, making it demonstrably more effective in noise-robust speech recognition. The noise variance in the new algorithm contains a significant term resulting from instantaneous phase asynchrony between clean speech and mixing noise, missing in the E&M algorithm. Speech recognition experiments on the standard Aurora-3 task demonstrate a reduction of word error rate by 48% against the ICSLP02 baseline, by 26% against the cepstral mean normalization baseline, and by 13% against the conventional E&M log-MMSE noise suppressor. The new algorithm is also much more efficient than E&M noise suppressor since the number of the channels in the Mel-frequency filter bank is much smaller (23 in our case) than the number of bins in the FFT domain (256). The results also show that our algorithm performs slightly better than the ETSI AFE on the well-matched and mid-mismatched settings.
- Jasha Droppo and Alex Acero, Environmental Robustness, in Benesty, Sondhi, Huang (eds) Handbook of Speech Processing, Springer, 2008.
- Jasha Droppo and Alex Acero, A Fine Pitch Model for Speech, in Proc. Interspeech Conference, International Speech Communication Association, August 2007.
An accurate model for the structure of speech is essential to many speech processing applications, including speech en-hancement, synthesis, recognition, and coding. This paper ex-plores some deficiencies of standard harmonic methods of mod-eling voiced speech. In particular, they ignore the effect of fun-damental frequency changing within an analysis frame, and the fact that the fundamental frequency is not a continuously vary-ing parameter, but a side effect of a series of discrete events. We present an alternative, time-series based framework for modeling the voicing structure of speech called the fine pitch model. By precisely modeling the voicing structure, it can more accurately account for the content in a voiced speech segment. Index Terms: speech analysis, pitch estimation, fundamental
- Chris White, Jasha Droppo, Alex Acero, and Julian Odell, Maximum Entropy Confidence Estimation for Speech Recognition, in Proc. ICASSP, Institute of Electrical and Electronics Engineers, Inc., Hawaii, April 2007.
For many automatic speech recognition (ASR) applications, it is useful to predict the likelihood that the recognized string contains an error. This paper explores two modifications of a classic design. First, it replaces the standard maximum likelihood classifier with a maximum entropy classifier. The maximum entropy framework carries the dual advantages discriminative training and reasonable generalization. Second, it includes a number of alternative features. Our ASR system is heavily pruned, and often produces recognition lattices with only a single path. These alternate features are meant to serve as a surrogate for the typical features that can be computed from a rich lattice. We show that the maximum entropy classifier easily outperforms the standard baseline system, and the alternative features provide consistent gains for all of our test sets.
- Jasha Droppo and Alex Acero, Joint Discriminative Front End and Back End Training for Improved Speech Recognition Accuracy, in Proc. ICASSP, Institute of Electrical and Electronics Engineers, Inc., Toulouse, France, May 2006.
This paper presents a general discriminative training method for both the front end feature extractor and back end acoustic model of an automatic speech recognition system. The front end and back end parameters are jointly trained using the Rprop algorithm against a maximum mutual information (MMI) objective function. Results are presented on the Aurora 2 noisy English digit recognition task. It is shown that discriminative training of the front end or back end alone can improve accuracy, but joint training is considerably better.
- Ivan Tashev, Jasha Droppo, and Alex Acero, Suppression Rule for Speech Recognition Friendly Noise Suppressors, in Proceedings of Eight International Conference Digital Signal Processing and Applications DSPA’06, Moscow, Russia, March 2006.
- Jasha Droppo, Milind Mahajan, Asela Gunawardana, and Alex Acero, How to Train a Discriminative Front End with Stochastic Gradient Descent and Maximum Mutual Information, in Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding, Institute of Electrical and Electronics Engineers, Inc., Puerto Rico, December 2005.
This paper presents a general discriminative training method for the front end of an automatic speech recognition system. The SPLICE parameters of the front end are trained using stochastic gradient descent (SGD) of a maximum mutual information (MMI) objective function. SPLICE is chosen for its ability to approximate both linear and non-linear transformations of the feature space. SGD is chosen for its simplicity of implementation. Results are presented on both the Aurora 2 small vocabulary task and the WSJ Nov-92 medium vocabulary task. It is shown that the discriminative front end is able to consistently increase system accuracy across different front end configurations and tasks.
- A. Subramanya, Z. Zhang, Z. Liu, Jasha Droppo, and Alex Acero, A Graphical Model for Multi-Sensory Speech Processing in Air-and-Bone Conductive Microphones, in Proc. of the Interspeech Conference, International Speech Communication Association, Lisbon, Portugal, September 2005.
In continuation of our previous work on using an air-and-boneconductive microphone for speech enhancement, in this paper we propose a graphical model based approach to estimating the clean speech signal given the noisy observations in the air sensor. We also show how the same model can be used as a speech/non-speech classifier. With the aid of MOS (mean opinion score) tests we show, that the performance of the proposed model is better in comparison to our previously proposed direct filtering algorithm.
- Jasha Droppo and Alex Acero, Maximum Mutual Information SPLICE Transform for Seen and Unseen Conditions, in Proc. Interspeech Conference, International Speech Communication Association, September 2005.
SPLICE is a front-end technique for automatic speech recog-nition systems. It is a non-linear feature space transformation meant to increase recognition accuracy. Our previous work has shown how to train SPLICE to perform speech feature en-hancement. This paper evaluates a maximum mutual informa-tion (MMI) based discriminative training method for SPLICE. Discriminative techniques tend to excel when the training and testing data are similar, and to degrade performance signifi-cantly otherwise. This paper explores both cases in detail us-ing the Aurora 2 corpus. The overall recognition accuracy of the MMI-SPLICE system is slightly better than the Advanced Front End standard from ETSI, and much better than previ-ous SPLICE training algorithms. Most notably, it achieves this without explicitly resorting to the standard techniques of envi-ronment modeling, noise modeling or spectral subtraction.
- Li Deng, J. Wu, Jasha Droppo, and Alex Acero, Analysis and Comparison of Two Speech Feature Extraction/Compensation Algorithms, in IEEE Signal Processing Letters, vol. 12, no. 6, pp. 477–480, Institute of Electrical and Electronics Engineers, Inc., June 2005.
Two feature extraction and compensation algorithms, feature-space minimum phone error (fMPE), which contributed to the recent significant progress in conversational speech recognition, and stereo-based piecewise linear compensation for environments (SPLICE), which has been used successfully in noise-robust speech recognition, are analyzed and compared. These two algorithms have been developed by very different motivations and been applied to very different speech-recognition tasks as well. While the mathematical construction of the two algorithms is ostensibly different, in this report, we establish a direct link between them. We show that both algorithms in the run-time operation accomplish feature extraction/compensation by adding a posterior-based weighted sum of “correction vectors,” or equivalently the column vectors in the fMPE projection matrix, to the original, uncompensated features. Although the published fMPE algorithm empirically motivates such a feature extraction operation as “a reasonable starting point for training,” our analysis proves that it is a natural consequence of the rigorous minimum mean square error (MMSE) optimization rule as developed in SPLICE. Further, we review and compare related speech-recognition results with the use of fMPE and SPLICE algorithms. The results demonstrate the effectiveness of discriminative training on the feature extraction parameters (i.e., projection matrix in fMPE and equivalently correction vectors in SPLICE). The analysis and comparison of the two algorithms provide useful insight into the strong success of fMPE and point to further algorithm improvement and extension.
- Li Deng, Jian Wu, Jasha Droppo, and Alex Acero, Dynamic Compensation of HMM Variances Using the Feature Enhancement Uncertainty Computed From a Parametric Model of Speech Distortion, in IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 412–421, Institute of Electrical and Electronics Engineers, Inc., May 2005.
This paper presents a new technique for dynamic, frame-by-frame compensation of the Gaussian variances in the hidden Markov model (HMM), exploiting the feature variance or uncertainty estimated during the speech feature enhancement process, to improve noise-robust speech recognition. The new technique provides an alternative to the Bayesian predictive classification decision rule by carrying out an integration over the feature space instead of over the model-parameter space, offering a much simpler system implementation, lower computational cost, and dynamic compensation capabilities at the frame level. The computation of the feature enhancement variances is carried out using a probabilistic and parametric model of speech distortion, free from the use of any stereo training data. Dynamic compensation of the Gaussian variances in the HMM recognizer is derived, which is simply enlarging the HMM Gaussian variances by the feature enhancement variances. Experimental evaluation using the full Aurora2 test data sets demonstrates a significant digit error rate reduction, averaged over all noisy and signal-to-noise-ratio conditions, compared with the baseline that did not exploit the enhancement variance information. When the true enhancement variances are used, further dramatic error rate reduction is observed, indicating the strong potential for the new technique and the strong need for high accuracy in estimating the variances associated with feature enhancement. All the results, using either the true variances of the enhanced features or the estimated ones, show that the greatest contribution to recognizer’s performance improvement is due to the use of the uncertainty for the static features, next due to the delta features, and the least due to the delta–delta features.
- Z. Liu, A. Subramanya, Z. Zhang, Jasha Droppo, and Alex Acero, Leakage Model and Teeth Clack Removal for Air- and Bone-Conductive Integrated Microphones, in Proc. ICASSP, Institute of Electrical and Electronics Engineers, Inc., Philadelphia, March 2005.
Continuing our previous work [1, 2] on using air- and bone-conductive integrated microphones, and in particular on using the direct filtering approach  for speech enhancement in noisy environments, we present in this paper a refined version of the direct filtering algorithm. The new algorithm takes into account explicitly the leakage of background noise into the bone channel. We also present a new algorithm that detects and removes an artifact known as teeth clacks. Experiments show that the addition of the above algorithms improves system performance to a large extent even in highly nonstationary noisy environments.
- M. L. Seltzer, Alex Acero, and Jasha Droppo, Robust Bandwidth Extension of Noise-corrupted Narrowband Speech, in Proc. Interspeech Conference, International Speech Communication Association, 2005.
We present a new bandwidth extension algorithm for convert-ing narrowband telephone speech into wideband speech using a transformation in the mel cepstral domain. Unlike previous ap-proaches, the proposed method is designed specifically for band-width extension of narrowband speech that has been corrupted by environmental noise. We show that by exploiting previous re-search in mel cepstrum feature enhancement, we can create a uni-fied probabilistic framework under which the feature denoising and bandwidth extension processes are tightly integrated using a single shared statistical model. By doing so, we are able to both denoise the observed narrowband speech and robustly extend its bandwidth in a jointly optimal manner. A series of experiments on clean and noise-corrupted narrowband speech is performed to validate our approach.
- Zicheng Liu, Zhengyou Zhang, Alex Acero, Jasha Droppo, and Xuedong Huang, Direct Filtering for Air- and Bone-Conductive Microphones, in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Institute of Electrical and Electronics Engineers, Inc., Siena, Italy, September 2004.
Air- and bone-conductive integrated microphones have been introduced by the authors [5,4] for speech enhancement in noisy environments. In this paper, we present a novel technique, called direct filtering, to combine the two channels from the air- and bone-conductive microphone for speech enhancement. Compared to the previous technique, the advantage of the direct filtering is that it does not require any training, and it is speaker independent. Experiments show that this technique effectively removes noises and significantly improves speech recognition accuracy even in highly non-stationary noisy environments.
- Jasha Droppo and Alex Acero, Noise Robust Speech Recognition with a Switching Linear Dynamic Model, in Proc. ICASSP, IEEE, Montreal, Canada, May 2004.
Model based feature enhancement techniques are constructed from acoustic models for speech and noise, together with a model of how the speech and noise produce the noisy observations. Most techniques incorporate either Gaussian mixture models (GMM) or hidden Markov models (HMM). This paper explores using a switching linear dynamic model (LDM) for the clean speech. The linear dynamics of the model capture the smooth time evolution of speech. The switching states of the model capture the piecewise stationary characteristics of speech. However, incorporating a switching LDM causes the enhancement problem to become intractable. With a GMM or an HMM, the enhancement running time is proportional to the length of the utterance. The switching LDM causes the running time to become exponential in the length of the utterance. To overcome this drawback, the standard generalized pseudo-Bayesian technique is used to provide an approximate solution of the enhancement problem. We present preliminary results demonstrating that, even with relatively small model sizes, substantial word error rate improvement can be achieved.
- Li Deng, Jasha Droppo, and Alex Acero, Estimating cepstrum of speech under the presence of noise using a joint prior of static and dynamic features, in IEEE Transactions on Speech and Audio Processing, vol. 12, no. 3, pp. 218–233, Institute of Electrical and Electronics Engineers, Inc., May 2004.
In this paper, we present a new algorithm for statistical speech feature enhancement in the cepstral domain. The algorithm exploits joint prior distributions (in the form of Gaussian mixture) in the clean speech model, which incorporate both the static and frame-differential dynamic cepstral parameters. Full posterior probabilities for clean speech given the noisy observation are computed using a linearized version of a nonlinear acoustic distortion model, and, based on this linear approximation, the conditional minimum mean square error (MMSE) estimator for the clean speech feature is derived rigorously using the full posterior. The final form of the derived conditional MMSE estimator is shown to be a weighted sum of three separate terms, and the sum is weighted again by the posterior for each of the mixture component in the speech model. The first of the three terms is shown to arrive naturally from the predictive mechanism embedded in the acoustic distortion model in absence of any prior information. The remaining two terms result from the speech model using only the static prior and only the dynamic prior, respectively. Comprehensive experiments are carried out using the Aurora2 database to evaluate the new algorithm. The results demonstrate significant improvement in noise-robust recognition accuracy by incorporating the joint prior for both static and dynamic parameter distributions in the speech model, compared with using only the static or dynamic prior and with using no prior.
- Li Deng, Jasha Droppo, and Alex Acero, Enhancement of log Mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise, in IEEE Transactions on Speech and Audio Processing, vol. 12, no. 2, pp. 133–143, Institute of Electrical and Electronics Engineers, Inc., March 2004.
This paper presents a novel speech feature enhancement technique based on a probabilistic, nonlinear acoustic environment model that effectively incorporates the phase relationship (hence phase sensitive) between the clean speech and the corrupting noise in the acoustic distortion process. The core of the enhancement algorithm is the MMSE (minimum mean square error) estimator for the log Mel power spectra of clean speech based on the phase-sensitive environment model, using highly efficient single-point, second-order Taylor series expansion to approximate the joint probability of clean and noisy speech modeled as a multivariate Gaussian. Since a noise estimate is required by the MMSE estimator, a high-quality, sequential noise estimation algorithm is also developed and presented. Both the noise estimation and speech feature enhancement algorithms are evaluated on the Aurora2 task of connected digit recognition. Noise-robust speech recognition results demonstrate that the new acoustic environment model which takes into account the relative phase in speech and noise mixing is superior to the earlier environment model which discards the phase under otherwise identical experimental conditions. The results also show that the sequential MAP (maximum a posteriori) learning for noise estimation is better than the sequential ML (maximum likelihood) learning, both evaluated under the identical phase-sensitive MMSE enhancement condition.
- Li Deng, Ye-Yi Wang, Kuansan Wang, Alex Acero, Hsiao Hon, Jasha Droppo, C. Boulis, Derek Jacoby, Milind Mahajan, Ciprian Chelba, and Xuedong Huang, Speech and language processing for multimodal human-computer interaction (Invited Article) , in Journal of VLSI Signal Processing Systems (Special issue on Real-World Speech Processing), vol. 36, no. 2-3, pp. 161 - 187, Kluwer Academic , 2004.
In this paper, we describe our recent work at Microsoft Research, in the project codenamed Dr. Who, aimed at the development of enabling technologies for speech-centric multimodal human-computer interaction. In particular, we present in detail MiPad as the first Dr. Who's application that addresses specifically the mobile user interaction scenario. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution to the current prevailing problem of pecking with tiny styluses or typing on minuscule keyboards in today's PDAs or smart phones. Despite its current incomplete implementation, we have observed that speech and pen have the potential to significantly improve user experience in our user study reported in this paper. We describe in this system-oriented paper the main components of MiPad, with a focus on the robust speech processing and spoken language understanding aspects. The detailed MiPad components discussed include: distributed speech recognition considerations for the speech processing algorithm design; a stereo-based speech feature enhancement algorithm used for noise-robust front-end speech processing; Aurora2 evaluation results for this front-end processing; speech feature compression (source coding) and error protection (channel coding) for distributed speech recognition in MiPad; HMM-based acoustic modeling for continuous speech recognition decoding; a unified language model integrating context-free grammar and N-gram model for the speech decoding; schema-based knowledge representation for the MiPad's personal information management task; a unified statistical framework that integrates speech recognition, spoken language understanding and dialogue management; the robust natural language parser used in MiPad to process the speech recognizer's output; a machine-aided grammar learning and development used for spoken language understanding for the MiPad task; Tap & Talk multimodal interaction and user interface design; back channel communication and MiPad's error repair strategy; and finally, user study results that demonstrate the superior throughput achieved by the Tap & Talk multimodal interaction over the existing pen-only PDA interface. These user study results highlight the crucial role played by speech in enhancing the overall user experience in MiPad-like human-computer interaction devices.
- Zhengyou Zhang, Z. Liu, M. Sinclair, A. Acero, Li Deng, J. Droppo, Xuedong Huang, and Yanli Zheng, Multisensory microphones for robust speech detection, enhancement, and recognition, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, May 2004, IEEE, 2004.
- J. Wu, Jasha Droppo, Li Deng, and Alex Acero, A Noise-Robust ASR Front-End Using Wiener Filter Constructed from MMSE Estimation of Clean Speech and Noise, in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Institute of Electrical and Electronics Engineers, Inc., U.S. Virgin Islands, December 2003.
In this paper, we present a novel two-stage framework of designing a noise-robust front-end for automatic speech recognition. In the first stage, a parametric model of acoustic distortion is used to estimate the clean speech and noise spectra in a principled way so that no heuristic parameters need to set manually. To reduce possible flaws caused by the simplifying assumptions in the parametric model, a second-stage Wiener filtering is applied to further reduce the noise while preserving speech spectra unharmed. This front-end is evaluated on the Aurora2 task. For the multi-condition training scenario, a relative error reduction of 28.4% is achieved.
- Y. Zheng, Z. Liu, Z. Zhang, M. Sinclair, Jasha Droppo, Li Deng, Xuedong Huang, and Alex Acero, Air and Bone-Conductive Integrated Microphones for Robust Speech Detection and Enhancement, in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Institute of Electrical and Electronics Engineers, Inc., U.S. Virgin Islands, December 2003.
We present a novel hardware device that combines a regular microphone with a bone-conductive microphone. The device looks like a regular headset and it can be plugged into any machine with a USB port. The bone-conductive microphone has an interesting property: it is insensitive to ambient noise and captures the low frequency portion of the speech signals. Thanks to the signals from the bone-conductive microphone, we are able to detect very robustly whether the speaker is talking, eliminating more than 90% of background speech. Furthermore, by combining both channels, we are able to significantly remove background speech even when the background speaker speaks at the same time as the speaker wearing the headset.
- Li Deng, Jasha Droppo, and Alex Acero, Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition, in IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 568–580, Institute of Electrical and Electronics Engineers, Inc., November 2003.
We describe a novel algorithm for recursive estimation of nonstationary acoustic noise which corrupts clean speech, and a successful application of the algorithm in the speech feature enhancement framework of noise-normalized SPLICE for robust speech recognition. The noise estimation algorithm makes use of a nonlinear model of the acoustic environment in the cepstral domain. Central to the algorithm is the innovative iterative stochastic approximation technique that improves piecewise linear approximation to the nonlinearity involved and that subsequently increases the accuracy for noise estimation. We report comprehensive experiments on SPLICE-based, noise-robust speech recognition for the AURORA2 task using the results of iterative stochastic approximation. The effectiveness of the new technique is demonstrated in comparison with a more traditional, MMSE noise estimation algorithm under otherwise identical conditions. The word error rate reduction achieved by iterative stochastic approximation for recursive noise estimation in the framework of noise-normalized SPLICE is 27.9% for the multicondition training mode, and 67.4% for the clean-only training mode, respectively, compared with the results using the standard cepstra with no speech enhancement and using the baseline HMM supplied by AURORA2. These represent the best performance in the clean-training category of the September-2001 AURORA2 evaluation. The relative error rate reduction achieved by using the same noise estimate is increased to 48.40% and 76.86%, respectively, for the two training modes after using a better designed HMM system. The experimental results demonstrated the crucial importance of using the newly introduced iterations in improving the earlier stochastic approximation technique, and showed sensitivity of the noise estimation algorithm's performance to the forgetting factor embedded in the algorithm.
- Jasha Droppo, Li Deng, and Alex Acero, A Comparison of Three Non-Linear Observation Models for Noisy Speech Features, in Proc. Eurospeech Conference, International Speech Communication Association, Geneva, Switzerland, September 2003.
This paper reports our recent efforts to develop a uni£ed, non-linear, stochastic model for estimating and removing the effects of additive noise on speech cepstra. The complete system consists of prior models for speech and noise, an observation model, and an inference algorithm. The observation model quantifies the relationship between clean speech, noise, and the noisy observation. Since it is expressed in terms of the log mel-frequency filter-bank features, it is non-linear. The inference algorithm is the procedure by which the clean speech and noise are estimated from the noisy observation. The most critical component of the system is the observation model. This paper derives a new approximation strategy and compares it with two existing approximations. It is shown that the new approximation uses half the calculation, and produces equivalent or improved word accuracy scores, when compared to previous techniques. We present noise-robust recognition results on the standard Aurora 2 task.
- Mike Seltzer, Jasha Droppo, and Alex Acero, A Harmonic-Model-Based Front End for Robust Speech Recognition, in Proc. Eurospeech Conference, International Speech Communication Association, Geneva, Switzerland, September 2003.
- Li Deng, Jasha Droppo, and Alex Acero, Incremental Bayes Learning with Prior Evolution for Tracking Non-Stationary Noise Statistics from Noisy Speech Data, in Proc. ICASSP, Institute of Electrical and Electronics Engineers, Inc., Hong Kong, April 2003.
In this paper, a new approach to sequential estimation of the time-varying prior parameters of nonstationary noise is presented using the log-spectral or cepstral data of the corrupted noisy speech. Incremental Bayes learning is developed to provide a basis for noise prior evolution, recursively updating the noise prior statistics (mean and variance) using the approximate Gaussian posterior computed at the preceding time step. The algorithm for noise prior evolution is derived in detail, and is evaluated using the Aurora2 database with the root-mean-square (RMS) error measure. Experimental results show that when the time-varying variance and mean of the nonstationary noise prior are estimated and exploited, superior performance is achieved compared with using either no noise prior information or using the time-invariant, fixed mean and variance in the noise prior distribution.
- Li Deng, Alex Acero, Ye-Yi Wang, Kuansan Wang, Hsiao-Wuen Hon, Jasha Droppo, Milind Mahajan, and XD Huang, A speech-centric perspective for human-computer interface, in Proc. of the IEEE Fifth Workshop on Multimedia Signal Processing, Institute of Electrical and Electronics Engineers, Inc., December 2002.
Speech technology has been playing a central role in enhancing human-machine interactions, especially for small devices for which CUI has obvious limitations. The speechcentric perspective for hnman-compnter interface advanced in this paper derives from the view that speech is the only natural and expressive modality to enable people to access information from and to interact with any device. In this paper, we describe the work conducted at Microsoft Research, in the project codenamed &.Who, aimed at the development of enabling technologies for speech-centric multimodal human-computer interaction. In particular, we present MiF'ad as the first Dr. Who's application that addresses specifically the mobile user interaction scenario. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fuUy integrates continuous speech recognition and spoken language understanding, and provides a novel solution to the current prevailing problem of pecking with tiny styluses or typing on minuscule keyboards in today's PDAs or smart phones.
- Li Deng, Jasha Droppo, and Alex Acero, Exploiting Variances in Robust Feature Extraction Based on a Parametric Model of Speech Distortion, in Proc. International Conference on Spoken Language Processing, Denver, Colorado, September 2002.
This paper presents a technique that exploits the denoised speech’s variance, estimated during the speech feature enhancement process, to improve noise-robust speech recognition. This technique provides an alternative to the Bayesian predictive classification decision rule by carrying out an integration over the feature space instead of over the model-parameter space, offering a much simpler system implementation and lower computational cost. We extend our earlier work by using a new approach, based on a parametric model of speech distortion and thus free from the use of any stereo training data, to statistical feature enhancement, for which a novel algorithm for estimating the variance of the enhanced speech features is developed. Experimental evaluation using the full Aurora2 test data sets demonstrates an 11.4% digit error rate reduction averaged over all noisy and SNR conditions, compared with the best technique we have developed  prior to this work that did not exploit the variance information and that required no stereo training data.
- Jasha Droppo, Li Deng, and Alex Acero, Evaluation of SPLICE on the Aurora 2 and 3 Tasks, in Proc. International Conference on Spoken Language Processing, International Speech Communication Association, Denver, Colorado, September 2002.
Stereo-based Piecewise Linear Compensation for Environments (SPLICE) is a general framework for removing distortions from noisy speech cepstra. It contains a non-parametric model for cepstral corruption, which is learned from two channels of training data. We evaluate SPLICE on both the Aurora 2 and 3 tasks. These tasks consist of digit sequences in five European languages. Noise corruption is both synthetic (Aurora 2) and realistic (Aurora 3). For both the Aurora 2 and 3 tasks, we use the same training and testing procedure provided with the corpora. By holding the back-end constant, we ensure that any increase in word accuracy is due to our front-end processing techniques. In the Aurora 2 task, we achieve a 76.86% average decrease in word error rate with clean acoustic models, and an overall improvement of 62.63%. For the Aurora 3 task, we achieve a 75.06% average decrease in word error rate for the high-mismatch experiment, and an overall improvement of 47.19%.
- Jasha Droppo, Alex Acero, and Li Deng, A Nonlinear Observation Model for Removing Noise from Corrupted Speech Log Mel-Spectral Energies, in Proc. International Conference on Spoken Language Processing, Denver, Colorado, September 2002.
In this paper we present a new statistical model, which describes the corruption to speech recognition Mel-frequency spectral features caused by additive noise. This model explicitly represents the effect of unknown phase together with the unobserved clean speech and noise as three hidden variables. We use this model to produce noise robust features for automatic speech recognition. The model is constructed in the log Mel-frequency feature domain. In addition to being linearly related to MFCC recognition parameters, we gain the advantage of low dimensionality and independence of the corruption across feature dimensions. We illustrate the surprising result that, even when the true noise Mel-frequency spectral feature is known, the traditional spectral subtraction formula is flawed. We show the new model can be used to derive a spectral subtraction formula which produces superior error rate results, and is less sensitive to tuning parameters. Finally, we present results demonstrating that the new model is more general than spectral subtraction, and can take advantage of a prior noise estimate to produce robust features, rather than relying on point estimates of noise.
- Li Deng, Jasha Droppo, and Alex Acero, Log-Domain Speech Feature Enhancement Using Sequential MAP Noise Estimation and a Phase-sensitive Model of the Acoustic Environment, in Proc. International Conference on Spoken Language Processing, Denver, Colorado, September 2002.
In this paper we present an MMSE (minimum mean square error) speech feature enhancement algorithm, capitalizing on a new probabilistic, nonlinear environment model that effectively incorporates the phase relationship between the clean speech and the corrupting noise in acoustic distortion. The MMSE estimator based on this phase-sensitive model is derived and it achieves high efficiency by exploiting single-point Taylor series expansion to approximate the joint probability of clean and noisy speech as a multivariate Gaussian. As an integral component of the enhancement algorithm, we also present a new sequential MAP-based nonstationary noise estimator. Experimental results on the Aurora2 task demonstrate the importance of exploiting the phase relationship in the speech corruption process captured by the MMSE estimator. The phasesensitive MMSE estimator reported in this paper performs significantly better than phase-insensitive spectral subtraction (54% error rate reduction), and also noticeably better than a phase-insensitive MMSE estimator as our previous state-of-the-art technique reported in  (7% error rate reduction), under otherwise identical experimental conditions of speech recognition.
- Jasha Droppo, Li Deng, and Alex Acero, Uncertainty Decoding with SPLICE for Noise Robust Speech Recognition, in Proc. ICASSP, Institute of Electrical and Electronics Engineers, Inc., Florida, May 2002.
Speech recognition front end noise removal algorithms have, in the past, estimated clean speech features from corrupted speech features. The accuracy of the noise removal process varies from frame to frame, and from dimension to dimension in the feature stream, due in part to the instantaneous SR of the input. In this paper, we show that localized knowledge of the accuracy of the noise removal process can be directly incorporated into the Gaussian evaluation within the decoder, to produce higher recognition accuracies. To prove this concept, we modify the SPLICE algorithm to output uncertainty information, and show that the combination of SPLICE with uncertainty decoding can remove 74.2% of the errors in a subset of the Aurora2 task.
- Li Deng, Jasha Droppo, and Alex Acero, A Bayesian Approach to Speech Feature Enhancement using the Dynamic Cepstral Prior, in Proc. ICASSP, Institute of Electrical and Electronics Engineers, Inc., Florida, May 2002.
A new Bayesian estimation framework for statistical feature extraction in the form of cepstral enhancement is presented, in which the joint prior distribution is exploited for both static and frame-differential dynamic cepstral parameters in the clean speech model. The conditional minimum mean square error (MMSE) estimator for the clean speech feature is derived using the full posterior probability for clean speech given the noisy observation. The final form of the estimator (for each mixture component) is a weighted sum of the prior information using the static and the dynamic priors separately, and of the prediction using the acoustic distortion model in absence of any prior information. Comprehensive noiserobust speech recognition experiments using the Aurora2 database demonstrate significant improvement in accuracy by incorporating the joint prior, compared with using only the static or dynamic prior and with using no prior.
- Li Deng, Kuansan Wang, Alex Acero, Hsiao-Wuen Hon, Jasha Droppo, Constantinos Boulis, Ye-Yi Wang, Derek Jacoby, Milind Mahajan, Ciprian Chelba, and Xuedong D. Huang, Distributed Speech Processing in MiPad’s Multimodal User Interface, in IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, vol. 10, no. 8, pp. 605-619, Institute of Electrical and Electronics Engineers, Inc., 2002.
This paper describes the main components of MiPad (Multimodal Interactive PAD) and especially its distributed speech processing aspects. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution for data entry in PDAs or smart phones, often done by pecking with tiny styluses or typing on minuscule keyboards. Our user study indicates that the throughput of MiPad is significantly superior to that of the existing pen-based PDA interface. Acoustic modeling and noise robustness in distributed speech recognition are key components in MiPad’s design and implementation. In a typical scenario, the user speaks to the device at a distance so that he or she can see the screen. The built-in microphone thus picks up a lot of background noise, which requires MiPad be noise robust. For complex tasks, such as dictating e-mails, resource limitations demand the use of a client–server (peer-to-peer) architecture, where the PDA performs primitive feature extraction, feature quantization, and error protection, while the transmitted features to the server are subject to further speech feature enhancement, speech decoding and understanding before a dialog is carried out and actions rendered. Noise robustness can be achieved at the client, at the server or both. Various speech processing aspects of this type of distributed computation as related to MiPad’s potential deployment are presented in this paper. Recent user interface study results are also described. Finally, we point out future research directions as related to several key MiPad functionalities.
- Li Deng, Jasha Droppo, and Alex Acero, Recursive Noise Estimation Using Iterative Stochastic Approximation for Stereo-based Robust Speech Recognition, in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding, Institute of Electrical and Electronics Engineers, Inc., Madonna di Campliglio, Italy, December 2001.
We present an algorithm for recursive estimation of parameters in a mildly nonlinear model involving incomplete data. In particular, we focus on the time-varying deterministic parameters of additive noise in the nonlinear model. For the nonstationary noise that we encounter in robust speech recognition, different observation data segments correspond to different noise parameter values. Hence, recursive estimation algorithms are more desirable than batch algorithms, since they can be designed to adaptively track the changing noise parameters. One such design based on the iterative stochastic approximation algorithm in the recursive-EM framework is described in this paper. This new algorithm jointly adapts time-varying noise parameters and the auxiliary parameters introduced to linearly approximate the nonlinear model. We present stereo-based robust speech recognition results for the AURORA task, which demonstrate the effectiveness of the new algorithm compared with a more traditional, MMSE noise estimation technique under otherwise identical experimental conditions.
- Jasha Droppo, Alex Acero, and Li Deng, Evaluation of the SPLICE Algorithm on the Aurora 2 Database, in Proc. Eurospeech Conference, International Speech Communication Association, Aalbodk, Denmark, September 2001.
This paper describes recent improvements to SPLICE, Stereobased Piecewise Linear Compensation for Environments, which produces an estimate of cepstrum of undistorted speech given the observed cepstrum of distorted speech. For distributed speech recognition applications, SPLICE can be placed at the server, thus limiting the processing that would take place at the client. We evaluated this algorithm on the Aurora2 task, which consists of digit sequences within the TIDigits database that have been digitally corrupted by passing them through a linear filter and/or by adding different types of realistic noises at SNRs ranging from 20dB to -5dB. On set A data, for which matched training data is available, we achieved a 66% decrease in word error rate over the baseline system with clean models. This preliminary result is of practical significance because in a server implementation, new noise conditions can be added as they are identified once the service is running.
- Li Deng, Alex Acero, L. Jiang, Jasha Droppo, and Xuedong Huang, High-Performance Robust Speech Recognition Using Stereo Training Data, in Proc. ICASSP, Institute of Electrical and Electronics Engineers, Inc., Salt Lake City, Utah, May 2001.
We describe a novel technique of SPLICE for high performance robust speech recognition. It is an efficient noise reduction and channel distortion compensation technique that makes effective use of stereo training data. In this paper, we present a version of SPLICE using the minimum mean square error decision, and describe an extension by training clusters of HMMs with SPLICE processing. Comprehensive results using a Wall Street Journal large vocabulary recognition task and with a wide range of noise types demonstrate superior performance of the SPLICE technique over that under noisy matched conditions (13% word error rate reduction). The new technique is also shown to consistently outperform the spectralsubtraction and the fixed CDCN noise reduction techniques. It is currently being integrated into the Microsoft MiPad, a new generation PDA prototype.
- Jasha Droppo, Alex Acero, and Li Deng, Efficient Online Acoustic Environment Estimation for FCDCN in a Continuous Speech Recognition System, in Proc. ICASSP, Institute of Electrical and Electronics Engineers, Inc., Salt Lake City, Utah, May 2001.
There exists a number of cepstral de-noising algorithms which perform quite well when trained and tested under similar acoustic environments, but degrade quickly under mismatched conditions. We present two key results that make these algorithms practical in real noise environments, with the ability to adapt to different acoustic environments over time. First, we show that it is possible to leverage the existing de-noising computations to estimate the acoustic environment on-line and in real time. Second, we show that it is not necessary to collect large amounts of training data in each environment–clean data with artificial mixing is sufficient. When this new method is used as a pre-processing stage to a large vocabulary speech recognition system, it can be made robust to a wide variety of acoustic environments. With synthetic training data, we are able to reduce the word error rate by 27%.
- Xuedong Huang, Alex Acero, C. Chelba, Li Deng, Jasha Droppo, D. Duchene, J. Goodman, Hsiao-Wuen Hon, D. Jacoby, L. Jiang, R. Loynd, Milind Mahajan, P. Mau, S. Meredith, S. Mughal, S. Neto, M. Plumpe, K. Stery, G. Venolia, Kuansan Wang, and Ye-Yi Wang, MIPAD: A Multimodal Interactive Prototype, in International Conference on Acoustics, Speech, and Signal Processing, Institute of Electrical and Electronics Engineers, Inc., Salt Lake City, Utah, USA, 2001.
Dr. Who is a Microsoft’s research project aiming at creating a speech-centric multimodal interaction framework, which serves as the foundation for the .NET natural user interface. MiPad is the application prototype that demonstrates compelling user advantages for ireless Personal Digital Assistant (PDA) devices, MiPad fully integrates continuous speech recognition (CSR) and spoken la nguage understanding (SLU) to enable users to accomplish many common tasks using a multimodal interface and wireless technologies. It tries to solve the problem of pecking with tiny styluses or typing on minuscule keyboards in today’s PDAs. Unlike a cellular phone, MiPad avoids speech-only interaction. It incorporates a built-in microphone that activates whenever a field is selected. As a user taps the screen or uses a built-in roller to navigate, the tapping action narrows the number of possible instructions for spoken understanding. MiPad currently runs on a Windows CE Pocket PC with a Windows 2000 machine where speech recognition is performed. The Dr Who CSR ngine uses a unified CFG and n -gram language model. The Dr Who SLU engine is based on a robust char t parser and a plan-based dialog manager. This paper discusses MiPad’s design, implementation work in progress, and preliminary user study in comparison to the existing pen-based PDA interface.
- Jasha Droppo and Alex Acero, Maximum a Posteriori Pitch Tracking, in Proc. International Conference on Spoken Language Processing, International Speech Communication Association, Sydney, Australia, December 1998.
A Maximum a posteriori framework for computing pitch tracks as well as voicing decisions is presented. The proposed algorithm consists of creating a time-pitch energy distribution based on predictable energy that improves on the normalized cross-correlation. A large database is used to evaluate the algorithm’s performance against two standard solutions, using glottal closure instants (GCI) obtained from electroglottogram (EGG) signals as a reference. The new MAP algorithm exhibits higher pitch accuracy and better voiced/unvoiced discrimination.
Last modified: Thursday, February 24, 2005
E-mail: jdroppo at microsoft dot com
U.S.Mail: Microsoft Corporation, One Microsoft Way, Redmond WA, 98052, USA
Tel: (425) 703-7114
Fax: (425) 706-7329