



Acoustic Modeling and EARS Research


The ultimate challenge for speech recognition is to make it indistinguishable from the human speech perception system. At present, users interacting with any existing speech recognition system must remain fully aware that their conversation partner is a machine. The machine breaks easily if users speak in a casual, natural style, as if they were talking with a friend. To enable mainstream use of speech recognition, natural or free-style speaking on the user's part should not incur so many recognition errors that the system becomes unusable, as is the case today. Once the problem of free-style speech recognition or transcription is solved, many killer applications that are currently unimaginable will become possible. By that time, pervasive adoption of speech recognition will become natural.

It has been broadly hypothesized that new computational paradigms beyond the conventional HMM are needed to reach the goal of all-purpose recognition technology for unconstrained, natural human-human and human-machine speech, and that statistical models capitalizing on essential properties of the structure of natural speech and language are beneficial in establishing such paradigms. To pursue research in this direction, in early 2002 we submitted a proposal to the EARS program sponsored by DARPA detailing a novel approach to speech recognition based on structured speech models that incorporate hidden speech dynamics. The thrust of the proposed research is to pursue an approach based on the above hypothesis using advanced learning techniques, and to integrate the results into state-of-the-art speech recognition systems that can be used for effective, high-performance speech-to-text. The common thread tying together the various aspects of this research is a set of powerful statistical modeling techniques capable of effectively and parsimoniously characterizing long-span dependencies in natural human speech. We are currently developing several versions of a statistical generative model, called the structured speech model, which captures the internal (hidden) structure in the target-directed speech dynamics that flow naturally from one speech unit to another at the sentence level. This hidden structure represents some essential dynamic properties of natural speech articulation, yet it can be inferred automatically from transcribed acoustic data. The structured speech model can be used in data generation mode to enhance the HMM system, and can further be used to rapidly adapt the HMM parameters. It can also be used to perform conversational speech decoding directly.
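To make the idea of target-directed hidden dynamics concrete, the following is a minimal sketch, not the model developed in this project. It assumes a simple first-order recursion, x[t] = rate * x[t-1] + (1 - rate) * target, in which each speech unit pulls a hidden variable toward its own target. The function name, the rate value, and the example targets and durations are all hypothetical and chosen only for illustration.

```python
def target_directed_trajectory(targets, durations, rate=0.8, start=0.0):
    """Simulate a hidden trajectory that moves toward one target per unit.

    Assumed (illustrative) dynamics: x[t] = rate * x[t-1] + (1 - rate) * target.
    `targets` holds one target value per speech unit and `durations` holds
    the number of frames each unit lasts.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must lie in [0, 1)")
    if len(targets) != len(durations):
        raise ValueError("targets and durations must have the same length")

    x = start
    trajectory = []
    for target, n_frames in zip(targets, durations):
        for _ in range(n_frames):
            # The state carries over across unit boundaries, so the context
            # of the preceding unit shapes the current one.
            x = rate * x + (1.0 - rate) * target
            trajectory.append(x)
    return trajectory


# Hypothetical resonance targets in Hz; the middle unit lasts only 5 frames.
traj = target_directed_trajectory(
    targets=[500.0, 1500.0, 700.0],
    durations=[30, 5, 30],
    rate=0.8,
    start=500.0,
)
peak = max(traj)  # highest value reached during the short middle unit
```

Because the middle unit is short, the trajectory turns back before reaching its 1500 Hz target. This kind of target undershoot is one way a single smooth hidden variable can account for context-dependent reduction in casual speech.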

One initial line of work constructs an analytical function that provides an accurate nonlinear mapping from the vocal tract resonances (frequencies and bandwidths) to the acoustic features that go into the recognizer. A preliminary test of the viability of this approach is to perform the inverse mapping, recovering the vocal tract resonances from the observed acoustic features. We have succeeded in this test and have built a demo system.
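The text above does not give the form of the analytical function. As an illustration only, the sketch below assumes the acoustic features are LPC cepstra of an all-pole vocal tract model, for which each resonance with frequency f and bandwidth b contributes (2/n) * exp(-pi*n*b/fs) * cos(2*pi*n*f/fs) to the n-th cepstral coefficient. The inverse mapping is shown as a brute-force grid search over a single resonance frequency, which is a stand-in and not the method used in the demo system. The function names, the 8 kHz sampling rate, and the search grid are hypothetical.

```python
import math


def resonances_to_cepstrum(freqs_hz, bandwidths_hz, n_coeffs=12, sample_rate=8000.0):
    """Map resonance frequencies and bandwidths to all-pole (LPC) cepstra.

    Each resonance is a conjugate pole pair with radius exp(-pi * b / fs)
    and angle 2 * pi * f / fs; its cepstral contribution at order n is
    (2 / n) * radius**n * cos(n * angle).
    """
    cepstrum = []
    for n in range(1, n_coeffs + 1):
        c_n = 0.0
        for f, b in zip(freqs_hz, bandwidths_hz):
            decay = math.exp(-math.pi * n * b / sample_rate)
            c_n += (2.0 / n) * decay * math.cos(2.0 * math.pi * n * f / sample_rate)
        cepstrum.append(c_n)
    return cepstrum


def invert_single_resonance(observed, bandwidth_hz, grid_hz, sample_rate=8000.0):
    """Recover one resonance frequency by exhaustive search over grid_hz.

    Illustrative only: the bandwidth is assumed known and the candidate
    minimizing the squared error to the observed cepstrum is returned.
    """
    best_f, best_err = None, float("inf")
    for f in grid_hz:
        predicted = resonances_to_cepstrum([f], [bandwidth_hz], len(observed), sample_rate)
        err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
        if err < best_err:
            best_f, best_err = f, err
    return best_f


# Forward map a hypothetical 1200 Hz resonance, then recover it from the features.
observed = resonances_to_cepstrum([1200.0], [100.0])
recovered = invert_single_resonance(observed, 100.0, range(200, 3800, 50))
```

A practical inversion would have to handle several resonances at once, unknown bandwidths, and noisy features, typically with temporal smoothness constraints across frames; the grid search only shows that the forward mapping carries enough information to be inverted in a simple case.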

Central to our EARS research is a class of structured speech models (SSMs) and the associated recognizer architectures, designed to deal effectively with the variability in conversational speech. We have developed three specific SSM-based architectures for speech recognition that balance modeling accuracy, training/decoding complexity, and implementation simplicity. They are piecewise-linear state-space modeling, hidden dynamic modeling, and hidden dynamic discretization, each resulting from a different way of simplifying the general structure of the SSM. Preliminary experiments and results obtained so far on several standard databases (TIDigits, TIMIT, and Switchboard) are described in the publications listed below, in our presentations at the EARS meetings, and in technical project reports to DARPA.

L. Deng, D. Yu, and A. Acero. "Structured speech modeling," (invited) IEEE Transactions on Speech and Audio Processing (Special Issue on Rich Transcription), Vol. 14, No. 5, Sept 2006, pp. 1492-1504.

D. Yu, L. Deng, and A. Acero. "A lattice search technique for long-contextual-span hidden trajectory model of speech," Speech Communication, Vol. 48, 2006, pp. 1214-1226.

L. Deng, A. Acero, and I. Bazzi. "Tracking vocal tract resonances using a quantized nonlinear function embedded in a temporal constraint," IEEE Transactions on Speech and Audio Processing, Vol. 14, No. 2, March 2006, pp. 425-434.

L. Deng, D. Yu, and A. Acero. "A bi-directional target-filtering model of speech coarticulation and reduction: Two-stage implementation for phonetic recognition," IEEE Transactions on Speech and Audio Processing, Vol. 14, No. 1, January 2006, pp. 256-265.

F. Seide, J.L. Zhou, and L. Deng. "Coarticulation modeling by embedding a target-directed hidden trajectory model into HMM - MAP decoding and evaluation," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, April 2003, Vol. I, pp. 748-751.

J.L. Zhou, F. Seide, and L. Deng. "Coarticulation modeling by embedding a target-directed hidden trajectory model into HMM - Models and training," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, April 2003, Vol. I, pp. 744-747.

L.J. Lee, H. Attias, and L. Deng. "Variational inference and learning for segmental switching state space models of hidden speech dynamics," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, April 2003, Vol. I, pp. 920-923.

I. Bazzi, A. Acero, and L. Deng. "An expectation-maximization approach for formant tracking using a parameter-free non-linear predictor," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, April 2003, Vol. I, pp. 464-467.


©2008 Microsoft Corporation. All rights reserved.