Hidden Dynamic Models for Speech Processing Applications

  • Leo Jingyu Lee

MSR-TR-2004-151

Proceedings of Ninth Conference on Uncertainty in Artificial Intelligence, Washington, DC

Human speech has a dual nature: its goal is to convey discrete linguistic symbols corresponding to the intended message, while the actual speech signal is produced by the continuous, smooth movement of the articulators and has rich temporal structure. Humans exploit this dual nature to great effect, but it has posed a major challenge for both speech science and speech technology.

This thesis starts from the observation that the continuous, dynamic aspect of human speech is inadequately modeled in current speech technology, especially in state-of-the-art speech recognition systems, while much could be learned from recent advances in speech science. This motivates a study of articulatory dynamics based on a recently available large-scale speech production database that provides simultaneous acoustic and articulatory measurements. The study yields many insights and, as a result, a hidden dynamic model (HDM) that integrates the discrete and continuous nature of speech is proposed. It also turns out, however, that articulatory dynamics are highly complicated and cannot be captured by simple models, so they are very difficult to put into an efficient computational framework for use in speech technology.

Continuing the search for internal dynamics of human speech that reflect the continuous shape change of the vocal tract and can benefit current speech technology, the second part of the thesis turns to a study of vocal-tract-resonance (VTR) dynamics, building on the insights gained from studying articulatory dynamics. It verifies that VTR dynamics can be captured by simple dynamic equations, and it designs a highly accurate and efficient piecewise linear mapping from VTR dynamics to the acoustic space. Two novel VTR tracking methods are developed in this part. The first mimics manual tracking of VTR dynamics by human experts and uses advanced image processing methods (active contours). The second is the natural outcome of formulating an HDM for VTR dynamics and recovering the hidden dynamics by Kalman smoothing. The residual feature resulting from VTR tracking by the HDM has also been used as an appended acoustic feature to improve a hidden Markov model (HMM) based phone recognizer on the TIMIT database.
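The idea of recovering hidden VTR dynamics by Kalman smoothing can be illustrated with a minimal sketch. It assumes a scalar, target-directed linear-Gaussian state equation and a single linear observation map, which is a deliberate simplification of the piecewise linear mapping described above. The function names, the parameter names (`a`, `u`, `q`, `c`, `d`, `r`), and all numeric values are illustrative assumptions, not the thesis's implementation.

```python
import random


def simulate(n, a, u, q, c, d, r, x0, seed=0):
    """Draw a hidden trajectory and noisy observations from the assumed model.

    Assumed model (hypothetical, for illustration only):
        x[t] = a * x[t-1] + (1 - a) * u + w,  w ~ N(0, q)   hidden state
        y[t] = c * x[t] + d + v,              v ~ N(0, r)   observation
    """
    rng = random.Random(seed)
    xs, ys = [], []
    x = x0
    for _ in range(n):
        x = a * x + (1 - a) * u + rng.gauss(0.0, q ** 0.5)
        xs.append(x)
        ys.append(c * x + d + rng.gauss(0.0, r ** 0.5))
    return xs, ys


def kalman_smooth(ys, a, u, q, c, d, r, m0, p0):
    """Scalar Kalman filter followed by a Rauch-Tung-Striebel backward pass.

    m0 and p0 are the mean and variance of the state just before the
    first observation. Returns smoothed means and variances.
    """
    n = len(ys)
    m_pred, p_pred = [0.0] * n, [0.0] * n
    m_filt, p_filt = [0.0] * n, [0.0] * n
    m_prev, p_prev = m0, p0
    for t, y in enumerate(ys):
        # Predict: push the previous posterior through the state equation.
        m_pred[t] = a * m_prev + (1 - a) * u
        p_pred[t] = a * a * p_prev + q
        # Update: correct the prediction with the new observation.
        s = c * c * p_pred[t] + r          # innovation variance
        k = p_pred[t] * c / s              # Kalman gain
        m_filt[t] = m_pred[t] + k * (y - (c * m_pred[t] + d))
        p_filt[t] = (1 - k * c) * p_pred[t]
        m_prev, p_prev = m_filt[t], p_filt[t]

    # Backward pass: refine each estimate using later observations.
    m_s, p_s = m_filt[:], p_filt[:]
    for t in range(n - 2, -1, -1):
        g = p_filt[t] * a / p_pred[t + 1]  # smoother gain
        m_s[t] = m_filt[t] + g * (m_s[t + 1] - m_pred[t + 1])
        p_s[t] = p_filt[t] + g * g * (p_s[t + 1] - p_pred[t + 1])
    return m_s, p_s
```

In this sketch, `u` plays the role of a resonance target that the hidden state moves toward at a rate set by `a`. A usage pattern would be to call `simulate` to generate a trajectory and then pass the observations to `kalman_smooth` with the same parameters. A full HDM would additionally switch parameters across discrete phonetic units, which this sketch does not attempt.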

The final part of the thesis is dedicated to arguably the most difficult and comprehensive speech processing application: automatic speech recognition (ASR). It first casts the HDM formulated for speech applications in the general framework of probabilistic graphical models in machine learning. However, it also becomes clear that exact inference and parameter learning for such a model are NP-hard. In order to use the HDM for speech recognition, this final part concentrates on developing novel variational EM algorithms. The effectiveness of the new algorithms is demonstrated by extensive simulation experiments, and special concerns for speech recognition are also discussed.