Hidden Dynamic Models for Speech Processing Applications

Leo Jingyu Lee

MSR-TR-2004-151 | August 2004

Proceedings of Ninth Conference on Uncertainty in Artificial Intelligence, Washington, DC

Human speech has a dual nature: its goal is to convey discrete linguistic symbols corresponding to the intended message, while the actual speech signal is produced by the continuous, smooth movement of the articulators with rich temporal structure. Humans exploit this dual nature to great advantage, but it presents a major challenge for both speech science and speech technology.

This thesis starts with the observation that the continuous or dynamic aspect of human speech is inadequately modeled in current speech technology, especially in state-of-the-art speech recognition systems, while much could be learned from recent advances in speech science. This motivates a study of articulatory dynamics, based on a recently available large-scale speech production database that provides simultaneous acoustic and articulatory measurements. Many insights and much valuable experience have been gained from this study and, as a result, a hidden dynamic model (HDM) that integrates the discrete and continuous nature of speech is proposed. However, it also turns out that articulatory dynamics are highly complicated and cannot be captured by simple models, so they are very difficult to put into an efficient computational framework for use in speech technology.

As a continuing effort to seek internal dynamics of human speech that can reflect the continuous shape change of the vocal tract and benefit current speech technology, the second part of the thesis turns to a study of vocal-tract-resonance (VTR) dynamics, built upon the insights and experience gained from studying articulatory dynamics. It verifies that VTR dynamics can be captured by simple dynamic equations, and a highly accurate and efficient piecewise linear mapping from VTR dynamics to the acoustic space is also designed. Two novel VTR tracking methods are developed in this part: one mimics manual tracking of VTR dynamics by human experts and uses advanced image processing methods (active contours); the other is the natural outcome of formulating an HDM for VTR dynamics and recovering the hidden dynamics by Kalman smoothing. The residual feature resulting from VTR tracking by HDM has also been used as an appended acoustic feature to improve a hidden Markov model (HMM) based phone recognizer on the TIMIT database.
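To make the idea of recovering hidden dynamics by Kalman smoothing concrete, the following is a minimal scalar sketch, not the thesis's actual formulation. It assumes a first-order target-directed state equation with known rate `r` and known `target`, and it replaces the piecewise linear acoustic mapping with a direct noisy observation of the hidden state. The function names (`simulate`, `kalman_smooth`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def simulate(n, r, target, q, s, x0, seed=0):
    """Draw a hidden trajectory and noisy observations from the assumed model.

    Assumed model (illustrative only, not taken from the thesis):
        x[t] = r * x[t-1] + (1 - r) * target + w,  w ~ N(0, q)
        y[t] = x[t] + v,                           v ~ N(0, s)
    """
    rng = random.Random(seed)
    xs, ys = [], []
    x = x0
    for _ in range(n):
        xs.append(x)
        ys.append(x + rng.gauss(0.0, s ** 0.5))
        x = r * x + (1 - r) * target + rng.gauss(0.0, q ** 0.5)
    return xs, ys


def kalman_smooth(ys, r, target, q, s, m0, p0):
    """Scalar Kalman filter followed by a Rauch-Tung-Striebel backward pass.

    m0 and p0 are the prior mean and variance of x[0].
    Returns the smoothed means and variances of the hidden state.
    """
    n = len(ys)
    pred_m, pred_p = [0.0] * n, [0.0] * n
    filt_m, filt_p = [0.0] * n, [0.0] * n

    # Forward pass: predict from the previous filtered estimate, then update.
    for t, y in enumerate(ys):
        if t == 0:
            pm, pp = m0, p0
        else:
            pm = r * filt_m[t - 1] + (1 - r) * target
            pp = r * r * filt_p[t - 1] + q
        gain = pp / (pp + s)
        pred_m[t], pred_p[t] = pm, pp
        filt_m[t] = pm + gain * (y - pm)
        filt_p[t] = (1 - gain) * pp

    # Backward pass: refine each filtered estimate using later observations.
    sm_m, sm_p = filt_m[:], filt_p[:]
    for t in range(n - 2, -1, -1):
        g = filt_p[t] * r / pred_p[t + 1]
        sm_m[t] = filt_m[t] + g * (sm_m[t + 1] - pred_m[t + 1])
        sm_p[t] = filt_p[t] + g * g * (sm_p[t + 1] - pred_p[t + 1])
    return sm_m, sm_p
```

For example, calling `simulate(200, 0.9, 1500.0, 25.0, 10000.0, 500.0)` and passing the resulting observations to `kalman_smooth` with the same model parameters yields a smoothed trajectory that follows the hidden state more closely than the raw observations do. A real VTR tracker would use a vector state and a nonlinear or piecewise linear observation model rather than this scalar identity mapping.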

The final part of the thesis is dedicated to arguably the most difficult and comprehensive speech processing application: automatic speech recognition (ASR). It first casts the HDM formulated for speech applications in the general framework of probabilistic graphical models in machine learning. However, it also becomes clear that exact inference and parameter learning for such a model are NP-hard. In order to use the HDM for speech recognition, this final part concentrates on developing novel and powerful variational EM algorithms. The effectiveness of the new algorithms has been demonstrated by extensive simulation experiments, and special concerns for speech recognition are also discussed.