Target-Directed Mixture Dynamic Models for Spontaneous Speech Recognition

J. Ma and Li Deng


In this paper, a novel mixture linear dynamic model (MLDM) for speech recognition is developed and evaluated, in which several linear dynamic models are combined (mixed) to represent different vocal-tract-resonance (VTR) dynamic behaviors and the mapping relationships between the VTRs and the acoustic observations. Each linear dynamic model is formulated as a pair of state-space equations, where the target-directed property of the VTRs is incorporated in the state equation and a linear regression function is used for the observation equation to approximate the nonlinear VTR-to-acoustic mapping. A version of the generalized EM algorithm is developed for learning the model parameters, with the constraint that the VTR targets change at the segmental level (rather than at the frame level) imposed in both the parameter-learning and model-scoring algorithms. Speech recognition experiments evaluating the new model were carried out using the N-best re-scoring paradigm on a Switchboard task. Compared with a baseline recognizer using triphone HMM acoustic models, the new recognizer demonstrated improved performance under several experimental conditions, and performance was shown to increase with the number of mixture components in the model.
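As a minimal sketch of the kind of model described, the snippet below simulates one mixture component: a target-directed state equation that pulls the VTR state toward a segment-level target, and a linear-regression observation equation mapping the VTRs to acoustic frames. All symbols and values (`Phi`, `T`, `H`, `h`, the noise scales) are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

# Hypothetical single-component sketch (names and values are assumptions):
#   state:       x[t] = Phi @ x[t-1] + (I - Phi) @ T + w[t]  (target-directed VTR dynamics)
#   observation: o[t] = H @ x[t] + h + v[t]                  (linear approximation of the
#                                                             nonlinear VTR-to-acoustic mapping)
rng = np.random.default_rng(0)

dim_x, dim_o, frames = 3, 4, 50          # VTR dim, acoustic dim, frames in one segment
Phi = 0.9 * np.eye(dim_x)                # system matrix controlling the approach rate
T = np.array([500.0, 1500.0, 2500.0])    # segment-level VTR target (fixed within the segment)
H = rng.standard_normal((dim_o, dim_x))  # linear regression matrix (observation equation)
h = rng.standard_normal(dim_o)           # regression bias

x = np.zeros(dim_x)                      # initial VTR state
states, obs = [], []
for t in range(frames):
    # State noise w[t] and observation noise v[t] are zero-mean Gaussian here.
    x = Phi @ x + (np.eye(dim_x) - Phi) @ T + 10.0 * rng.standard_normal(dim_x)
    states.append(x.copy())
    obs.append(H @ x + h + 0.1 * rng.standard_normal(dim_o))

states = np.array(states)
```

Because the target `T` is held constant over the whole segment, the simulated VTR trajectory drifts toward it frame by frame, which is the segmental, target-directed behavior the abstract refers to; a full MLDM would mix several such components and fit `Phi`, `T`, `H`, `h` with a generalized EM algorithm.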


Publication type: Article
Published in: IEEE Trans. on Speech and Audio Processing