J. Ma and Li Deng
January 2004
In this paper, a novel mixture linear dynamic model
(MLDM) for speech recognition is developed and evaluated, where
several linear dynamic models are combined (mixed) to represent
different vocal-tract-resonance (VTR) dynamic behaviors and
the mapping relationships between the VTRs and the acoustic
observations. Each linear dynamic model is formulated as a set of
state-space equations, in which the target-directed property of the VTRs
is incorporated in the state equation and a linear regression
function serves as the observation equation, approximating
the nonlinear mapping relationship. A version of the generalized
EM algorithm is developed for learning the model parameters,
where the constraint that the VTR targets change at the segmental
level (rather than at the frame level) is imposed in the parameter
learning and model scoring algorithms. Speech recognition experiments
are carried out to evaluate the new model using the N-best
re-scoring paradigm in a Switchboard task. Compared with a
baseline recognizer using the triphone HMM acoustic model,
the new recognizer demonstrated improved performance under
several experimental conditions. Performance was shown to
improve as the number of mixture components in the model
increased.
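
To make the state-space formulation in the abstract more concrete, the following is a minimal sketch of a single mixture component, reduced to one scalar VTR for readability. The exact equations are an assumption: the abstract does not give them, so the sketch uses a common target-directed form, `x[k+1] = phi * x[k] + (1 - phi) * target + w[k]`, with a linear observation `o[k] = H * x[k] + h + v[k]`. The function name `simulate_segment` and all parameter names and values are hypothetical and chosen only for illustration.

```python
import random


def simulate_segment(x0, phi, target, H, h, n_frames,
                     state_std=0.0, obs_std=0.0, seed=0):
    """Simulate one segment of a scalar target-directed linear dynamic model.

    Assumed (not taken from the paper) equations:
        state:       x[k+1] = phi * x[k] + (1 - phi) * target + w[k]
        observation: o[k]   = H * x[k] + h + v[k]
    The target stays fixed for the whole segment, mirroring the constraint
    that VTR targets change at the segmental level, not the frame level.
    """
    rng = random.Random(seed)
    x = x0
    states, observations = [], []
    for _ in range(n_frames):
        states.append(x)
        # Linear regression from the hidden VTR state to the acoustic observation.
        observations.append(H * x + h + rng.gauss(0.0, obs_std))
        # Target-directed update: with 0 < phi < 1 the state moves toward the target.
        x = phi * x + (1.0 - phi) * target + rng.gauss(0.0, state_std)
    return states, observations


# Hypothetical values: a resonance starting at 500 Hz and heading to 1500 Hz.
states, observations = simulate_segment(
    x0=500.0, phi=0.9, target=1500.0, H=0.01, h=0.0, n_frames=50
)
print(round(states[0], 1), round(states[-1], 1))
```

With the noise terms set to zero, the state approaches the target geometrically at a rate set by `phi`. In a mixture of such models, each component would carry its own parameter set (here `phi`, `target`, `H`, `h`), which is one plausible reading of how different VTR dynamics and VTR-to-acoustics mappings are represented.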
| In: | IEEE Trans. on Speech and Audio Processing |
| Type: | Article |
| Pages: | 47-58 |
| Volume: | 12 |
| Number: | 1 |