R. Togneri and Li Deng
October 2004
In this paper, we present a state-space formulation of a neural-network-based
hidden dynamic model of speech whose parameters
are trained using an approximate EM algorithm. The training
makes use of the results of an off-the-shelf formant tracker (during
the vowel segments) to simplify the complex sufficient statistics
that would be required in the exact EM algorithm. The trained
model, consisting of the state equation for the target-directed vocal
tract resonance (VTR) dynamics on all classes of speech sounds
(including consonant closure) and the observation equation for
mapping from the VTR to acoustic measurement, is then used
to recover the unobserved VTR using an extended Kalman filter.
The results demonstrate accurate estimation of the VTRs, especially
those during rapid consonant-vowel or vowel-consonant
transitions and during consonant closure when the acoustic measurement
alone provides weak or no information to infer the VTR
values.
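
As a rough illustration of the recovery step described in the abstract, the following Python sketch runs one extended Kalman filter update for a target-directed state equation with a small neural-network observation mapping. Everything specific here is an assumption made for illustration and is not taken from the paper: the first-order form x[k+1] = phi * x[k] + (1 - phi) * target + noise, the scalar phi, the single tanh hidden layer, and all function and parameter names.

```python
import numpy as np


def mlp(x, W1, b1, W2, b2):
    """Hypothetical observation mapping: VTR state -> acoustic features."""
    return W2 @ np.tanh(W1 @ x + b1) + b2


def mlp_jacobian(x, W1, b1, W2):
    """Jacobian of mlp with respect to x, used to linearize the observation."""
    h = np.tanh(W1 @ x + b1)
    # d tanh(a)/da = 1 - tanh(a)^2, applied row-wise to W1
    return W2 @ ((1.0 - h ** 2)[:, None] * W1)


def ekf_step(x, P, o, phi, target, Q, R, W1, b1, W2, b2):
    """One EKF predict/update step (illustrative assumptions throughout).

    x, P   : previous state estimate and its covariance
    o      : current acoustic observation vector
    phi    : assumed scalar time constant in (0, 1)
    target : assumed per-segment VTR target vector
    Q, R   : state and observation noise covariances
    """
    n = x.shape[0]

    # Predict: the state moves toward the target (assumed linear dynamics).
    x_pred = phi * x + (1.0 - phi) * target
    P_pred = (phi ** 2) * P + Q

    # Update: linearize the nonlinear observation around the prediction.
    H = mlp_jacobian(x_pred, W1, b1, W2)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    innovation = o - mlp(x_pred, W1, b1, W2, b2)

    x_new = x_pred + K @ innovation
    P_new = (np.eye(n) - K @ H) @ P_pred
    return x_new, P_new
```

In this sketch, a very large R (an observation that carries almost no information, loosely analogous to consonant closure) makes the gain K nearly zero, so the estimate is driven toward the target by the state equation alone.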
In Proc. Int. Conf. on Spoken Language Processing
| Type | Inproceedings |