L. Lee, P. Fieguth, and Li Deng
This paper introduces a new speech production model that aims to synthesize natural speech in real time by modeling the key dynamic properties of the articulators in a nonlinear state-space framework. The goal-oriented movements of the tongue tip, tongue dorsum, upper lip, lower lip and jaw are described by a linear state equation. The resulting articulatory trajectories, combined with the effects of the velum and larynx, are mapped to acoustic features by a nonlinear observation equation. The model's input is a time-aligned phone sequence and its output is a speech waveform. The same speech production model can also be applied directly to speech recognition, where it better accounts for coarticulation and phonetic reduction phenomena with considerably fewer parameters than traditional HMM-based approaches.
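The abstract describes a linear, goal-directed state equation for the articulators and a nonlinear observation equation that maps them to acoustics. The sketch below illustrates that general state-space structure under stated assumptions. It uses a first-order target-directed update x[t+1] = Φx[t] + (I − Φ)T per articulator with a diagonal Φ, hypothetical phone targets in `TARGETS`, and a toy `tanh` mapping standing in for the observation equation. It omits noise, the velum and larynx effects, and waveform generation. It is not the authors' parameterization.

```python
import math
from typing import Dict, Sequence, Tuple

# Articulator order for this sketch, taken from the list in the abstract.
ARTICULATORS = ("tongue_tip", "tongue_dorsum", "upper_lip", "lower_lip", "jaw")

# Hypothetical per-phone articulatory targets (illustrative values only).
TARGETS: Dict[str, Tuple[float, ...]] = {
    "sil": (0.0, 0.0, 0.0, 0.0, 0.0),
    "aa": (0.2, -0.6, 0.1, -0.3, -0.8),
    "m": (0.0, 0.1, -0.5, 0.5, 0.1),
}


def state_step(x, target, phi):
    """Linear, target-directed state update applied per articulator:
    x' = phi * x + (1 - phi) * target (assumed diagonal dynamics)."""
    return [p * xi + (1.0 - p) * ti for xi, ti, p in zip(x, target, phi)]


def observe(x, weights):
    """Toy nonlinear observation: each acoustic feature is tanh of a
    weighted sum of articulator positions (a hypothetical stand-in)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def synthesize(phones: Sequence[Tuple[str, int]], phi, weights):
    """Run the state-space model over a time-aligned phone sequence,
    given as (phone, number_of_frames) pairs, starting from rest."""
    x = list(TARGETS["sil"])
    trajectory, features = [], []
    for phone, n_frames in phones:
        target = TARGETS[phone]
        for _ in range(n_frames):
            x = state_step(x, target, phi)
            trajectory.append(x)
            features.append(observe(x, weights))
    return trajectory, features
```

For example, `synthesize([("aa", 20), ("m", 20)], [0.8] * 5, [[1, 0, 0, 0, 0], [0, 1, 0, 0, 1]])` returns 40 frames of articulator positions that move smoothly toward each phone's target instead of jumping to it. This gradual approach is the mechanism the abstract credits with capturing coarticulation and reduction: a short phone may end before its target is reached.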
Published in: Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP)