Xiandan Zhuang, Lijuan Wang, Frank Soong, and Mark Hasegawa-Johnson
22 September 2010
High-quality speech-to-lips conversion, investigated in this work, renders
realistic lip movement (video) consistent with input speech (audio)
without knowing its linguistic content. Instead of memoryless frame-based
conversion, we adopt maximum likelihood estimation of the visual
parameter trajectories using an audio-visual joint Gaussian Mixture
Model (GMM). We propose a minimum converted trajectory error (MCTE)
approach to further refine the converted visual parameters. First,
we reduce the conversion error by training the joint audio-visual GMM
with weighted audio and visual likelihood. Then MCTE uses the generalized
probabilistic descent algorithm to minimize a conversion error
of the visual parameter trajectories defined on the optimal Gaussian kernel
sequence according to the input speech. We demonstrate the effectiveness
of the proposed methods using the LIPS 2009 Visual Speech
Synthesis Challenge dataset, without knowing the linguistic (phonetic)
content of the input speech.
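
For orientation, the memoryless frame-based conversion that the abstract contrasts against can be sketched as a per-frame conditional expectation under a joint audio-visual GMM. This is only an illustrative sketch, not the paper's trajectory-based or MCTE-refined method: it assumes one-dimensional audio and visual features, and the names `convert_frame`, `convert_sequence`, and the component fields (`weight`, `mean_a`, `mean_v`, `var_a`, `cov_av`) are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert_frame(x, components):
    """Map one audio frame x to a visual parameter (illustrative only).

    Each component is a dict describing one joint Gaussian over (audio, visual):
    weight, mean_a, mean_v, var_a (audio variance), cov_av (audio-visual covariance).
    All field names are assumptions made for this sketch.
    """
    # Posterior probability of each mixture component given the audio frame alone.
    likes = [c["weight"] * gaussian_pdf(x, c["mean_a"], c["var_a"]) for c in components]
    total = sum(likes)
    posteriors = [like / total for like in likes]

    # Per-component conditional mean of the visual parameter given the audio frame.
    cond_means = [
        c["mean_v"] + c["cov_av"] / c["var_a"] * (x - c["mean_a"]) for c in components
    ]

    # Posterior-weighted average of the conditional means.
    return sum(p * m for p, m in zip(posteriors, cond_means))


def convert_sequence(audio_frames, components):
    """Convert every frame independently; no trajectory smoothing is applied."""
    return [convert_frame(x, components) for x in audio_frames]
```

Because each frame is converted in isolation, nothing ties neighbouring output frames together, which is the limitation that trajectory-level estimation is meant to address.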
In INTERSPEECH 2010
Publisher International Speech Communication Association
Type Inproceedings