A Minimum Converted Trajectory Error (MCTE) Approach to High Quality Speech-to-Lips Conversion

Xiaodan Zhuang, Lijuan Wang, Frank Soong, and Mark Hasegawa-Johnson

Abstract

High-quality speech-to-lips conversion, investigated in this work, renders realistic lip movements (video) consistent with input speech (audio) without knowing its linguistic content. Instead of memoryless frame-based conversion, we adopt maximum likelihood estimation of the visual parameter trajectories using an audio-visual joint Gaussian Mixture Model (GMM). We propose a minimum converted trajectory error (MCTE) approach to further refine the converted visual parameters. First, we reduce the conversion error by training the joint audio-visual GMM with weighted audio and visual likelihoods. Then MCTE uses the generalized probabilistic descent algorithm to minimize a conversion error of the visual parameter trajectories defined over the optimal Gaussian kernel sequence selected according to the input speech. We demonstrate the effectiveness of the proposed methods on the LIPS 2009 Visual Speech Synthesis Challenge dataset, without knowing the linguistic (phonetic) content of the input speech.
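To make the trajectory-level ideas in the abstract concrete, below is a minimal numerical sketch (not the authors' implementation) of two ingredients: maximum-likelihood trajectory generation from per-frame Gaussian targets under static and delta constraints, and a GPD-style gradient refinement of the Gaussian means that minimizes the error of the converted trajectory itself rather than of the per-frame targets. The diagonal-covariance simplification, the delta-window definition, the learning rate, and all function names are assumptions made for illustration; training the joint audio-visual GMM and selecting the optimal Gaussian kernel sequence are omitted.

import numpy as np


def delta_window(T):
    """W maps a static trajectory y (T,) to stacked [static; delta]
    targets (2T,), with delta_t = 0.5 * (y_{t+1} - y_{t-1})."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, max(t - 1, 0)] -= 0.5
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    return W


def ml_trajectory(mu, var):
    """ML (MLPG-style) trajectory for one visual dimension.

    mu, var: (T, 2) per-frame means/variances of the static and delta
    features, read off the optimal Gaussian sequence. Solves
        (W' D^{-1} W) y = W' D^{-1} mu
    so the static trajectory agrees with both static and delta targets.
    """
    T = mu.shape[0]
    W = delta_window(T)
    d_inv = 1.0 / var.reshape(-1)
    A = W.T @ (d_inv[:, None] * W)
    b = W.T @ (d_inv * mu.reshape(-1))
    return np.linalg.solve(A, b)


def mcte_refine(mu, var, y_ref, lr=0.05, steps=50):
    """MCTE-flavoured sketch: descend on the converted-trajectory error.

    With diagonal covariances the generated trajectory is linear in the
    mean targets, y* = R mu with R = (W' D^{-1} W)^{-1} W' D^{-1}, so the
    squared-error gradient has the closed form 2 R' (y* - y_ref).
    """
    T = mu.shape[0]
    W = delta_window(T)
    d_inv = 1.0 / var.reshape(-1)
    A = W.T @ (d_inv[:, None] * W)
    R = np.linalg.solve(A, W.T * d_inv)          # (T, 2T)
    m = mu.reshape(-1).astype(float).copy()
    for _ in range(steps):
        err = R @ m - y_ref                      # converted-trajectory error
        m -= lr * 2.0 * R.T @ err                # GPD-style descent step
    return m.reshape(T, 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 100
    y_true = np.sin(np.linspace(0, 4 * np.pi, T))   # stand-in lip parameter
    mu = np.stack([y_true + 0.3 * rng.standard_normal(T),
                   np.gradient(y_true) + 0.1 * rng.standard_normal(T)],
                  axis=1)
    var = np.full((T, 2), 0.1)
    y0 = ml_trajectory(mu, var)
    mu_ref = mcte_refine(mu, var, y_true)
    y1 = ml_trajectory(mu_ref, var)
    print("trajectory error before refinement:", np.mean((y0 - y_true) ** 2))
    print("trajectory error after refinement: ", np.mean((y1 - y_true) ** 2))

In this toy setup the refinement step illustrates the paper's key distinction: the per-frame Gaussian targets are adjusted so that the trajectory actually generated from them (after the static/delta smoothing) moves toward the reference, which is not the same as fitting each frame's target independently.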

Details

Publication type: Inproceedings
Published in: INTERSPEECH 2010
Publisher: International Speech Communication Association