Voice-Driven Talking Head

A New Language-Independent, Photo-Realistic Talking Head Driven by Voice Only

(Interspeech 2013 submission)

 

Xinjian Zhang, Lijuan Wang, Gang Li, Frank Seide, Frank K. Soong

Abstract

We propose a new photo-realistic talking head driven by voice only, i.e. no linguistic information about the voice input is needed. The core of the new talking head is a context-dependent, multi-layer Deep Neural Network (DNN), discriminatively trained over hundreds of hours of speaker-independent speech data. The trained DNN is then used to map acoustic speech input probabilistically to 9,000 tied "senone" states. For each photo-realistic talking head, an HMM-based lip-motion synthesizer is trained over the speaker's audio/visual training data, where states are statistically mapped to the corresponding lip images. At test time, for a given speech input, the DNN predicts the likely states via their posterior probabilities, and photo-realistic lip animation is then rendered through the DNN-predicted state lattice. The DNN, trained on English speaker-independent data, has also been tested with input in other languages, e.g. Mandarin and Spanish, to mimic the lip movements cross-lingually. Subjective experiments show that the lip motions thus rendered for 15 non-English languages are highly synchronized with the audio input and photo-realistic to human eyes.
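To make the two mapping steps in the abstract concrete, below is a minimal Python/NumPy sketch: a feed-forward DNN that turns acoustic feature frames into posteriors over the 9,000 tied senone states, followed by a lookup from states to lip images. All function names, layer sizes, the 39-dimensional features, and the ReLU nonlinearity are illustrative assumptions, and the per-frame argmax is a toy stand-in for the paper's HMM-based synthesis over the full DNN-predicted state lattice.

import numpy as np

def softmax(x):
    # Numerically stable softmax over the senone axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def senone_posteriors(features, weights, biases):
    # Forward pass of a multi-layer DNN: acoustic feature frames
    # (T x D) -> posteriors over tied senone states (T x 9000).
    h = features
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)  # hidden layers (ReLU assumed here)
    logits = h @ weights[-1] + biases[-1]
    return softmax(logits)

def render_lips(posteriors, senone_to_lip_image):
    # Pick a lip image per frame from the most probable senone.
    # The actual system renders through a state lattice with an
    # HMM-based synthesizer; this per-frame argmax is a simplification.
    states = posteriors.argmax(axis=-1)
    return [senone_to_lip_image[s] for s in states]

if __name__ == "__main__":
    # Toy demo with random parameters; the paper's exact topology is not given here.
    rng = np.random.default_rng(0)
    dims = [39, 512, 512, 9000]  # assumed feature dim and hidden sizes; 9,000 senones per the abstract
    weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims[:-1], dims[1:])]
    biases = [np.zeros(b) for b in dims[1:]]
    feats = rng.standard_normal((100, 39))  # 100 frames of assumed 39-dim features
    post = senone_posteriors(feats, weights, biases)
    print(post.shape)  # (100, 9000)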

Video Demo 1:

                                      example 1   example 2   example 3   example 4   example 5
the proposed method
(using DNN tied state decoding)       mp4         mp4         mp4         mp4         mp4
reference
(using ground truth phone sequence)   mp4         mp4         mp4         mp4         mp4

 

Video Demo 2:

             English(en-US)   Chinese(zh-CN)   Japanese(ja-JP)   Spanish(es-ES)   French(fr-FR)
example 1    mp4              mp4              mp4               mp4              mp4
example 2    mp4              mp4              mp4               mp4              mp4