Honors & Awards
- The 3D Photo-Real talking head project won MSRA's “Demo of the Year” for 2011. It was also shown at Craig Mundie’s Techforum 2011, Techfest 2011 (including the public day), Exec Retreat 2011, and MGX 2011, and received wide press coverage (MSNBC, PCWorld, CNET, The Seattle Times, etc.).
- Dictionary Talking Head was selected as one of the 18 “tech transfers” (i.e., projects with significant product impact) that MSR highlighted for 2010 from its worldwide labs (reported by PCWorld).
- The Photo-Real talking head project ranked No. 1 in the audio-visual consistency test of the LIPS Challenge 2009, an international audio/visual lip-rendering contest held at the AVSP Workshop.
2D Photo-Real Talking Head
We propose an HMM trajectory-guided, real image sample concatenation approach to photo-real talking head synthesis. It renders a smooth and natural video of the articulators in sync with given speech signals. With as little as 20 minutes of audio/video footage from a speaker, the proposed system can synthesize a highly photo-real video. This system won first place in the audio-visual match contest of the LIPS2009 Challenge.
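The idea of letting a predicted trajectory guide the choice of real image samples can be sketched as a dynamic-programming search. The sketch below is illustrative only and is not the project's implementation: the function name `select_samples`, the Euclidean target and concatenation costs, and the `concat_weight` parameter are all assumptions, and each image sample is represented by a hypothetical low-dimensional feature vector.

```python
from math import dist  # Python 3.8+


def select_samples(trajectory, library, concat_weight=1.0):
    """Choose one library sample per frame with a Viterbi-style search.

    trajectory: guide feature vectors, one per output frame (assumed to come
                from an HMM-predicted visual trajectory).
    library:    feature vectors of the recorded image samples.
    Returns the list of chosen library indices, one per frame.

    Assumed costs: target cost = distance from sample to the guide frame,
    concatenation cost = distance between consecutive chosen samples.
    """
    n_frames, n_samples = len(trajectory), len(library)
    if n_frames == 0 or n_samples == 0:
        return []

    # best[j]: lowest total cost of any path that ends in sample j.
    best = [dist(trajectory[0], sample) for sample in library]
    backptr = []

    for t in range(1, n_frames):
        new_best, pointers = [], []
        for sample in library:
            # Cheapest predecessor, including the cost of joining it to `sample`.
            def join_cost(i):
                return best[i] + concat_weight * dist(library[i], sample)

            i_min = min(range(n_samples), key=join_cost)
            new_best.append(join_cost(i_min) + dist(trajectory[t], sample))
            pointers.append(i_min)
        best = new_best
        backptr.append(pointers)

    # Trace the cheapest path back from the last frame.
    j = min(range(n_samples), key=best.__getitem__)
    path = [j]
    for pointers in reversed(backptr):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path
```

With `concat_weight=0.0` the search reduces to picking the nearest sample for every frame; a larger weight trades closeness to the guide trajectory for smoother transitions between consecutive images.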
An HMM-based Singing and Talking Head
This demo shows a trainable, Hidden Markov Model (HMM)-based talking and singing head that can synthesize speech from a given text, or a singing voice from given lyrics and a music score (melody).
In training, audio/visual features along with the corresponding scripts (text, or lyrics and melody) are used to train statistical HMMs, in which the key features of basic audio/visual components and their dynamics are captured and parameterized statistically. In speech synthesis, a given text is first analyzed and decomposed into a sequence of phonemes along with their corresponding durations and f0 prosody. The speech parameter trajectories generated in this way are then used to synthesize the final speech waveform. In singing voice synthesis, the lyrics and melody of a given song are used to determine the pitch trajectory and phoneme durations, and this information drives the trained HMMs to synthesize a singing voice.
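As a rough illustration of how a melody can determine a pitch trajectory, the sketch below converts a note list into a frame-level f0 contour. It is an assumption-laden example rather than the system's front end: the score format (MIDI note number, length in beats), the 5 ms frame shift, the default tempo, and the function names are all hypothetical. Only the equal-temperament conversion with A4 = 440 Hz is standard.

```python
def midi_to_hz(note):
    """Equal-temperament pitch of a MIDI note number (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def score_to_f0(score, tempo_bpm=120.0, frame_shift_s=0.005):
    """Expand a melody into a piecewise-constant f0 contour.

    score: list of (midi_note, beats); use None as the note for a rest.
    Returns one f0 value in Hz per frame, with 0.0 marking unvoiced rests.
    """
    seconds_per_beat = 60.0 / tempo_bpm
    f0 = []
    for note, beats in score:
        n_frames = round(beats * seconds_per_beat / frame_shift_s)
        value = 0.0 if note is None else midi_to_hz(note)
        f0.extend([value] * n_frames)
    return f0
```

A step-shaped contour like this is only a starting point; a practical system would also smooth note transitions and align phoneme durations within each note.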
Since the HMMs are trained with a person's speech or a singer's voice data, personalized speech or singing voice can be optimally reproduced in the maximum likelihood sense. Head motions and synchronized lip movements can be synthesized automatically from the corresponding prosodic cues and viseme sequence, and they can also be modified interactively.
- Demo 5: "You and Me" by cartoon talking head (girl).
- Demo 6: "You and Me" by cartoon talking head (boy).
- Demo 7: Self-introduction by cartoon talking head (boy).
Computerized Audio-Visual Language Learning
For foreign language users, learning correct pronunciation is considered by many to be one of the most arduous tasks if one does not have access to a personal tutor. The reason is that the most common method for learning pronunciation, listening to and repeating audio tapes, falls short in two important respects: completeness and engagement. It is incomplete in that audio data alone does not show users how to move their mouth and lips to sound out phonemes that may not exist in their mother tongue. Audio alone is also less motivating and less personalized for learners and, as supported by studies in Cognitive Informatics, humans process information more efficiently when it is presented both aurally and visually.
The ambition is to create a visualized language teacher that can be engaged in many aspects of language learning, from detailed pronunciation training to conversational practice. An initial implementation is a photo-realistic talking head for pronunciation training that demonstrates highly precise lip-sync animation for arbitrary text input. ESL users can thus watch synthesized videos to learn how the mouth moves in sync with speech for many sample sentences on Bing Dictionary (Engkoo).
A live demo can be found on Bing Dictionary (http://dict.bing.com.cn).
3D Photo-Real Talking Head
We propose a new 3D photo-real talking head with a personalized, photo-realistic appearance. Different head motions and facial expressions can be freely controlled and rendered. It extends our prior high-quality 2D photo-real talking head to 3D.
About 20 minutes of 2D audio-visual video is first recorded of a speaker reading prompted sentences. We use a 2D-to-3D reconstruction algorithm to automatically wrap the 3D geometric mesh with 2D frames to construct a training database. In training, super feature vectors consisting of 3D geometry, texture and speech are formed to train a statistical, multi-streamed Hidden Markov Model (HMM). The HMM is then used to synthesize the trajectories of both the geometry animation and the dynamic texture. The 3D talking head animation is controlled by the rendered geometric trajectory, while the facial expressions and articulator movements are rendered with the dynamic 2D image sequences. Head motions and facial expressions can also be controlled separately by manipulating the corresponding parameters. The new 3D talking head has many useful applications, such as voice agents, tele-presence, gaming, and speech-to-speech translation.
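To make the notion of a super feature vector concrete, the following sketch stacks per-frame geometry, texture and speech features into one vector per frame and records where each stream sits. The function name, the dictionary-based input, and the toy feature dimensions are assumptions for illustration; the actual features and their dimensionality are not specified here.

```python
def build_super_vectors(streams):
    """Concatenate per-frame stream features into super feature vectors.

    streams: dict mapping a stream name to its list of per-frame feature
             vectors; all streams must cover the same number of frames.
    Returns (vectors, slices): one concatenated vector per frame, plus a
    dict mapping each stream name to its slice inside the super vector.
    """
    lengths = {len(frames) for frames in streams.values()}
    if len(lengths) != 1:
        raise ValueError("all streams must have the same number of frames")
    n_frames = lengths.pop()

    # Record the position of each stream so it can be modeled or
    # manipulated separately later on.
    slices, start = {}, 0
    for name, frames in streams.items():
        dim = len(frames[0]) if n_frames else 0
        slices[name] = slice(start, start + dim)
        start += dim

    vectors = [
        [value for frames in streams.values() for value in frames[t]]
        for t in range(n_frames)
    ]
    return vectors, slices
```

Keeping the stream boundaries alongside the vectors is one simple way to let a multi-stream model weight, or a user edit, geometry and texture independently.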
- CNET News: Microsoft demos 3D photo avatars, display tech
- MSNBC Homepage Story: Realistic 3-D talking head made from 2-D video
- The Seattle Times: TechFest: Animating a 3-D photo avatar
A Multi-lingual, 3D Photo-realistic Talking Head
Speaking a foreign language fluently without ever attending a traditional or self-paced language course seems incredible, if not impossible. In this demo, we create a talking head that can speak foreign languages. We use Chinese (the language to be learned) and English (the native language) as the language pair to demonstrate this technology: authentic Chinese is spoken lip-synchronously by an English speaker’s talking head in the original speaker’s voice. The talking head and the corresponding Mandarin TTS are trained with the English speaker’s audio/video recordings. Two advanced technologies, the 3D photo-realistic talking head and cross-lingual TTS (Text-to-Speech) synthesis, are combined seamlessly. The Mandarin Chinese TTS was trained with 1 hour of the speaker’s English data. The synthesized Chinese is then lip-synced with the English speaker’s 3D photo-realistic talking head by matching corresponding inter-language lip articulations between the English speaker and a reference Chinese speaker. We predict the trajectories of the talking head with a statistically trained Hidden Markov Model (HMM) and render natural facial expressions and lip movements time-synchronously with the corresponding speech. The prototype is useful for applications such as speech-to-speech translation, voice agents, gaming, tele-presence, and computer-assisted language learning.
A New Language Independent, Photo-realistic Talking Head Driven by Voice Only
We present a high-fidelity, speech-to-lips conversion talking head that requires no linguistic knowledge of the input speech. A context-dependent, multi-layer Deep Neural Network (DNN) is first trained with the error back-propagation procedure on thousands of hours of speaker-independent data. A highly discriminative mapping between the acoustic speech input and 9k tied states is thus established. In addition, an HMM-based lip motion synthesizer is trained on a speaker’s audio/visual data, in which each state is statistically mapped to its corresponding lip images. At test time, for a given speech input, the DNN predicts likely states in terms of their posterior probabilities. Photorealistic lip animation is then rendered through the DNN-predicted state lattice with the HMM lip motion synthesizer. In addition to being speaker independent, the DNN can also be trained to be language independent for corresponding gaming or telepresence applications.
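One way to picture the step from frame-level state posteriors to a state sequence is a Viterbi pass over the posteriors. The sketch below is a simplified stand-in, not the system described above: it assumes the posteriors are already available as plain lists, replaces the HMM transition model with a single hypothetical `switch_penalty` charged for every state change, and leaves the mapping from states to lip images out entirely.

```python
from math import log


def decode_states(posteriors, switch_penalty=1.0):
    """Find the state path maximizing summed log posteriors minus switch costs.

    posteriors: per-frame lists of state posterior probabilities
                (assumed to be the output of a DNN).
    switch_penalty: non-negative cost for changing state between frames.
    Returns one state index per frame.
    """
    if not posteriors:
        return []
    n_states = len(posteriors[0])
    floor = 1e-12  # avoids log(0) for states with zero posterior

    score = [log(max(p, floor)) for p in posteriors[0]]
    backptr = []
    for frame in posteriors[1:]:
        # With a uniform penalty, the best predecessor of state s is either
        # s itself or the globally best previous state.
        best_prev = max(range(n_states), key=score.__getitem__)
        new_score, pointers = [], []
        for s in range(n_states):
            stay = score[s]
            jump = score[best_prev] - switch_penalty
            prev, base = (s, stay) if stay >= jump else (best_prev, jump)
            new_score.append(base + log(max(frame[s], floor)))
            pointers.append(prev)
        score = new_score
        backptr.append(pointers)

    # Trace the best path back from the final frame.
    s = max(range(n_states), key=score.__getitem__)
    path = [s]
    for pointers in reversed(backptr):
        s = pointers[s]
        path.append(s)
    path.reverse()
    return path
```

Because the penalty is uniform, each frame costs time linear in the number of states, which matters when the state inventory is in the thousands. The decoded states could then index into a per-state table of lip images or visual parameters.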
In this work, we turn our high-quality, 3D photo-realistic talking head into a talking robot. Instead of displaying the 3D talking head on a flat-screen display, our new 3D physical robot has the 2D rendered image sequence projected onto its plastic face. The 3D talking robot has photo-realistic facial animation that is lip-synced with the corresponding audio speech signals. The system consists of three components: a plastic face mask for the robot, a mini-projector that back-projects the rendered video images onto the plastic mask, and a laptop computer that renders high-quality audio/video for any given text input. The technology can drive different robots for many natural and user-friendly applications.
More to come ...
- Lijuan Wang, Wei Han, Frank Soong, and Qiang Huo, Text-driven 3D Photo-Realistic Talking Head, in INTERSPEECH 2011, International Speech Communication Association, September 2011
- King Keung Wu, Lijuan Wang, Frank Soong, and Yeung Yam, A Sparse and Low-Rank Approach to Efficient Face Alignment for Photo-Real Talking Head Synthesis, in ICASSP 2011, IEEE, 22 May 2011
- Lijuan Wang, Yi-Jian Wu, Xiaodan Zhuang, and Frank Soong, Synthesizing Visual Speech Trajectory with Minimum Generation Error, in ICASSP 2011, IEEE, 22 May 2011
- Lijuan Wang, Wei Han, Xiaojun Qian, and Frank Soong, Photo-Real Lips Synthesis with Trajectory-Guided Sample Selection, in Speech Synthesis Workshop (SSW7), International Speech Communication Association, 27 September 2010
- Xiaodan Zhuang, Lijuan Wang, Frank Soong, and Mark Hasegawa-Johnson, A Minimum Converted Trajectory Error (MCTE) Approach to High Quality Speech-to-Lips Conversion, in INTERSPEECH 2010, International Speech Communication Association, 22 September 2010
- Lijuan Wang, Wei Han, Xiaojun Qian, and Frank Soong, Synthesizing Photo-Real Talking Head via Trajectory-Guided Sample Selection, in INTERSPEECH 2010, International Speech Communication Association, 22 September 2010
- Lijuan Wang, Shenghao Qin, and Frank Soong, Auto-Checking Speech Transcriptions by Multiple Template Constrained, in INTERSPEECH 2009, International Speech Communication Association, September 2009
- Lijuan Wang, Xiaojun Qian, Lei Ma, and Frank Soong, A Real-Time Text to Audio-Visual Speech Synthesis System, in INTERSPEECH 2008, International Speech Communication Association, 15 September 2008
- Lijuan Wang, Tao Hu, Peng Liu, and Frank Soong, Efficient Handwriting Correction of Speech Recognition Errors with Template Constrained Posterior (TCP), in INTERSPEECH 2008, International Speech Communication Association, September 2008
- Lijuan Wang, Tao Hu, and Frank Soong, Template Constrained Posterior for Verifying Phone Transcriptions, in ICASSP 2008, IEEE, April 2008