VTalk: A System for Generating Text-to-Audio-Visual Speech
- Prem Kalra
- Ashish Kapoor
- Udit Kumar Goyal
IETE Technical Review, Vol 18(4), pp. 307-314
This paper describes VTalk, a system for synthesizing text-to-audiovisual speech (TTAVS), in which the input text is converted into an audiovisual speech stream incorporating head and eye movements. It is an image-based system, where the face is modeled using a set of images of a human subject. Visual speech is modeled by concatenating visemes, the lip shapes corresponding to phonemes. Smooth transitions between visemes are achieved by morphing along viseme correspondences obtained from optical flow. The phonemes and timing parameters given by the text-to-speech synthesizer determine the visemes to be used for synthesizing the visual stream. We provide a method using polymorphing to incorporate co-articulation during speech in our TTAVS. We also include nonverbal mechanisms of visual speech communication, such as eye blinks and head nods, which make the talking head more lifelike. A simple mask-based approach is employed for eye movement, and view morphing is used to generate the intermediate images for head movement. All these features are integrated into a single system that takes text, head and eye movement parameters as input and produces the complete audiovisual stream.
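The following is a minimal, hypothetical sketch (not the authors' code) of the scheduling step the abstract describes: phoneme labels and timings from a TTS front end are mapped to visemes, and each video frame receives a source viseme, a target viseme, and a blend weight that would drive the image morph. The phoneme-to-viseme table, the `Phone` record, and the linear transition over each phone's duration are illustrative assumptions.

```python
# Hypothetical sketch: turning TTS phoneme timings into per-frame viseme
# morph parameters. Names and the linear blend schedule are assumptions.

from dataclasses import dataclass

# Tiny phoneme-to-viseme table; a real system would cover the full phone set.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "iy": "spread", "uw": "rounded",
    "sil": "neutral",
}

@dataclass
class Phone:
    label: str    # phoneme symbol from the TTS front end
    start: float  # start time in seconds
    end: float    # end time in seconds

def viseme_track(phones, frame_rate=25.0):
    """Return (source viseme, target viseme, blend weight) for each video frame.

    The weight would drive an image morph between the two viseme key images;
    here the transition is simply linear over each phone's duration.
    """
    frames = []
    n_frames = int(phones[-1].end * frame_rate)
    for i in range(n_frames):
        t = i / frame_rate
        for k, ph in enumerate(phones):
            if ph.start <= t < ph.end:
                cur = PHONEME_TO_VISEME.get(ph.label, "neutral")
                nxt_label = phones[k + 1].label if k + 1 < len(phones) else "sil"
                nxt = PHONEME_TO_VISEME.get(nxt_label, "neutral")
                w = (t - ph.start) / (ph.end - ph.start)  # 0 at phone onset, 1 at its end
                frames.append((cur, nxt, w))
                break
    return frames

if __name__ == "__main__":
    phones = [Phone("sil", 0.0, 0.1), Phone("m", 0.1, 0.25), Phone("aa", 0.25, 0.5)]
    for frame in viseme_track(phones)[:5]:
        print(frame)
```

In the paper's pipeline, each (source, target, weight) triple would select two viseme key images and a morph weight; effects such as co-articulation (via polymorphing), eye blinks, and head nods would then be layered on top of this basic viseme stream.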