Modeling Co-articulation in Text-to-Audio Visual Speech

Abstract:
This paper presents our approach to co-articulation in a text-to-audiovisual speech synthesizer (TTAVS), a system for converting input text into a video-realistic audio-visual sequence. It is an image-based system in which the face is modeled using a set of images of a human subject. Visual speech can be modeled by concatenating visemes, the lip shapes corresponding to phonemes. However, in actual speech production, although syllables and phonemes form a sequence of discrete speech units, their production overlaps. Due to this overlap, the boundaries between these discrete units are blurred, i.e., the vocal tract motions associated with producing one phonetic segment overlap the motions for producing the surrounding segments. This overlap is called co-articulation. The lack of parameterization in the image-based model makes it difficult to apply the co-articulation techniques used in 3D facial animation models. We introduce a method that uses polymorphing to incorporate co-articulation into our TTAVS. In addition, we apply temporal smoothing to viseme transitions to avoid jerky animation.