Modeling Co-articulation in Text-to-Audio Visual Speech
Abstract:
This paper presents our approach to co-articulation for a
text-to-audiovisual speech synthesizer (TTAVS), a system that converts
input text into a video-realistic audio-visual sequence. It is an
image-based system, where the face is modeled using a set of images of a
human subject. A concatenation of visemes (the lip shapes corresponding
to phonemes) can be used to model visual speech. However, in actual
speech production, the articulation of successive syllables and phonemes
overlaps, even though they are treated as a sequence of discrete units of
speech. Due to this overlap, the boundaries between these discrete speech
units are blurred, i.e., the vocal tract motions associated with producing
one phonetic segment overlap the motions for producing surrounding
phonetic segments. This overlap is called co-articulation. The lack of
parameterization in the image-based
model makes it difficult to use the techniques employed in 3D facial
animation models for co-articulation. We introduce a method using
polymorphing to incorporate co-articulation in our TTAVS. Further, we add
temporal smoothing for viseme transitions to avoid jerky animation.
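
As a rough illustration of the idea only, and not the method of this
paper, the sketch below computes per-frame blending weights for several
viseme images at once, so that neighboring visemes influence the current
frame (a simple stand-in for co-articulation) and the weights change
smoothly over time (a simple stand-in for temporal smoothing). The names
viseme_weights, centers, and spread, and the use of a smoothstep falloff,
are all assumptions introduced for this example.

```python
def smoothstep(t):
    """Ease-in/ease-out ramp on [0, 1] with zero slope at both ends."""
    t = min(1.0, max(0.0, t))
    return t * t * (3.0 - 2.0 * t)


def viseme_weights(centers, spread, t):
    """Return convex blending weights (summing to 1) for each viseme at time t.

    centers: hypothetical times (seconds) at which each viseme is fully formed
    spread:  hypothetical half-width (seconds) of each viseme's influence
    """
    raw = []
    for c in centers:
        d = abs(t - c) / spread          # 0 at the center, >= 1 outside the window
        raw.append(1.0 - smoothstep(d))  # smooth falloff to 0 at the window edge
    total = sum(raw)
    if total == 0.0:
        # t lies outside every window: fall back to the nearest viseme alone.
        nearest = min(range(len(centers)), key=lambda i: abs(t - centers[i]))
        return [1.0 if i == nearest else 0.0 for i in range(len(centers))]
    return [w / total for w in raw]


# Three visemes centered 100 ms apart, each influencing 150 ms on either side.
centers = [0.0, 0.1, 0.2]
weights_mid = viseme_weights(centers, 0.15, 0.1)
weights_start = viseme_weights(centers, 0.15, 0.0)
weights_far = viseme_weights(centers, 0.15, 1.0)
```

In an image-based system, weights of this kind could drive a blend of
several morphed viseme images per frame; because each window is wider than
the spacing between visemes, the middle frame still carries a contribution
from both neighbors rather than showing the isolated lip shape.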