Li Deng, Dong Yu, and Alex Acero
A quantitative model of coarticulation is presented that accurately predicts formant dynamics in fluent speech using prior information about the resonance targets in the phone sequence, in the absence of actual acoustic data. Realistic formant undershoot (reduction) and “static” sound confusion are produced naturally by the model for fast-rate speech in a contextually assimilated manner. The model is capable of resolving this confusion through dynamic speech specification. As a source of a priori knowledge about speech structure, the model is a central component of our Bayesian generative modeling approach to automatic recognition of conversational speech, where varying degrees of sound reduction abound due to freely varying speaking style and rate. We present details of the model simulation that demonstrate the quantitative effects of speaking rate and segment duration on the magnitude of reduction, agreeing closely with experimental measurements in the acoustic-phonetic literature. The model simulation also shows the quantitative effects of varying the “stiffness” parameter in the model.
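The abstract gives no equations, so the following is only an illustrative sketch of the kind of behavior it describes, not the authors' model. It assumes that a formant trajectory can be approximated by smoothing a piecewise-constant sequence of phone targets with a symmetric kernel `gamma ** abs(k)`, where `gamma` stands in for the stiffness parameter. The function names, kernel form, target values, and durations are all hypothetical.

```python
def smooth_targets(targets, durations, gamma=0.9, half_width=40):
    """Smooth a step-wise target track with a symmetric kernel gamma**|k|.

    Assumed form for illustration only; not taken from the paper.
    """
    # Expand per-phone targets into one target value per frame.
    track = [t for t, d in zip(targets, durations) for _ in range(d)]
    n = len(track)
    out = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for k in range(-half_width, half_width + 1):
            j = i + k
            if 0 <= j < n:
                w = gamma ** abs(k)
                num += w * track[j]
                den += w
        # Normalized weighted average of past and future targets.
        out.append(num / den)
    return out


def undershoot(targets, durations, gamma):
    """Gap between the middle phone's target and the peak actually reached."""
    traj = smooth_targets(targets, durations, gamma)
    start = durations[0]
    peak = max(traj[start:start + durations[1]])
    return targets[1] - peak


# Hypothetical F2 targets in Hz for a low-high-low phone sequence.
f2_targets = [1000.0, 2200.0, 1000.0]
slow = undershoot(f2_targets, [20, 20, 20], gamma=0.9)  # long middle segment
fast = undershoot(f2_targets, [20, 5, 20], gamma=0.9)   # short middle segment
print(f"undershoot, slow: {slow:.0f} Hz; fast: {fast:.0f} Hz")
```

In this sketch, shortening the middle segment or moving `gamma` closer to 1 both increase the undershoot, which is the qualitative dependence on segment duration and stiffness that the abstract refers to.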
Published in: Proc. Int. Conf. on Spoken Language Processing
Publisher: International Speech Communication Association
© 2007 ISCA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the ISCA and/or the author.