Zhi-Jie Yan, Yao Qian, and Frank K. Soong
6 September 2009
This paper presents a rich context modeling approach to high quality HMM-based speech synthesis. We first analyze the over-smoothing problem in conventional decision tree tying-based HMM, and then propose to model the training speech tokens with rich context models. Special training procedure is adopted for reliable estimation of the rich context model parameters. In synthesis, a search algorithm following a context-based pre-selection is performed to determine the optimal rich context model sequence which generates natural and crisp output speech. Experimental results show that spectral envelopes synthesized by the rich context models are with crisper formant structures and evolve with richer details than those obtained by the conventional models. The speech quality improvement is also perceived by listeners in a subjective preference test, in which 76% of the sentences synthesized using rich context modeling are preferred.
|Published in||10th Annual Conference of the International Speech Communication Association, InterSpeech 2009|
|Publisher||International Speech Communication Association|
© 2007 ISCA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the ISCA and/or the author.