Zhi-Jie Yan, Yao Qian, and Frank K. Soong
14 March 2010
This paper presents a Rich-context Unit Selection (RUS) approach to high-quality speech synthesis. Building on our previous work on rich context modeling, we use the corresponding parametric HMMs to represent waveform units and form a "sausage-like" lattice. We propose a prune-and-search procedure in which Kullback-Leibler divergence is used to select potential candidate units, and normalized cross-correlation serves as the final objective measure in the search for the optimal unit path. The maximum cross-correlation criterion provides the optimal concatenation between successive units in terms of spectral similarity, phase continuity, and the best time instants at which to join them. For subjective evaluation, both preference and MOS tests were conducted to compare RUS with our current Weight-table based Unit Selection (WUS) synthesis. Experimental results show that RUS significantly improves the voice quality of synthesized speech over the conventional WUS.
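As a rough illustration of the cross-correlation criterion mentioned above, the following Python sketch scores candidate join points between the end of one waveform unit and the start of the next, and keeps the lag with the highest normalized cross-correlation. The abstract does not give the actual procedure, so the function names (`normalized_cross_correlation`, `best_join_lag`), the fixed comparison window, and the exhaustive lag search are all assumptions made for this example, not the authors' implementation.

```python
import math


def normalized_cross_correlation(a, b):
    """Return the normalized cross-correlation of two equal-length sequences.

    The value lies in [-1, 1]; 0.0 is returned if either sequence has zero energy.
    """
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0


def best_join_lag(tail, head, window, max_lag):
    """Find the offset into `head` that best matches the end of `tail`.

    Hypothetical helper: compares the last `window` samples of `tail` with
    each `window`-long segment of `head` starting at lag 0..max_lag, and
    returns (lag, score) for the highest-scoring segment.
    """
    ref = tail[-window:]
    best_lag, best_score = 0, -1.0
    for lag in range(max_lag + 1):
        seg = head[lag:lag + window]
        if len(seg) < window:
            break  # not enough samples left in `head` for a full window
        score = normalized_cross_correlation(ref, seg)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


# Toy usage: two sinusoids with a 20-sample period, the second delayed by 5 samples.
tail = [math.sin(2 * math.pi * n / 20) for n in range(100)]
head = [math.sin(2 * math.pi * (n - 5) / 20) for n in range(60)]
lag, score = best_join_lag(tail, head, window=40, max_lag=10)
```

In this toy case the search picks the lag at which the second signal is in phase with the end of the first, which is the intuition behind using cross-correlation to favor phase-continuous joins.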
In IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, ICASSP 2010
© 2008 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. http://www.ieee.org/