Yao Qian, Zhi-Jie Yan, Yi-Jian Wu, Frank K. Soong, Xin Zhuang, and Shengyi Kong
26 September 2010
The current state-of-the-art HMM-based speech synthesis can produce highly intelligible speech, but it still carries an intrinsic vocoding flavor due to its simple excitation model. In this paper, we propose a new HMM Trajectory Tiling (HTT) approach to high-quality TTS. The HMM is first improved with minimum generation error (MGE) training. The trajectory generated by the refined HMM is then used to guide the search for the closest waveform segment "tiles" for rendering highly intelligible and natural-sounding speech. Normalized distances between the HMM trajectory and those of waveform unit candidates are used to construct a unit sausage (lattice). Normalized cross-correlation, a good concatenation measure because of its high relevance to spectral similarity, phase continuity, and concatenation time instants, is used to find the best unit sequence in the sausage. This sequence serves as the best segment tiles to closely track the HMM trajectory guide. Tested on the two (small and large) British English databases used in Blizzard Challenge 2010, our HTT approach can render natural-sounding speech without sacrificing the high intelligibility achieved by HMM-based TTS. These results are confirmed subjectively by the corresponding AB preference and intelligibility tests.
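The sausage search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `window` and `max_lag` parameters, the use of `1 - NCC` as a concatenation cost, and the simple additive combination with the normalized target distance are all assumptions made for illustration. Each slot of the sausage holds candidate units as `(target_distance, samples)` pairs, and a Viterbi pass picks the lowest-cost path.

```python
import math

def normalized_cross_correlation(a, b):
    """NCC of two equal-length sample sequences, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def best_lag(tail, head, window, max_lag):
    """Slide the start of `head` against the last `window` samples of `tail`.

    Returns (lag, ncc) for the lag with the highest NCC (hypothetical helper).
    """
    ref = tail[-window:]
    best = (0, -1.0)
    for lag in range(max_lag + 1):
        seg = head[lag:lag + window]
        if len(seg) < len(ref):
            break  # not enough samples left in the candidate
        score = normalized_cross_correlation(ref, seg)
        if score > best[1]:
            best = (lag, score)
    return best

def search_sausage(sausage, window=4, max_lag=2):
    """Viterbi search over a unit sausage.

    sausage: list of slots; each slot is a list of (target_distance, samples).
    Assumed cost: target_distance + (1 - NCC at the best lag) per join.
    Returns the index of the chosen candidate in each slot.
    """
    cost = [d for d, _ in sausage[0]]  # best cumulative cost per candidate
    back = []                          # back-pointers, one list per later slot
    for t in range(1, len(sausage)):
        new_cost, pointers = [], []
        for d, samples in sausage[t]:
            options = []
            for i, (_, prev_samples) in enumerate(sausage[t - 1]):
                _, ncc = best_lag(prev_samples, samples, window, max_lag)
                options.append((cost[i] + (1.0 - ncc), i))
            c, i = min(options)
            new_cost.append(c + d)
            pointers.append(i)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path
```

With equal target distances, the candidate whose opening samples correlate best with the end of the previous unit is selected. The returned lag from `best_lag` could also serve as the concatenation time instant, although this sketch uses only the correlation value.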
|Published in||11th Annual Conference of the International Speech Communication Association, InterSpeech 2010|
|Publisher||International Speech Communication Association|