Yi-Ning Chen, Zhi-Jie Yan, and Frank K. Soong
26 September 2010
In HMM-based TTS, statistical models of static, velocity (delta), and acceleration (delta-delta) parameters are jointly trained in a unified, ML-based framework. A previous study has shown that the acceleration parameters help generate smoother trajectories with less distortion, but the effect has never been investigated in formal objective and subjective tests. In this paper, the effect of the acceleration parameters, in addition to their static and velocity counterparts, in trajectory generation is studied in depth. We show that discarding the acceleration parameters introduces only a small additional distortion compared to the reference generated with the full model parameters. However, human subjects can easily perceive the resulting degradation in voice quality, because saw-tooth-like trajectories are commonly generated. Several methods to alleviate the discontinuity are discussed, and we choose the upper- and lower-bounded envelopes of the saw-tooth trajectories for further analysis. Experimental results show that both envelope trajectories have larger objective distortions than the saw-tooth ones. However, the speech synthesized using the envelope trajectory becomes perceptually transparent to the reference. This study, in addition to its subjective and objective significance in measuring the distortion of the synthesized speech, facilitates efficient implementation of low-cost TTS systems, as well as low-bit-rate speech coding and reconstruction.
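The trajectory generation step mentioned in the abstract can be illustrated with a minimal sketch of ML parameter generation for a single feature dimension. Everything specific here is an assumption and is not taken from the abstract: the window coefficients (a central-difference delta and a second-difference acceleration window, a common choice in HMM-based TTS toolkits), diagonal variances, truncated windows at the utterance edges, and the hypothetical names `build_window_matrix` and `generate_trajectory`. Under these assumptions, dropping the acceleration window leaves even and odd frames linked only to frames of the same parity, which is one plausible reading of why saw-tooth-like trajectories appear.

```python
import numpy as np

# Assumed window definitions as (frame offsets, coefficients).
STATIC = ((0,), (1.0,))
DELTA = ((-1, 0, 1), (-0.5, 0.0, 0.5))   # central difference
ACCEL = ((-1, 0, 1), (1.0, -2.0, 1.0))   # second difference


def build_window_matrix(n_frames, windows):
    """Stack one (n_frames x n_frames) banded block per window."""
    blocks = []
    for offsets, coeffs in windows:
        block = np.zeros((n_frames, n_frames))
        for t in range(n_frames):
            for off, coef in zip(offsets, coeffs):
                tau = t + off
                # Coefficients that fall outside the utterance are dropped.
                if 0 <= tau < n_frames:
                    block[t, tau] = coef
        blocks.append(block)
    return np.vstack(blocks)


def generate_trajectory(means, variances, windows):
    """Solve (W^T P W) c = W^T P mu for the static trajectory c.

    means, variances: arrays of shape (n_windows, n_frames), one row per
    window, in the same order as `windows`. P is the diagonal precision.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    n_frames = means.shape[1]
    W = build_window_matrix(n_frames, windows)
    mu = means.reshape(-1)               # static frames, then delta, ...
    prec = 1.0 / variances.reshape(-1)
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

Calling `generate_trajectory` with `[STATIC, DELTA, ACCEL]` corresponds to the full-parameter reference case, while `[STATIC, DELTA]` corresponds to discarding the acceleration parameters. In the second case, changing the static mean of one even frame moves only the even frames of the output, so the two interleaved subsequences can drift apart.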
In 11th Annual Conference of the International Speech Communication Association, InterSpeech 2010
Publisher: International Speech Communication Association