Hsiao-Wuen Hon, Alex Acero, Xuedong Huang, J. Liu, and M. Plumpe
The Whistler Text-to-Speech engine was designed so that we can
automatically construct the model parameters from training data.
This paper describes in detail the design issues of constructing
the synthesis unit inventory automatically from speech databases.
The automatic process includes (1) determining the scalable
synthesis units that can reflect the spectral variations of different
allophones; (2) segmenting the recorded sentences into phonetic
segments; and (3) selecting good instances of each synthesis unit to
generate the best synthetic sentence at run time. These processes
are all derived using probabilistic learning methods
that are aimed at the same optimization criterion. Through this
automatic unit generation, Whistler can automatically produce
synthetic speech that sounds very natural and resembles the
acoustic characteristics of the original speaker.
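
Step (3) above, selecting good instances for each synthesis unit, can be illustrated with a minimal sketch. It assumes each candidate segment already carries a log-likelihood score under its unit's model; the abstract does not specify the scoring, so the score, the function name `select_instances`, the unit names, and the data are all hypothetical and not taken from Whistler itself.

```python
from collections import defaultdict
import heapq


def select_instances(candidates, k=2):
    """Keep the k highest-scoring instances for each synthesis unit.

    candidates: iterable of (unit_name, instance_id, log_likelihood).
    The log-likelihood score is an assumed stand-in for whatever
    probabilistic criterion the real system uses.
    """
    by_unit = defaultdict(list)
    for unit, instance_id, score in candidates:
        by_unit[unit].append((score, instance_id))
    # heapq.nlargest returns the top-k entries sorted best first.
    return {
        unit: [inst for _, inst in heapq.nlargest(k, scored)]
        for unit, scored in by_unit.items()
    }


# Hypothetical candidate segments: (unit, segment id, log-likelihood).
candidates = [
    ("ae_1", "utt03_seg12", -41.2),
    ("ae_1", "utt07_seg04", -38.5),
    ("ae_1", "utt11_seg20", -52.9),
    ("t_3", "utt02_seg08", -30.1),
    ("t_3", "utt09_seg15", -29.7),
]

inventory = select_instances(candidates, k=2)
```

Keeping several instances per unit, rather than only the single best, would leave the run-time synthesizer some choice when concatenating units; whether Whistler does this is not stated in the abstract.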
Published in: Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing