This paper describes a unifying framework for formant tracking and speech synthesis using Hidden Markov Models (HMMs). The feature vector of the HMM is composed of the first three formant frequencies, their bandwidths, and their deltas (changes over time). Speech is synthesized by generating the most likely sequence of feature vectors from an HMM trained on a set of sentences from a given speaker. Higher formant tracking accuracy can be achieved by finding the most likely formant track given a distribution of the formants of every sound. This data-driven formant synthesizer bridges the gap between rule-based formant synthesizers and concatenative synthesizers by synthesizing speech that is smooth and resembles the speaker in the training data.
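As an illustration of what "generating the most likely sequence of feature vectors" can mean when the features include deltas, the sketch below solves a simplified version of the problem. It is not the paper's implementation, and it rests on several assumptions: a single scalar feature (for example, one formant frequency), an already fixed state sequence so that each frame has a known Gaussian mean and variance, diagonal covariances, and a delta defined as the first difference between consecutive frames. The function name `most_likely_track` and its parameters are hypothetical. Under these assumptions the most likely track is the solution of a tridiagonal linear system, solved here by forward elimination and back substitution.

```python
def most_likely_track(static_mean, static_var, delta_mean, delta_var):
    """Return the track x maximizing the Gaussian likelihood of both
    x[t] (static) and x[t] - x[t-1] (delta, assumed first difference).

    All arguments are per-frame lists of equal length; delta_mean[0] and
    delta_var[0] are ignored because frame 0 has no predecessor.
    """
    n = len(static_mean)
    # Normal equations A x = b with A symmetric and tridiagonal.
    diag = [1.0 / static_var[t] for t in range(n)]
    rhs = [static_mean[t] / static_var[t] for t in range(n)]
    off = [0.0] * n  # off[t] couples frames t-1 and t; off[0] is unused
    for t in range(1, n):
        w = 1.0 / delta_var[t]  # precision of the delta at frame t
        diag[t] += w
        diag[t - 1] += w
        off[t] = -w
        rhs[t] += w * delta_mean[t]
        rhs[t - 1] -= w * delta_mean[t]

    # Forward elimination (Thomas algorithm for tridiagonal systems).
    for t in range(1, n):
        factor = off[t] / diag[t - 1]
        diag[t] -= factor * off[t]
        rhs[t] -= factor * rhs[t - 1]

    # Back substitution.
    track = [0.0] * n
    track[n - 1] = rhs[n - 1] / diag[n - 1]
    for t in range(n - 2, -1, -1):
        track[t] = (rhs[t] - off[t + 1] * track[t + 1]) / diag[t]
    return track


# Hypothetical example: a step in the static means (in Hz) with zero-mean
# deltas yields a track that moves gradually between the two targets.
step_means = [500.0] * 3 + [700.0] * 3
smooth = most_likely_track(step_means, [100.0] * 6, [0.0] * 6, [100.0] * 6)
```

In this simplified setting, the delta terms are what make the generated track smooth: without them the solution would simply follow the per-frame static means, jumps included.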
Published in: Proc. of the Eurospeech Conference