An HMM Trajectory Tiling (HTT) Approach to High Quality TTS

We propose a new HMM Trajectory Tiling (HTT) approach to high-quality TTS. The HMM is first refined with minimum generation error (MGE) training. The trajectory generated by the refined HMM then guides the search for the closest waveform segment “tiles”, which render highly intelligible and natural-sounding speech. Normalized distances between the HMM trajectory and the trajectories of the candidate waveform units are used to construct a unit sausage (lattice). Normalized cross-correlation, a good concatenation measure because of its high relevance to spectral similarity, phase continuity, and concatenation time instants, is used to find the best unit sequence in the sausage. The selected sequence supplies the segment tiles that most closely track the HMM trajectory guide.
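The search through the sausage can be illustrated with a small sketch. This is not the authors' implementation; it only shows one plausible way to combine normalized cross-correlation with a dynamic-programming search. It assumes the sausage has already been built, that each candidate unit is a plain list of waveform samples, that the join score is the zero-lag NCC between the last `window` samples of one unit and the first `window` samples of the next, and that the best sequence maximizes the sum of join scores. The names `normalized_cross_correlation`, `best_unit_sequence`, and `window` are hypothetical.

```python
import math
from typing import List, Sequence

Waveform = Sequence[float]


def normalized_cross_correlation(a: Waveform, b: Waveform) -> float:
    """Zero-lag NCC of two equal-length sample windows, in [-1, 1]."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("windows must be non-empty and of equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # Assumption: a silent (all-zero) window gets a neutral score of 0.
    return dot / norm if norm > 0.0 else 0.0


def best_unit_sequence(sausage: List[List[Waveform]], window: int) -> List[int]:
    """Pick one candidate index per slot, maximizing the summed join NCC.

    sausage[t] holds the candidate waveforms for slot t. Every candidate
    must contain at least `window` samples.
    """
    if not sausage or any(not slot for slot in sausage):
        raise ValueError("every slot needs at least one candidate")

    scores = [0.0] * len(sausage[0])   # best score ending at each candidate
    backpointers: List[List[int]] = []

    for t in range(1, len(sausage)):
        new_scores, pointers = [], []
        for cand in sausage[t]:
            # Score of joining each previous candidate to this one.
            options = [
                scores[i] + normalized_cross_correlation(prev[-window:], cand[:window])
                for i, prev in enumerate(sausage[t - 1])
            ]
            best_i = max(range(len(options)), key=options.__getitem__)
            new_scores.append(options[best_i])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final candidate.
    idx = max(range(len(scores)), key=scores.__getitem__)
    path = [idx]
    for pointers in reversed(backpointers):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return path


# Toy sausage with two slots and two candidates per slot (made-up samples).
toy_sausage = [
    [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0]],
    [[3.0, 4.0, 5.0, 6.0], [-3.0, -4.0, 0.0, 0.0]],
]
print(best_unit_sequence(toy_sausage, window=2))  # prints [0, 0]
```

In the toy example, the first candidate of each slot is chosen because the tail of one (3, 4) matches the head of the other exactly, giving an NCC of 1. A fuller version would also search over time lags to pick the concatenation instant and could fold the normalized trajectory distance into the path score; both are left out here to keep the sketch short.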

Demo 1

Synthesized sentence "Philip walked to the door." (Blizzard Challenge 2011, 10-hour U.S. English corpus), by HMM-based TTS and by HTT.

More synthesized sentences, each by HMM-based TTS and by HTT:

One day I'm going to work here, he said.
We appreciate this is a difficult issue of international law.
We have had road rage, supermarket rage, and office rage.
To an extent, the council too is worn out.
He accused the pair of frequently aiming at the groin.

Demo 2

Synthesized sentences by HTT (Blizzard Challenge 2010, 5 hours British English corpus)

Demo 3

Synthesized sentences by HTT (9 hours Chinese Mandarin Corpus)

People

Yao Qian (yaoqian@microsoft.com); Frank Soong (frankkps@microsoft.com); Zhijie Yan (zhijiey@microsoft.com)