DEEP NEURAL NETWORK (DNN) FOR TTS SYNTHESIS

Samples of Synthesized Speech

Please click HMM, DNN, or RNN after each sentence to play the corresponding sample. System configurations: HMM (MDL=1), DNN (1k*3), RNN (512*2_Sigmoid + 512*2_BLSTM).

Hyperion, however, tumbles erratically as gravity from nearby moons tugs on its irregular shape. HMM DNN RNN
They've all dried out; it's all carrot juice. HMM DNN RNN
That's why Kathy could not change Ruby's behavior. HMM DNN RNN
But to hear South African coach Kitch Christie talk, it's Lomu who should be worried. HMM DNN RNN
When coaxing failed , the child's nose was plugged. HMM DNN RNN
My wife has the showplace she always wanted. HMM DNN RNN
France, Japan and Germany all now give more aid to Africa than America does. HMM DNN RNN
Shoe the trainer never matched Shoe the jockey. HMM DNN RNN
The Scottish club beat out bids from English teams Aston Villa, Leeds and Chelsea. HMM DNN RNN
The drugs made her so tired she could barely stay awake during school. HMM DNN RNN

Submitted to Interspeech 2014

TTS Synthesis with Bidirectional LSTM based Recurrent Neural Networks

Yuchen Fan, Yao Qian, Fenglong Xie, and Frank K. Soong

Abstract
Feed-forward deep neural network (DNN)-based TTS systems have recently been shown to outperform decision-tree based HMM TTS systems. However, the long-span contextual effects in a speech utterance are still not easy to accommodate, due to the intrinsically feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory in HMM-based TTS, dynamic features are needed to constrain speech parameter trajectory generation. In this paper, recurrent neural networks (RNNs) with bidirectional Long Short-Term Memory (BLSTM) cells are adopted to capture the correlation, or co-occurrence information, between any two instants in a speech utterance for parametric TTS synthesis. Experimental results show that a hybrid system of DNN and BLSTM-RNN, i.e., lower feed-forward hidden layers cascaded with upper bidirectional LSTM recurrent hidden layers, can outperform both the conventional decision-tree based HMM and the DNN TTS systems, objectively and subjectively. The speech trajectory generated by the BLSTM-RNN TTS is fairly smooth, and no dynamic constraints are needed.
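As a rough illustration of the hybrid structure described in the abstract, the sketch below stacks two sigmoid feed-forward layers under two BLSTM layers, following the 512*2_Sigmoid + 512*2_BLSTM configuration from the sample legend above. It uses PyTorch, which is not the authors' toolkit, and the input/output dimensions are placeholder values rather than the paper's actual feature sizes.

```python
import torch
import torch.nn as nn

class HybridDNNBLSTM(nn.Module):
    """Sketch of the hybrid TTS acoustic model: lower feed-forward (sigmoid)
    layers cascaded with upper bidirectional LSTM layers. Dimensions are
    illustrative assumptions, not the paper's exact values."""
    def __init__(self, in_dim=355, out_dim=127, ff_dim=512, lstm_dim=512):
        super().__init__()
        # Two lower feed-forward hidden layers with sigmoid activations.
        self.feedforward = nn.Sequential(
            nn.Linear(in_dim, ff_dim), nn.Sigmoid(),
            nn.Linear(ff_dim, ff_dim), nn.Sigmoid(),
        )
        # Two upper bidirectional LSTM layers capturing long-span context
        # across the whole utterance, in both directions.
        self.blstm = nn.LSTM(ff_dim, lstm_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Linear output layer mapping to static acoustic features.
        self.output = nn.Linear(2 * lstm_dim, out_dim)

    def forward(self, x):
        # x: (batch, frames, linguistic-feature dim) for a batch of utterances
        h = self.feedforward(x)
        h, _ = self.blstm(h)
        return self.output(h)

model = HybridDNNBLSTM()
frames = torch.randn(1, 300, 355)    # one 300-frame utterance of input features
acoustic = model(frames)             # (1, 300, 127) frame-level acoustic trajectory
```

Because the BLSTM layers see the entire utterance in both directions, the generated trajectory is already smooth at the frame level, which is why no dynamic-feature constraints are applied at generation time.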
---------------------------------------------------------------------------------------------------------------

To appear in ICASSP 2014

On the Training Aspects of Deep Neural Network (DNN) for Parametric TTS synthesis

Yao Qian, Yuchen Fan, Wenping Hu, Frank Soong

Abstract
The deep neural network (DNN), which can compactly model a long-span, intricate transform with a deep-layered structure, has been investigated for parametric TTS synthesis with a huge corpus (33,000 utterances). In this paper, we examine DNN TTS synthesis with a moderate-size corpus of 5 hours, which is more commonly used for parametric TTS training. The DNN is used to map input text features to output acoustic features (LSP, F0 and V/U). Experimental results show that the DNN can outperform the conventional HMM, which is first trained by ML and then refined by MGE. Both objective and subjective measures indicate that the DNN can synthesize speech better than the HMM-based baseline. The improvement is mainly in prosody: the RMSE between natural and generated F0 trajectories is reduced by 2 Hz with the DNN. This benefit likely comes from a key characteristic of the DNN, which can exploit feature correlations, e.g., between F0 and spectrum, without resorting to a more restricted probability density family such as diagonal Gaussians. Our experimental results also show that layer-wise BP pre-training can drive the weights to a better starting point than random initialization and results in a better DNN; that state boundary information is important for training the DNN to yield better synthesized speech; and that the hyperbolic tangent activation function in the DNN hidden layers helps training converge faster than the sigmoid.
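The sketch below illustrates the kind of frame-level regression this abstract describes: a feed-forward DNN with hyperbolic-tangent hidden layers mapping linguistic features to acoustic features, trained with a mean-squared-error loss. It is written in PyTorch purely for illustration; the layer sizes, feature dimensions, and optimizer settings are assumptions, and the paper's layer-wise BP pre-training and state-boundary alignment are not shown here.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only: linguistic input features, hidden width,
# and acoustic output features (LSP + log F0 + V/U flag).
IN_DIM, HID_DIM, OUT_DIM = 355, 1024, 127

dnn = nn.Sequential(
    nn.Linear(IN_DIM, HID_DIM), nn.Tanh(),   # tanh hidden layers (reported to converge faster than sigmoid)
    nn.Linear(HID_DIM, HID_DIM), nn.Tanh(),
    nn.Linear(HID_DIM, HID_DIM), nn.Tanh(),
    nn.Linear(HID_DIM, OUT_DIM),              # linear output layer for regression
)

optimizer = torch.optim.SGD(dnn.parameters(), lr=0.01)
criterion = nn.MSELoss()                      # frame-level regression loss

# One toy training step on random frame-level data standing in for
# aligned (linguistic feature, acoustic feature) pairs.
text_feats = torch.randn(256, IN_DIM)         # 256 frames of linguistic features
acoustic_targets = torch.randn(256, OUT_DIM)  # corresponding acoustic targets

pred = dnn(text_feats)
loss = criterion(pred, acoustic_targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the output layer is linear and the loss is a plain MSE over all acoustic dimensions jointly, such a network can exploit correlations between F0 and spectral features without assuming a restricted density form such as a diagonal Gaussian.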