DEEP NEURAL NETWORK (DNN) FOR TTS SYNTHESIS

The Samples of Synthesized Speech

Please click HMM, DNN and RNN to play.

Systems compared:
HMM (MDL=1)
DNN (1k*3; MSE, SGE)
RNN (512*2_Sigmoid + 512*2_BLSTM)

Hyperion, however, tumbles erratically as gravity from nearby moons tugs on its irregular shape. HMM DNN DNN_SGE RNN
They've all dried out; it's all carrot juice. HMM DNN DNN_SGE RNN
That's why Kathy could not change Ruby's behavior. HMM DNN DNN_SGE RNN
But to hear South African coach Kitch Christie talk, it's Lomu who should be worried. HMM DNN DNN_SGE RNN
When coaxing failed, the child's nose was plugged. HMM DNN DNN_SGE RNN
My wife has the showplace she always wanted. HMM DNN DNN_SGE RNN
France, Japan and Germany all now give more aid to Africa than America does. HMM DNN DNN_SGE RNN
Shoe the trainer never matched Shoe the jockey. HMM DNN DNN_SGE RNN
The Scottish club beat out bids from English teams Aston Villa, Leeds and Chelsea. HMM DNN DNN_SGE RNN
The drugs made her so tired she could barely stay awake during school. HMM DNN DNN_SGE RNN

 

Submitted to Signal Processing Letters

Sequence Generation Error (SGE) Minimization Based DNN Training for Text-to-Speech Synthesis

Yuchen Fan, Yao Qian and Frank K. Soong

Abstract
Feed-forward deep neural network (DNN)-based TTS, which employs a multi-layered structure to exploit the statistical correlations between rich contextual information and the corresponding acoustic features, has been shown to outperform its decision tree-based GMM-HMM counterpart. However, DNN TTS training has not taken the whole synthesized sequence, i.e., the sentence, into account in the optimization procedure, which results in an intrinsic inconsistency between training and testing. In this paper, we propose “sequence generation error” (SGE) minimization for DNN-based TTS training. By incorporating whole-sequence parameter generation into the training process, the mismatch between training and testing is eliminated and the original constraints between the static and dynamic features are naturally embedded in the optimization process. Experimental results on a 5-hour speech database show that DNN-based TTS trained with the new SGE minimization criterion further improves on the DNN baseline, particularly in subjective listening tests.
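
The idea of measuring error after whole-sequence parameter generation can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single scalar feature, unit variances, and one delta window with coefficients (-0.5, 0, 0.5), and all function names are hypothetical. The trajectory is the least-squares solution that best agrees with both the predicted static and the predicted delta features, and the error is then measured on that trajectory rather than frame by frame.

```python
def delta_rows(n):
    """Delta-window rows: 0.5 * (c[t+1] - c[t-1]), with edge frames clamped."""
    rows = []
    for t in range(n):
        row = [0.0] * n
        row[max(t - 1, 0)] -= 0.5
        row[min(t + 1, n - 1)] += 0.5
        rows.append(row)
    return rows


def delta_of(seq):
    """Dynamic features implied by a static trajectory."""
    return [sum(w * c for w, c in zip(row, seq)) for row in delta_rows(len(seq))]


def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def generate_trajectory(static_pred, delta_pred):
    """Solve (I + D^T D) c = s + D^T d, the unit-variance normal equations."""
    n = len(static_pred)
    d = delta_rows(n)
    a = [[(1.0 if i == j else 0.0) + sum(d[r][i] * d[r][j] for r in range(n))
          for j in range(n)] for i in range(n)]
    b = [static_pred[i] + sum(d[r][i] * delta_pred[r] for r in range(n))
         for i in range(n)]
    return solve(a, b)


def sequence_generation_error(static_pred, delta_pred, natural):
    """Mean squared error between the generated and natural trajectories."""
    generated = generate_trajectory(static_pred, delta_pred)
    return sum((c - y) ** 2 for c, y in zip(generated, natural)) / len(natural)


# Hypothetical values: a noisy static prediction paired with accurate deltas.
natural = [0.0, 1.0, 2.0, 1.5, 1.0]
noisy_static = [0.0, 1.4, 1.6, 1.5, 1.0]
frame_mse = sum((s - y) ** 2 for s, y in zip(noisy_static, natural)) / len(natural)
sge = sequence_generation_error(noisy_static, delta_of(natural), natural)
```

In this toy setting the generated trajectory is pulled toward the natural one by the delta constraints, so `sge` comes out smaller than `frame_mse`. A full system would presumably also use variances and higher-order dynamic features, which this sketch omits.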

---------------------------------------------------------------------------------------------------------------

To appear in Interspeech 2014

TTS Synthesis with Bidirectional LSTM based Recurrent Neural Networks

Yuchen Fan, Yao Qian, Fenglong Xie, and Frank K. Soong

Abstract
Feed-forward deep neural network (DNN)-based TTS systems have recently been shown to outperform decision tree-based HMM TTS systems. However, the long-time-span contextual effect in a speech utterance is still not easy to accommodate, due to the intrinsic feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory, dynamic features are needed to constrain speech parameter trajectory generation in HMM-based TTS. In this paper, Recurrent Neural Networks (RNNs) with Bidirectional Long Short-Term Memory (BLSTM) cells are adopted to capture the correlation or co-occurring information between any two instants in a speech utterance for parametric TTS synthesis. Experimental results show that a hybrid system of DNN and BLSTM-RNN, i.e., lower hidden layers with a feed-forward structure cascaded with upper hidden layers with a bidirectional LSTM-RNN structure, can outperform either the conventional decision tree-based HMM or a DNN TTS system, both objectively and subjectively. The speech trajectory generated by the BLSTM-RNN TTS is fairly smooth, and no dynamic constraints are needed.
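
The cascade of feed-forward lower layers and bidirectional LSTM upper layers can be sketched as follows. This is only an illustration, assuming one scalar unit per layer instead of 512 and hand-picked weights; the function names and `PARAMS` are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def feed_forward(xs, w, b):
    """Frame-wise sigmoid layer: each output depends on its own frame only."""
    return [sigmoid(w * x + b) for x in xs]


def lstm_pass(xs, p):
    """Scalar LSTM over a sequence; p maps gate name -> (w_x, w_h, bias)."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        pre = {k: wx * x + wh * h + b for k, (wx, wh, b) in p.items()}
        i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        g = math.tanh(pre["g"])   # candidate cell value
        c = f * c + i * g         # forget part of the old cell, add new content
        h = o * math.tanh(c)      # gated hidden output
        outputs.append(h)
    return outputs


def blstm(xs, fwd, bwd):
    """Pair a left-to-right pass with a right-to-left pass at every frame."""
    forward = lstm_pass(xs, fwd)
    backward = lstm_pass(xs[::-1], bwd)[::-1]
    return list(zip(forward, backward))


def hybrid(xs, params):
    """Feed-forward lower layer cascaded with a bidirectional LSTM upper layer."""
    hidden = feed_forward(xs, *params["ff"])
    return blstm(hidden, params["fwd"], params["bwd"])


# Hypothetical weights, chosen only so the example runs.
GATES = {"i": (0.5, 0.5, 0.0), "f": (0.5, 0.5, 0.0),
         "o": (0.5, 0.5, 0.0), "g": (1.0, 0.5, 0.0)}
PARAMS = {"ff": (1.5, -0.2), "fwd": GATES, "bwd": GATES}

out_a = hybrid([0.1, 0.5, -0.3, 0.8], PARAMS)
out_b = hybrid([0.1, 0.5, -0.3, -0.8], PARAMS)  # only the last frame differs
```

Changing only the last input frame leaves the forward output at the first frame untouched but changes the backward output there, which is how the bidirectional layer lets every frame see context from both directions of the utterance.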
---------------------------------------------------------------------------------------------------------------

ICASSP 2014

On the Training Aspects of Deep Neural Network (DNN) for Parametric TTS synthesis

Yao Qian, Yuchen Fan, Wenping Hu, Frank Soong

Abstract
Deep Neural Network (DNN), which can model a long-span, intricate transform compactly with a deep-layered structure, has been investigated for parametric TTS synthesis with a huge corpus (33,000 utterances). In this paper, we examine DNN TTS synthesis with a moderate-size corpus of 5 hours, which is more commonly used for parametric TTS training. The DNN is used to map input text features into output acoustic features (LSP, F0 and V/U). Experimental results show that the DNN can outperform the conventional HMM, which is first trained with ML and then refined with MGE. Both objective and subjective measures indicate that the DNN can synthesize speech better than the HMM-based baseline. The improvement is mainly in prosody, i.e., the RMSE between natural and DNN-generated F0 trajectories improves by 2 Hz. This benefit likely comes from a key characteristic of the DNN: it can exploit feature correlations, e.g., between F0 and spectrum, without using a more restricted probability density family, e.g., diagonal Gaussian. Our experimental results also show that layer-wise BP pre-training can drive weights to a better starting point than random initialization and result in a better DNN; that state boundary information is important for training a DNN to yield better synthesized speech; and that the hyperbolic tangent activation function in DNN hidden layers can help training converge faster than sigmoid.
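
For readers unfamiliar with the F0 metric, the sketch below shows one way such an RMSE could be computed. It assumes frame-aligned trajectories in Hz with 0.0 marking unvoiced frames, and it scores only frames voiced in both trajectories; that convention, the function name and the sample values are assumptions for illustration, not details from the paper.

```python
import math


def f0_rmse(natural, generated):
    """RMSE in Hz over frames voiced in both trajectories (0.0 marks unvoiced)."""
    pairs = [(n, g) for n, g in zip(natural, generated) if n > 0 and g > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both trajectories")
    return math.sqrt(sum((n - g) ** 2 for n, g in pairs) / len(pairs))


# Hypothetical F0 values: frames 1 and 2 are skipped because one side is unvoiced.
natural_f0 = [100.0, 0.0, 120.0, 130.0]
generated_f0 = [102.0, 110.0, 0.0, 126.0]
rmse_hz = f0_rmse(natural_f0, generated_f0)
```

Here the two scored frames have errors of 2 Hz and 4 Hz, so `rmse_hz` equals the square root of 10.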