Zhi-Jie Yan, Qiang Huo, and Jian Xu
25 August 2013
We present a new scalable approach to using deep neural network (DNN) derived features in Gaussian mixture density hidden Markov model (GMM-HMM) based acoustic modeling for large vocabulary continuous speech recognition (LVCSR). The DNN-based feature extractor is trained from a subset of training data to mitigate the scalability issue of DNN training, while GMM-HMMs are trained by using state-of-the-art scalable training methods and tools to leverage the whole training set. In a benchmark evaluation, we used 309-hour Switchboard-I (SWB) training data to train a DNN first, which achieves a word error rate (WER) of 15.4% on NIST-2000 Hub5 evaluation set by a traditional DNN-HMM based approach. When the same DNN is used as a feature extractor and 2,000-hour "SWB+Fisher" training data is used to train the GMM-HMMs, our DNN-GMM-HMM approach achieves a WER of 13.8%. If per-conversation-side based unsupervised adaptation is performed, a WER of 13.1% can be achieved.
|Published in||14th Annual Conference of the International Speech Communication Association, InterSpeech 2013|
|Publisher||International Speech Communication Association|