Conversational Speech Transcription Using Context-Dependent Deep Neural Networks

Frank Seide, Gang Li, and Dong Yu

Abstract

We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the RT03S Fisher portion of the phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs, to 18.5%, a 33% relative improvement. CD-DNN-HMMs combine classic artificial-neural-network HMMs with traditional tied-state triphones and deep-belief-network pre-training. They had previously been shown to reduce errors by 16% relative when trained on tens of hours of data using hundreds of tied states. This paper takes CD-DNN-HMMs further and applies them to transcription using over 300 hours of training data, over 9000 tied states, and up to 9 hidden layers, and demonstrates how sparseness can be exploited. On four less well-matched transcription tasks, we observe relative error reductions of 22–28%.
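As a quick check of the headline figure, the relative improvement can be recomputed from the two word-error rates quoted in the abstract. The sketch below is illustrative only: the function name `relative_reduction` is an assumption, not something from the paper, and the inputs are the rounded percentages given above. From those rounded values the result is roughly 0.325; the small gap to the stated 33% is presumably due to rounding of the published figures.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative error reduction as a fraction of the baseline.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Word-error rates (in percent) as quoted in the abstract.
gmm_hmm_wer = 27.4     # discriminatively trained Gaussian-mixture HMMs
cd_dnn_hmm_wer = 18.5  # CD-DNN-HMMs

rel = relative_reduction(gmm_hmm_wer, cd_dnn_hmm_wer)
print(f"absolute reduction: {gmm_hmm_wer - cd_dnn_hmm_wer:.1f} points")
print(f"relative reduction: {rel:.3f}")
```

The same helper applies to the 22–28% range reported for the four less well-matched tasks, given their baseline and CD-DNN-HMM error rates, which this page does not list.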

Details

Publication type: Inproceedings
Published in: Interspeech 2011
Publisher: International Speech Communication Association

Previous versions

Li Deng, Mike Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoff Hinton. Binary Coding of Speech Spectrograms Using a Deep Auto-encoder, International Speech Communication Association, September 2010.

Dong Yu and Li Deng. Deep Learning and Its Applications to Signal and Information Processing, IEEE Signal Processing Magazine, IEEE, January 2011.

Dong Yu and Li Deng. Deep-Structured Hidden Conditional Random Fields for Phonetic Recognition, International Speech Communication Association, September 2010.

George Dahl, Dong Yu, Li Deng, and Alex Acero. Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, IEEE Transactions on Audio, Speech, and Language Processing (receiving 2013 IEEE SPS Best Paper Award), January 2012.

Abdel-rahman Mohamed, Dong Yu, and Li Deng. Investigation of Full-Sequence Training of Deep Belief Networks for Speech Recognition, International Speech Communication Association, September 2010.
