Conversational Speech Transcription Using Context-Dependent Deep Neural Networks

We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the RT03S Fisher portion of the phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs, to 18.5%, a 33% relative improvement. CD-DNN-HMMs combine classic artificial-neural-network HMMs with traditional tied-state triphones and deep-belief-network pre-training. They had previously been shown to reduce errors by 16% relative when trained on tens of hours of data using hundreds of tied states. This paper takes CD-DNN-HMMs further and applies them to transcription using over 300 hours of training data, over 9000 tied states, and up to 9 hidden layers, and demonstrates how sparseness can be exploited. On four less well-matched transcription tasks, we observe relative error reductions of 22–28%.
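To make the "artificial-neural-network HMM" combination concrete, here is a minimal sketch of how a hybrid system of this kind typically scores one acoustic frame. The network estimates posteriors over tied triphone states, and these are divided by state priors to obtain scaled likelihoods for HMM decoding. This is not the paper's implementation. The layer sizes, the random weights, the sigmoid hidden units, the prior values, and all function names are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, inputs):
    """Compute W·x + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def log_softmax(scores):
    """Numerically stable log-softmax over the output scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]

def random_layer(n_out, n_in, rng):
    """Hypothetical helper: small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def state_log_posteriors(layers, features):
    """Forward pass: sigmoid hidden layers, softmax over tied states."""
    activations = features
    for weights, bias in layers[:-1]:
        activations = [sigmoid(z) for z in affine(weights, bias, activations)]
    weights, bias = layers[-1]
    return log_softmax(affine(weights, bias, activations))

def scaled_log_likelihoods(log_posteriors, state_priors):
    """Hybrid trick: log p(o|s) = log p(s|o) - log p(s) + constant."""
    return [lp - math.log(prior)
            for lp, prior in zip(log_posteriors, state_priors)]

# Toy sizes (assumed); the abstract reports over 9000 tied states
# and up to 9 hidden layers.
N_FEATURES, N_HIDDEN, N_STATES = 6, 8, 4
rng = random.Random(0)
layers = [random_layer(N_HIDDEN, N_FEATURES, rng),
          random_layer(N_HIDDEN, N_HIDDEN, rng),
          random_layer(N_STATES, N_HIDDEN, rng)]
state_priors = [0.4, 0.3, 0.2, 0.1]  # hypothetical tied-state priors

frame = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
log_post = state_log_posteriors(layers, frame)
log_like = scaled_log_likelihoods(log_post, state_priors)
print(log_like)
```

In a real recognizer the priors would usually be estimated from state frequencies in the training alignment, and the scaled log-likelihoods would replace the Gaussian-mixture scores inside the HMM decoder.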

In Interspeech 2011

Publisher International Speech Communication Association

Details

Type Inproceedings

Previous Versions

Li Deng, Mike Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoff Hinton. Binary Coding of Speech Spectrograms Using a Deep Auto-encoder, International Speech Communication Association, September 2010.

George Dahl, Dong Yu, Li Deng, and Alex Acero. Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, IEEE Transactions on Audio, Speech, and Language Processing, Special Issue on Deep Learning for Speech and Language Processing, January 2012.

Dong Yu and Li Deng. Deep Learning and Its Applications to Signal and Information Processing, IEEE Signal Processing Magazine, IEEE, January 2011.

Dong Yu and Li Deng. Deep-Structured Hidden Conditional Random Fields for Phonetic Recognition, International Speech Communication Association, September 2010.

Abdel-rahman Mohamed, Dong Yu, and Li Deng. Investigation of Full-Sequence Training of Deep Belief Networks for Speech Recognition, International Speech Communication Association, September 2010.
