Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition

Chao Weng; Dong Yu; Michael L. Seltzer; Jasha Droppo

IEEE/ACM Transactions on Audio, Speech, and Language Processing | October 2015

We investigate techniques based on deep neural networks (DNNs) for attacking the single-channel multi-talker speech recognition problem. Our proposed approach contains five key ingredients: a multi-style training strategy on artificially mixed speech data, a separate DNN to estimate senone posterior probabilities of the louder and softer speakers at each frame, a WFST-based two-talker decoder to jointly estimate and correlate the speaker and speech, a speaker switching penalty estimated from the energy pattern change in the mixed speech, and a confidence-based system combination strategy. Experiments on the 2006 speech separation and recognition challenge task demonstrate that our proposed DNN-based system has remarkable noise robustness to the interference of a competing speaker. The best setup of our proposed systems achieves an average word error rate (WER) of 18.8% across different SNRs and outperforms the state-of-the-art IBM superhuman system by 2.8% absolute with fewer assumptions.
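As an illustration of the first ingredient, artificially mixed training data can be produced by scaling an interfering utterance so that the target-to-interferer energy ratio matches a chosen SNR, then summing the two signals. The sketch below is a minimal, hypothetical example of that idea and not the authors' actual data pipeline: the function name `mix_at_snr`, the use of plain Python lists as waveforms, and the truncation to the shorter signal are all assumptions made for illustration.

```python
import math


def mix_at_snr(target, interferer, snr_db):
    """Mix two waveforms so that the target-to-interferer energy ratio is snr_db.

    Hypothetical helper for illustration only. Both inputs are sequences of
    float samples; the longer one is truncated to the length of the shorter.
    Returns the mixed waveform and the gain applied to the interferer.
    """
    n = min(len(target), len(interferer))
    target, interferer = target[:n], interferer[:n]

    # Signal energies over the overlapping region.
    e_target = sum(x * x for x in target)
    e_interf = sum(x * x for x in interferer)
    if e_target == 0 or e_interf == 0:
        raise ValueError("both signals must have nonzero energy")

    # Solve 10*log10(e_target / (gain^2 * e_interf)) = snr_db for gain.
    gain = math.sqrt(e_target / (e_interf * 10 ** (snr_db / 10)))

    mixed = [t + gain * i for t, i in zip(target, interferer)]
    return mixed, gain


# Toy signals standing in for two talkers (assumed data, not real speech).
talker_a = [math.sin(0.10 * k) for k in range(200)]
talker_b = [0.5 * math.cos(0.37 * k) for k in range(200)]

# Illustrative SNR values; a multi-style training set would cover several.
mixtures = {snr: mix_at_snr(talker_a, talker_b, snr)[0] for snr in (6.0, 0.0, -6.0)}
```

At a positive SNR the first talker is the louder one and at a negative SNR the softer one, so a set of mixtures like this exposes a model to both roles for the same utterance.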

© IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.