Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition

  • Chao Weng,
  • Dong Yu,
  • Michael L. Seltzer,
  • Jasha Droppo

IEEE/ACM Transactions on Audio, Speech, and Language Processing

We investigate techniques based on deep neural networks (DNNs) for attacking the single-channel multi-talker speech recognition problem. Our proposed approach contains five key ingredients: a multi-style training strategy on artificially mixed speech data, a separate DNN to estimate the senone posterior probabilities of the louder and softer speakers at each frame, a WFST-based two-talker decoder to jointly estimate and correlate the speaker and speech, a speaker switching penalty estimated from energy pattern changes in the mixed speech, and a confidence-based system combination strategy. Experiments on the 2006 speech separation and recognition challenge task demonstrate that the proposed DNN-based system is remarkably robust to interference from a competing speaker. The best setup of our proposed systems achieves an average word error rate (WER) of 18.8% across different SNRs and outperforms the state-of-the-art IBM superhuman system by 2.8% absolute while making fewer assumptions.
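
As a rough illustration of the multi-style training ingredient, the sketch below mixes a target and an interfering utterance at a prescribed SNR to produce artificially mixed training data. It is a minimal sketch assuming 1-D NumPy waveforms at a common sample rate; the function name `mix_at_snr` and the SNR grid are illustrative choices, not the paper's actual tooling.

```python
# A minimal sketch of multi-style training data generation, assuming 1-D
# NumPy waveforms at a common sample rate. mix_at_snr and the SNR grid are
# illustrative choices, not the paper's actual tooling.
import numpy as np

def mix_at_snr(target: np.ndarray, interferer: np.ndarray, snr_db: float) -> np.ndarray:
    """Add the interferer to the target after scaling it to the given SNR."""
    # Tile or trim the interferer to match the target length.
    if len(interferer) < len(target):
        reps = int(np.ceil(len(target) / len(interferer)))
        interferer = np.tile(interferer, reps)
    interferer = interferer[: len(target)]

    p_target = np.mean(target ** 2)
    p_interf = np.mean(interferer ** 2) + 1e-12  # guard against digital silence
    # Choose the gain so that 10*log10(p_target / (gain**2 * p_interf)) == snr_db.
    gain = np.sqrt(p_target / (p_interf * 10.0 ** (snr_db / 10.0)))
    return target + gain * interferer

# Example: sweep a grid of target-to-masker ratios (the challenge task spans
# 6 dB down to -9 dB in 3 dB steps).
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)      # stand-ins for real waveforms
interferer = rng.standard_normal(12000)
mixtures = {snr: mix_at_snr(target, interferer, snr) for snr in (6, 3, 0, -3, -6, -9)}
```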
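
The speaker switching penalty can be pictured with a toy example: the decoder tracks which talker currently occupies the louder role and charges a penalty whenever that assignment flips between frames. The two-state Viterbi below captures only this role-assignment aspect; it is a simplified stand-in for the paper's full WFST-based two-talker decoder, and all names are hypothetical.

```python
# A toy two-state Viterbi over the "which talker is louder" role assignment,
# charging a penalty on every switch. This is a simplified stand-in for the
# paper's WFST-based joint two-talker decoder; all names are hypothetical.
import numpy as np

def assign_roles(loglik_a_loud, loglik_b_loud, switch_penalty):
    """Return per-frame role labels: 0 if talker A is louder, 1 if talker B is.

    loglik_a_loud[t] / loglik_b_loud[t] are frame log-scores for each role.
    """
    T = len(loglik_a_loud)
    score = np.zeros((T, 2))
    back = np.zeros((T, 2), dtype=int)
    score[0] = [loglik_a_loud[0], loglik_b_loud[0]]
    for t in range(1, T):
        for s, ll in enumerate((loglik_a_loud[t], loglik_b_loud[t])):
            stay = score[t - 1, s]                          # keep the same role
            switch = score[t - 1, 1 - s] - switch_penalty   # flip roles and pay
            back[t, s] = s if stay >= switch else 1 - s
            score[t, s] = max(stay, switch) + ll
    # Backtrace the best role sequence.
    roles = np.empty(T, dtype=int)
    roles[-1] = int(np.argmax(score[-1]))
    for t in range(T - 1, 0, -1):
        roles[t - 1] = back[t, roles[t]]
    return roles

# Toy usage with random frame scores; a larger penalty yields fewer switches.
rng = np.random.default_rng(1)
roles = assign_roles(rng.standard_normal(50), rng.standard_normal(50), 2.0)
```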