Towards Robust Conversational Speech Recognition and Understanding

While significant progress has been made in automatic speech recognition (ASR) over the last few decades, recognizing and understanding unconstrained conversational speech remains a challenging problem. Unlike read or highly constrained speech, spontaneous conversational speech is often ungrammatical and ill-structured. Since the relevant semantic notions are embedded in a set of keywords, the first goal is to propose a model training methodology for keyword spotting. A non-uniform minimum classification error (MCE) approach is proposed, which achieves consistent and significant performance gains on both English and Mandarin large-scale spontaneous conversational speech corpora (Switchboard, HKUST). Adverse acoustic environments also degrade system performance substantially. Recently, acoustic models based on deep neural networks (DNNs) have shown great success, opening new possibilities for further improving noise robustness in conversational speech recognition. The second goal is to propose a DNN-based acoustic model that is robust to additive noise, channel distortion, and interference from competing talkers. A hybrid recurrent DNN-HMM system is proposed for robust acoustic modeling, achieving state-of-the-art performance on two benchmark datasets (Aurora-4, CHiME). To study the specific case of conversational speech recognition in the presence of a competing talker, several multi-style DNN training setups are investigated and a joint decoder operating on multi-talker speech is introduced. The proposed combined system outperforms the state-of-the-art 2006 IBM superhuman system on the same benchmark dataset. Even with perfect ASR, extracting semantic notions from conversational speech can be challenging due to frequently uttered disfluencies, filler words, and mispronunciations. The third goal is to propose a robust WFST-based semantic decoder that interfaces seamlessly with the ASR system. Latent semantic rational kernels (LSRKs) are proposed, and substantial topic-spotting performance gains are achieved on two conversational speech tasks (Switchboard, HMIHY0300).
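
To make the non-uniform MCE idea above concrete, the sketch below is a minimal Python/NumPy illustration (not the implementation presented in the talk): the smoothed classification error of a keyword token is weighted more heavily than that of an ordinary token, so training effort concentrates on the words that carry the semantic content. The names mce_loss, eta, gamma, theta, and cost are illustrative assumptions.

import numpy as np

def mce_loss(scores, correct, eta=1.0, gamma=2.0, theta=0.0, cost=1.0):
    """Smoothed MCE loss for one token (illustrative sketch).
    scores  : discriminant scores g_j(X), one per competing class
    correct : index of the correct class
    cost    : non-uniform error cost (e.g. > 1 for keyword tokens)
    """
    g_correct = scores[correct]
    competitors = np.delete(scores, correct)
    # log-average of exponentiated competitor scores (anti-discriminant)
    anti = np.log(np.mean(np.exp(eta * competitors))) / eta
    d = anti - g_correct                                  # misclassification measure
    smoothed = 1.0 / (1.0 + np.exp(-gamma * d + theta))   # sigmoid error count
    return cost * smoothed                                # non-uniform weighting

# Toy usage: the same acoustic confusion costs more on a keyword token.
scores = np.array([2.3, 1.9, 0.4])                        # hypothetical class scores
print(mce_loss(scores, correct=0, cost=3.0))              # keyword token
print(mce_loss(scores, correct=0, cost=1.0))              # ordinary token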

Speaker Details

Chao Weng is currently a PhD student in the School of Electrical and Computer Engineering at the Georgia Institute of Technology, advised by Prof. Biing-Hwang (Fred) Juang. His research interests lie generally in the areas of speech recognition and natural language processing. The focus of his PhD dissertation is to design and build an unconstrained conversational speech recognition and understanding system that is robust to various adverse acoustic environments. In the summer of 2012, he was a research intern at AT&T Labs Research, working on recurrent neural network language modeling, rational kernels, and Mandarin and Japanese speech recognition. In the summer of 2013, he was a research intern at Microsoft Research, working on single-channel mixed speech recognition using deep neural networks. In addition, he contributes to the Kaldi project, an open-source toolkit for speech recognition.

Date:
Speaker:
Chao Weng
Affiliation:
Georgia Institute of Technology