Yan Huang, Malcolm Slaney, Michael L. Seltzer, and Yifan Gong
13 September 2014
Modeling heterogeneous data sources remains a fundamental challenge in acoustic modeling for speech recognition. We call this the multi-condition problem because the speech data come from many different conditions. In this paper, we introduce the fundamental confusability problem in multi-condition learning, then discuss the problem formalization, the taxonomy, and the architectures for multi-condition learning. While the ideas presented are applicable to all classifiers, we focus our attention in this work on acoustic models based on deep neural networks (DNNs). We propose four different strategies for multi-condition learning of a DNN, which we refer to as a mixed-condition model, a condition-dependent model, a condition-normalizing model, and a condition-aware model. Based on experimental results on the voice search and short message dictation task and the Aurora 4 task, we show that the confusability introduced when modeling heterogeneous data depends on the source of acoustic distortion itself, the front-end feature extractor, and the classifier. We also demonstrate that the best approach for dealing with heterogeneous data may not be to let the model sort it out blindly, even with a classifier as sophisticated as a DNN.
Publisher: ISCA - International Speech Communication Association