Dong Yu, Xin Chen, and Li Deng
Recently, we have shown that context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) can achieve very promising recognition results on large-vocabulary speech recognition tasks, as evidenced by over one-third fewer word errors than discriminatively trained conventional HMM-based systems on the 300hr Switchboard benchmark task. In this paper, we propose and describe two types of factorized adaptive DNNs that improve on the earlier versions of CD-DNN-HMMs. In the first model, the hidden speaker and environment factors and the tied triphone states are jointly approximated; in the second model, the factors are first estimated and then fed into the main DNN to predict the tied triphone states. We evaluated these models on the small 30hr Switchboard task. The preliminary results indicate that more training data are needed to show the full potential of these models. However, these models provide new ways of modeling speaker and environment factors and offer insight into how environment-invariant DNN models may be constructed and subsequently trained.
In IWSML 2012