Nicolas Boulanger-Lewandowski, Jasha Droppo, Mike Seltzer, and Dong Yu
In this paper, we investigate phone sequence modeling with recurrent neural networks in the context of speech recognition. We introduce a hybrid architecture that combines a phonetic model with an arbitrary frame-level acoustic model, and we propose efficient algorithms for training, decoding, and sequence alignment. We evaluate the benefit of our phonetic model on the TIMIT and Switchboard-mini datasets when it is used alongside a powerful context-dependent deep neural network (DNN) acoustic classifier and a higher-level 3-gram language model. Consistent improvements of 2–10% in phone accuracy and 3% in word error rate suggest that our approach can readily replace HMMs in current state-of-the-art systems.