Abdel-rahman Mohamed, Dong Yu, and Li Deng
Deep Belief Networks (DBNs) have recently been proposed for phone recognition and were found to achieve highly competitive performance. In the original DBNs, only frame-level information was used to train the DBN weights, although it has long been known that sequential (full-sequence) information can help improve speech recognition accuracy. In this paper we investigate approaches to optimizing the DBN weights, state-to-state transition parameters, and language model scores using a sequential discriminative training criterion. We describe and analyze the proposed training algorithm and strategy, and discuss practical issues and how they affect the final results. We show that DBNs learned using the sequence-based training criterion outperform those learned with the frame-based criterion for both three-layer and six-layer models, but that optimizing the deeper DBN is more difficult under the sequence-based criterion.
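The difference between a frame-based and a sequence-based criterion can be illustrated with a minimal sketch. It assumes the sequence-level objective is the conditional log-likelihood of the whole label sequence under a linear-chain model, where per-frame network outputs are combined with state-to-state transition scores and normalized over all possible label sequences. The function names, the toy scores, and the choice of plain Python are illustrative assumptions and are not taken from the paper.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def frame_log_likelihood(scores, labels):
    """Frame-based criterion: each frame is normalized and scored on its own.

    scores[t][k] is the (hypothetical) network output for state k at frame t.
    """
    return sum(s[l] - logsumexp(s) for s, l in zip(scores, labels))


def sequence_log_likelihood(scores, trans, labels):
    """Sequence-based criterion: log p(labels | all frames), linear-chain model.

    trans[j][k] is the transition score from state j to state k.
    Normalization runs over every possible label sequence at once.
    """
    # Unnormalized score of the reference label path.
    path = scores[0][labels[0]]
    for t in range(1, len(labels)):
        path += trans[labels[t - 1]][labels[t]] + scores[t][labels[t]]

    # Forward recursion for the log partition function over all paths.
    n = len(scores[0])
    alpha = list(scores[0])
    for t in range(1, len(scores)):
        alpha = [
            scores[t][k] + logsumexp([alpha[j] + trans[j][k] for j in range(n)])
            for k in range(n)
        ]
    return path - logsumexp(alpha)


# Toy example: 3 frames, 2 states (values are made up for illustration).
scores = [[2.0, 0.5], [0.1, 1.5], [1.0, 1.0]]
trans = [[0.5, -1.0], [-0.3, 0.8]]
labels = [0, 1, 1]

print("frame-level log-likelihood:   ", frame_log_likelihood(scores, labels))
print("sequence-level log-likelihood:", sequence_log_likelihood(scores, trans, labels))
```

With all transition scores set to zero, the partition function factorizes over frames and the two criteria coincide. They differ only once transition (or language model) scores couple neighboring frames, which is the setting in which these parameters can be trained jointly with the network weights.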
|Published in|Interspeech 2010|
|Publisher|International Speech Communication Association|
© 2007 ISCA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the ISCA and/or the author.