An Architecture for Rapid Decoding of Large Vocabulary Conversational Speech

George Saon; Geoffrey Zweig; Brian Kingsbury; Lidia Mangu; Upendra Chaudhari

Proceedings of Eurospeech | January 2003

This paper addresses the question of how to design a large vocabulary recognition system so that it can simultaneously handle a sophisticated language model, perform state-of-the-art speaker adaptation, and run in one times real time (1xRT). The architecture we propose is based on classical HMM Viterbi decoding, but uses an extremely fast initial speaker-independent decoding to estimate VTL warp factors as well as feature-space and model-space MLLR transformations, which are then used in a final speaker-adapted decoding. We present results on past Switchboard evaluation data indicating that this strategy compares favorably to published unlimited-time systems (running at several hundred times real time). Coincidentally, this is the system that IBM fielded in the 2003 EARS Rich Transcription evaluation.
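
As a small illustration of the 1xRT constraint, the sketch below computes a real-time factor as total processing time divided by audio duration, so that 1.0 corresponds to 1xRT. The function name and all timings are hypothetical and are not taken from the paper; the point is only that the initial decoding, the adaptation estimates, and the final decoding have to share a single time budget.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (1.0 means 1xRT)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings (assumptions for illustration, not results from the paper).
first_pass_seconds = 60.0    # fast speaker-independent decoding
adaptation_seconds = 60.0    # estimating warp factors and MLLR transforms
final_pass_seconds = 180.0   # speaker-adapted decoding
audio_seconds = 300.0        # five minutes of speech

# All stages count against the same budget.
total_rtf = real_time_factor(
    first_pass_seconds + adaptation_seconds + final_pass_seconds,
    audio_seconds,
)
print(f"overall real-time factor: {total_rtf:.2f}xRT")
```

Under these assumed numbers the three stages together take exactly as long as the audio itself; a system that needed 600 seconds for the same audio would run at 2xRT.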