Geoffrey Zweig and Patrick Nguyen
We propose a theoretical framework for doing speech recognition with segmental conditional random fields, and describe the implemenation of a toolkit for experimenting with these models. This framework allows users to easily incorporate multiple detector streams into a discriminatively trained direct model for large vocabulary continuous speech recognition. The detector streams can operate at multiple scales (frame, phone, multi-phone, syllable or word) and are combined at the word level in the CRF training and decoding processes. A key aspect of our approach is that features are defined at the word level, and can thus identify long span phenomena such as the edit distance between an observed and expected sequence of detection events. Further, a wide variety of features are automatically constructed from atomic detector streams, allowing the user to focus on the creation of informative detectors. Generalization to unseen words is possible through the use of decomposable consistency features [1, 2], and our framework allows for the joint or separate training of the acoustic and language models.