Search and Decoding Strategies for Complex Lexical Modeling in LVCSR

  • Anoop Deoras

PhD Thesis, Johns Hopkins University

The language model (LM) in most state-of-the-art large vocabulary continuous speech recognition (LVCSR) systems is still the n-gram. A major reason for using such simple LMs, besides the ease of estimating them from text, is computational tractability: richer models are far more expensive to apply during decoding.
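For reference, an n-gram LM factorizes the probability of a word sequence while truncating each word's history to its n-1 predecessors:

```latex
P(w_1, \dots, w_T) \approx \prod_{i=1}^{T} P\left(w_i \mid w_{i-n+1}, \dots, w_{i-1}\right)
```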

It is also true, however, that long-span LMs, be they due to a higher n-gram order or because they take syntactic, semantic, discourse and other long-distance dependencies into account, are much more accurate than low-order n-grams. The standard practice is to carry out a first pass of decoding using, say, a 3-gram LM to generate a lattice, and to rescore only the hypotheses in the lattice with a higher-order LM. But even the search space defined by a lattice is intractable for many long-span LMs. In such cases, only the N-best full-utterance hypotheses from the lattice are extracted for evaluation. However, the N-best lists so produced tend to be “biased” towards the model producing them, making the rescoring sub-optimal, especially if the rescoring model is complementary to the initial n-gram model. For this reason, we seek ways to incorporate information from long-span LMs by searching a less biased search space.
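To make the rescoring setup concrete, here is a minimal sketch of N-best rescoring under a long-span LM. The interface (a long_span_lm_score callable and per-hypothesis score tuples) is hypothetical and simplified, not the thesis's implementation:

```python
# Minimal sketch of N-best rescoring with a long-span LM.
# The data layout and scoring callable are assumptions for illustration.

def rescore_nbest(nbest, long_span_lm_score, lm_weight=10.0):
    """Re-rank N-best hypotheses under a long-span LM.

    nbest: list of (words, acoustic_logprob, first_pass_lm_logprob) tuples.
    long_span_lm_score: callable mapping a word list to a log-probability.
    lm_weight: LM scale factor, typically tuned on held-out data.
    """
    best_score, best_words = float("-inf"), None
    for words, am_logprob, _first_pass_lm_logprob in nbest:
        # Replace the first-pass n-gram score with the long-span LM score.
        total = am_logprob + lm_weight * long_span_lm_score(words)
        if total > best_score:
            best_score, best_words = total, words
    # Every candidate came from the first-pass model's search, so a
    # hypothesis the n-gram disliked is never even scored -- the bias
    # described above.
    return best_words
```

Tuning lm_weight on held-out data trades off acoustic and language-model evidence; the closing comment marks exactly the bias the abstract describes.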

In this thesis, we first present strategies for combining many complex long- and short-span language models into a single, substantially more accurate unified model of language. We then show how this unified model can be used to rescore dense word graphs with a novel search technique, removing the need for sub-optimal N-best rescoring. We also present an approach based on variational inference, by which long-span models are efficiently approximated by tractable but faithful models, allowing long-distance information to be incorporated directly into first-pass decoding.
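One way to read the variational idea: choosing the n-gram that minimizes the KL divergence from the long-span model reduces to maximum-likelihood n-gram estimation on text sampled from that model. The sketch below assumes a hypothetical sample_sentence callable (e.g. a recurrent LM's sampler) and omits the smoothing a real system would need:

```python
from collections import Counter, defaultdict

def approximate_with_ngram(sample_sentence, num_samples=100_000, n=3):
    """Fit a tractable n-gram to text sampled from a long-span LM.

    sample_sentence: hypothetical callable returning one word list drawn
    from the long-span model.
    """
    counts = defaultdict(Counter)
    for _ in range(num_samples):
        # Pad with sentence-boundary symbols so every word has a history.
        words = ["<s>"] * (n - 1) + sample_sentence() + ["</s>"]
        for i in range(n - 1, len(words)):
            counts[tuple(words[i - n + 1:i])][words[i]] += 1
    # Relative-frequency estimates; a real system would smooth these.
    model = {}
    for history, ctr in counts.items():
        total = sum(ctr.values())
        model[history] = {w: c / total for w, c in ctr.items()}
    return model
```

The resulting n-gram can then be used directly in first-pass decoding, which is the point of the approximation.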

We have validated the methods proposed in this thesis on several standard, competitive speech recognition tasks, in some cases outperforming state-of-the-art results. We hope these methods will prove useful for research with long-span language models not only in speech recognition but also in other areas of natural language processing, such as machine translation, where decoding is likewise limited to n-gram language models.