Share on Facebook Tweet on Twitter Share on LinkedIn Share by email
MAVIS technical background


Speech-Recognition for Audio Indexing Backgrounder

There are two fundamentally different approaches to speech recognition, one referred to as Phonetic indexing and the other Large-vocabulary Continuous Speech Recognition (LVCSR). Phonetic indexing is based on phonetic representations of the pronunciation of the spoken terms and has no notion of words. It performs phonetic based recognition during the indexing process, and at search time, the query is translated into its phonetic spelling which is then matched against the phonetic recognition result. Although this technique has the advantage of not depending on a preconfigured vocabulary, it is not appropriate for searching large audio archives of 10,000s hours because of the high probability of errors using phonetic recognition. It is however appropriate for relatively small amounts of audio as might be the case for searching personal recordings of meetings or lectures. Microsoft has utilized this technique with success to enable the “Audio Search” feature in Office OneNote 2007.

Large-vocabulary continuous speech recognition or LVCSR, which is used in MAVIS, turns the audio signals into text using a preconfigured vocabulary and language grammar. The resulting text is then indexed using a text indexer. The LVCSR technique is appropriate for searching large amounts of audio archives which can be 10,000s of hours in length. The vocabulary can be configured to enable recognition of proper nouns such as names of people, places or thing.

Although LVCSR based audio search systems can provide a more accurate search result than phonetic based systems, State-of-the-art LVCSR based speech-recognition accuracy on conversational speech is still not perfect. Researchers at MSR Asia have developed a more accurate technique called “Probabilistic Word-Lattice Indexing” which takes into account how confident the recognition of a word is, as well as what alternate recognition candidates were considered. It also preserves time stamps to allow direct navigation to keyword matches in the audio or video.


Probabilistic Word-Lattice Indexing

For conversational speech, typical speech recognizers can only achieve accuracy of about 60%. To improve the accuracy of speech search, Microsoft Research Asia developed a technique called ”Probabilistic Word-Lattice Indexing,” which helps to improve search accuracy in three ways:

  • Less false negatives Lattices allow to find (sub-)phrase and ‘AND’ matches where individual words are of low confidence, but the fact that they are queried together allows us to infer that they still may be correct. Word-lattices represent alternative recognition candidates that were also considered by the recognizer, but did not turn out to be the top-scoring candidate.
  • Less false positives Lattices also provide a confidence score for each word match. This can be used to suppress low-confidence matches.
  • Time stamps Lattices, unlike text, retain the start times of spoken words, which is useful for navigation.

Word lattices accomplish this by representing the words that may have been spoken in a recording as a graph structure. Experiments show that indexing and searching this lattice structure instead of plain speech-to-text transcripts significantly improves document-retrieval accuracy for multi-word queries (30-60% for phrase queries, and over 200% for AND queries).

For more information about Microsoft Research Asia’s basic lattice method, see:

A challenge in implementing probabilistic word-lattice indexing is the size of the word lattices. Raw lattices as obtained from the recognizer can contain hundreds of alternates for each spoken word. To address this challenge MSRA has devised a technique referred to as Time-based Merging for Indexing which brings down lattice size to about 10 x the size of a corresponding text-transcript index. This is orders of magnitude less than using Raw lattices.