MAVIS

The Microsoft Research Audio Video Indexing System (MAVIS) is a set of software components that use speech recognition technology to index the spoken content of recorded conversations, whether they are from meetings, conference calls, voice mails, presentations, online lectures, or even Internet video.

As the role of multimedia continues to grow in the enterprise, in government, and on the Internet, the need for technologies that enable better discovery and search of such content becomes all the more important.

Microsoft Research has been working in the area of speech recognition for over a decade. Recently, Microsoft Research Asia (MSRA) has developed software tools and APIs that can be used in conjunction with Microsoft SharePoint or SQL Server to enable audio and video search with the same user experience and indexing infrastructure used for full-text document search. This software package, referred to as the Microsoft Audio Video Indexing System (MAVIS), uses speech recognition technology to index the spoken content of recorded conversations, whether they are from meetings, conference calls, voice mails, presentations, online lectures, or even Internet video.

MAVIS provides a robust solution for efficient and simple search and retrieval of spoken content in large multimedia archives. The software provides plug-in components that extend standard Microsoft SharePoint search or SQL Server Full-Text Search functionality to audio search. At the core of MAVIS are a Large-Vocabulary Continuous Speech Recognition (LVCSR) engine and a unique indexing technique called “probabilistic lattice indexing”, which together achieve robust search and retrieval performance on audio and video content.

Below is a sample UI demonstrating the user experience of video search using MAVIS and a Microsoft SharePoint Web Part:

Audio Search challenges

There are three key challenges for audio indexing. The first is that while humans use speech on a daily basis with ease, speech recognition by machines is still not perfect. This is especially true for speech found in audio archives, which is often unplanned, conversational, and spontaneous, and recorded under a wide variety of imperfect acoustic conditions. For conversational speech found in audio archives, state-of-the-art word accuracies often do not exceed 60%; therefore, indexing only the speech-to-text transcripts yields sub-optimal search accuracy. The second challenge is how to represent speech in a search index so that it can be efficiently indexed and searched. The MSRA Speech group has developed technologies that integrate with Microsoft SharePoint or SQL Server to enable audio and video search. The third challenge is how to present the search results so that the user can easily navigate to the one that is most contextually relevant. MAVIS enables this through clickable search-result text snippets, as displayed in the above screen shot.

Speech-recognition for audio indexing backgrounder

There are two fundamentally different approaches to speech recognition for audio indexing: phonetic indexing and large-vocabulary continuous speech recognition (LVCSR).

Phonetic indexing is based on phonetic representations of the pronunciation of the spoken terms and has no notion of words. It performs phonetic recognition during the indexing process, and at search time, the query is translated into its phonetic spelling, which is then matched against the phonetic recognition result. Although this technique has the advantage of not depending on a preconfigured vocabulary, it is not appropriate for searching large audio archives of tens of thousands of hours because of the high probability of errors in phonetic recognition. It is, however, appropriate for relatively small amounts of audio, as might be the case when searching personal recordings of meetings or lectures. Microsoft has used this technique with success to enable the “Audio Search” feature in Office OneNote 2007.
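The search-time step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the OneNote implementation: the `LEXICON` entries, the phone symbols, and the function names are all hypothetical, and the match is exact, whereas a real phonetic search engine must tolerate recognition errors.

```python
# Toy pronunciation lexicon (hypothetical entries and phone symbols).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "search": ["S", "ER", "CH"],
    "lattice": ["L", "AE", "T", "IH", "S"],
}


def query_to_phones(query):
    """Translate a text query into its phonetic spelling."""
    phones = []
    for word in query.lower().split():
        phones.extend(LEXICON[word])
    return phones


def find_matches(recognized, query_phones):
    """Return every offset where query_phones occurs in the recognized phone stream."""
    n = len(query_phones)
    return [
        i
        for i in range(len(recognized) - n + 1)
        if recognized[i:i + n] == query_phones
    ]


# Phone stream as a phonetic recognizer might emit it for one recording (made up).
recognized = ["DH", "AH", "S", "P", "IY", "CH", "S", "ER", "CH", "D", "EH", "M", "OW"]
hits = find_matches(recognized, query_to_phones("speech search"))
```

Because the index holds only phones, any query that can be spelled phonetically is searchable, but a single misrecognized phone breaks an exact match, and loosening the match admits more false hits as the archive grows.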

Large-vocabulary continuous speech recognition (LVCSR), which is the subject of this document, turns the audio signal into text using a preconfigured vocabulary and language grammar. The resulting text is then indexed using a text indexer. The LVCSR technique is appropriate for searching large audio archives, which can be tens of thousands of hours in length. The vocabulary can be configured to enable recognition of proper nouns such as names of people, places, or things.

Although LVCSR-based audio search systems can provide more accurate search results than phonetic-based systems, state-of-the-art LVCSR accuracy on conversational speech is still not perfect. Researchers at MSR Asia have developed a more accurate technique called “Probabilistic Word-Lattice Indexing”, which takes into account how confident the recognizer is in each word, as well as which alternate recognition candidates were considered. It also preserves time stamps to allow direct navigation to keyword matches in the audio or video.

Probabilistic Word-lattice indexing

For conversational speech, typical speech recognizers can only achieve a word accuracy of about 60%. To improve the accuracy of speech search, Microsoft Research Asia developed a technique called “Probabilistic Word-Lattice Indexing”, which helps to improve search accuracy in three ways:

• Fewer false negatives: Word lattices represent alternative recognition candidates that the recognizer also considered but that did not turn out to be the top-scoring candidate. Lattices therefore make it possible to find (sub-)phrase and ‘AND’ matches where individual words are of low confidence, but the fact that they are queried together allows us to infer that they may still be correct.

• Fewer false positives: Lattices also provide a confidence score for each word match. This can be used to suppress low-confidence matches.

• Time stamps: Lattices, unlike text, retain the start times of spoken words, which is useful for navigation.
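The three properties above can be illustrated with a toy index. The sketch below is an assumption-laden simplification, not the MAVIS implementation: it flattens each lattice into (word, start time, posterior) entries, scores an AND query as the product of the best posterior per term (an independence assumption chosen for brevity), and uses made-up data, names, and thresholds.

```python
from collections import defaultdict

# Each entry: (word, start_time_seconds, posterior). Several entries may compete
# for the same stretch of audio. All documents and values are made up.
LATTICES = {
    "meeting1": [
        ("speech", 12.4, 0.55), ("beach", 12.4, 0.40),
        ("recognition", 12.9, 0.45), ("wreck", 12.9, 0.50),
    ],
    "voicemail7": [
        ("beach", 3.1, 0.90), ("vacation", 3.6, 0.85),
    ],
}


def build_index(lattices):
    """Inverted index: word -> list of (doc_id, start_time, posterior)."""
    index = defaultdict(list)
    for doc_id, entries in lattices.items():
        for word, start, posterior in entries:
            index[word].append((doc_id, start, posterior))
    return index


def search_and(index, terms, min_score=0.2):
    """AND query returning (doc_id, score, earliest_start), best score first."""
    per_term = []
    for term in terms:
        best = {}  # doc_id -> (start, posterior) of the most confident hit
        for doc_id, start, posterior in index.get(term, []):
            if doc_id not in best or posterior > best[doc_id][1]:
                best[doc_id] = (start, posterior)
        per_term.append(best)
    if not per_term:
        return []
    docs = set.intersection(*(set(best) for best in per_term))
    results = []
    for doc_id in docs:
        score = 1.0
        for best in per_term:
            score *= best[doc_id][1]
        if score >= min_score:  # suppress low-confidence matches
            earliest = min(best[doc_id][0] for best in per_term)
            results.append((doc_id, round(score, 4), earliest))
    return sorted(results, key=lambda r: -r[1])


index = build_index(LATTICES)
```

In this toy data the top-scoring transcript of `meeting1` would read "speech wreck", so a transcript-only index would miss the query `["speech", "recognition"]`, while the lattice index still finds it along with a start time to jump to. Raising `min_score` for the query `["beach"]` drops the low-confidence hit in `meeting1` and keeps the confident one in `voicemail7`.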

Word lattices accomplish this by representing the words that may have been spoken in a recording as a graph structure. Experiments show that indexing and searching this lattice structure instead of plain speech-to-text transcripts significantly improves document-retrieval accuracy for multi-word queries (30-60% for phrase queries, and over 200% for AND queries). For more information on MSRA’s basic lattice method, see the papers “Towards Spoken-Document Retrieval for the Internet: Lattice Indexing For Large-Scale Web-Search Architectures” and “Towards Spoken-Document Retrieval for the Enterprise: Approximate Word-Lattice Indexing with Text Indexers”.

A challenge in implementing probabilistic word-lattice indexing is the size of the word lattices. Raw lattices as obtained from the recognizer can contain hundreds of alternates for each spoken word. To address this challenge, MSRA has devised a technique referred to as Time-based Merging for Indexing, which brings lattice size down to about 10 times the size of a corresponding text-transcript index. This is orders of magnitude smaller than the raw lattices.
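To give a feel for why merging by time shrinks a lattice, here is a deliberately simplified Python sketch. It is not MSRA's Time-based Merging for Indexing algorithm, whose details are in the papers cited above; it only assumes that hypotheses of the same word starting within a small tolerance of each other can be collapsed into one index entry with their posteriors added. The function name, the tolerance, and the data are hypothetical.

```python
def merge_by_time(entries, tolerance=0.1):
    """Collapse same-word hypotheses whose start times lie within `tolerance`
    seconds of the first hypothesis in their cluster.

    entries: iterable of (word, start_time_seconds, posterior).
    Returns a list of merged (word, start_time_seconds, posterior) tuples.
    """
    merged = []
    # Sorting groups identical words together, ordered by start time.
    for word, start, posterior in sorted(entries):
        if merged and merged[-1][0] == word and start - merged[-1][1] <= tolerance:
            _, first_start, total = merged[-1]
            # Same word, nearly the same time: keep one entry, add the posteriors.
            merged[-1] = (word, first_start, min(1.0, total + posterior))
        else:
            merged.append((word, start, posterior))
    return merged


# Made-up raw lattice fragment: many near-duplicate alternates per spoken word.
raw = [
    ("lattice", 5.00, 0.20), ("lattice", 5.02, 0.25), ("lattice", 5.05, 0.15),
    ("lettuce", 5.01, 0.10), ("lettuce", 5.04, 0.05),
    ("lattice", 9.30, 0.70),
]
compact = merge_by_time(raw)
```

Here six raw entries shrink to three: one "lattice" near 5.0 seconds, one "lattice" at 9.3 seconds, and one "lettuce". The alternates that matter for search survive, while the near-duplicates that differ only by a few hundredths of a second no longer take up index space.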

More information

For more information on MAVIS, please contact mmms@microsoft.com.