By Janie Chang
May 26, 2011 9:00 AM PT
Not long ago, Internet content was mostly text-based, with search tools supporting the need to index text efficiently and browsers providing the ability to search within a document for every instance of a keyword or phrase.
Now, multimedia content has exploded onto the scene, thanks to technology that makes it easy to create and share multimedia. High-quality video cameras have become affordable, and every phone contains a camera. Low storage and bandwidth costs make it viable to upload and access large multimedia files, and the growth of social networking provides venues for consumers to share their experiences via audio, video, and photos. Search engines now find images, audio, and video files that have been tagged with text.
For short audio or video clips, a textual description of the content may be sufficient. But when faced with a two-hour video, or a collection of hundreds or even thousands of videos, users lack the equivalent of a document’s Find function, which would let them skip directly to the spots where a keyword or phrase is mentioned.
“Imagine having to read through every text document,” says Behrooz Chitsaz, director of IP Strategy for Microsoft Research, “just to find the one paragraph that contains the one topic of relevance. This is basically how we are consuming speech content today, and we want to change that.”
Multimedia search, he says, is much the same as it was 10 years ago: heavily textual, with limited capabilities for searching audio or video files for specific words. Where such capabilities do exist, they usually are applied to content such as popular movies or song lyrics. Technology that automates speech recognition exists, but making an audio or video file search-ready often still requires a person to listen to it and transcribe it.
Hence Chitsaz’s enthusiasm for MAVIS, the Microsoft Research Audio Video Indexing System. MAVIS comprises a set of software components that use speech-recognition technology to enable efficient, automated indexing and searching of digitized spoken content. By focusing on speech recognition, MAVIS enables search not only within audio files, but also within video. Footage from meetings, presentations, online lectures, and other typically non-closed-captioned content all benefits from a speech-based approach.
How significant is the functionality envisioned by MAVIS? Chitsaz and Microsoft Research Asia’s Frank Seide, senior researcher and research manager, and Kit Thambiratnam, lead researcher, are conducting technical previews to determine that. MAVIS has been running on a trial basis on digital archives for the U.S. states of Georgia, Montana, and Washington, as well as the U.S. Department of Energy, the British Library, and, most recently, CERN, the European Organization for Nuclear Research.
“I started to realize this was really important,” Chitsaz recalls, “when the state of Washington contacted us. They had audio files of House of Representatives sessions from the ’70s and ’80s that they were transferring from tapes to digital files. They had digitized the audio but didn’t know what was in them. It was like they had backups but no way to restore them.”
As a matter of policy, governmental organizations have to archive meetings for public access, but the archives are of little use if they can’t be searched. Manual transcriptions are expensive, and it is unreasonable to expect state residents and legislators to listen through hours of recordings to find relevant information.
MAVIS was able to index the thousands of files automatically. Now, governmental users and state residents have the ability to search for topics of interest by keyword or phrase. The search results enable users to retrieve from a list the precise moments in specific sessions when the keyword was mentioned and jump directly to those spots. Because MAVIS is integrated into the archive’s text-search infrastructure, the search mechanism and user experience are the same as searching textual documents.
“MAVIS has made legislative research far easier and faster,” Chitsaz says. “Users can search through tens of thousands of session hours and find discussions on a particular bill or issue. They can discover exactly how debates went or the original, historical reasons behind certain decisions. This has an enormously positive impact on government transparency.”
Governmental archives are an ideal starting point for implementing MAVIS. Magnetic tapes start to degrade after about 30 years, a factor that is driving digital-preservation initiatives. With such measures, there is an increased need for technologies that search and categorize multimedia files. The content of such archives is also ideal, because the recordings are “speech-recognition friendly”—mostly speech, with minimal background noise.
Background noise is only one of the challenges MAVIS researchers are trying to solve in the quest for high-accuracy speech recognition. Their goal is for MAVIS to handle general conversational speech, which means coping with variables such as accents, ambient noise, reverberation, vocabulary, and language.
“Our brains can filter out noises,” Chitsaz notes, “but it’s hard for a computer. Vocabulary is also difficult. For instance, domains such as health care have specific terminologies. There’s also context, which helps humans understand—but that’s hard to introduce to a computer system. We were confronted with all those variables. Speech recognition isn’t new—it’s all about developing techniques that make it highly accurate.”
An important step forward has been a technique developed by researchers at Microsoft Research Asia called Probabilistic Word-Lattice Indexing, which improves accuracy for indexing conversational speech. Rather than keeping only the single best transcript, lattice indexing retains each recognized word’s confidence rating along with alternate recognition candidates.
“When we recognize the audio track of a video,” Seide explains, “we keep the alternatives. If I say ‘Crimean War,’ the system may think I’ve said ‘crime in a war,’ because it lacks context. But we retain that as an alternative. By keeping the multiple word alternatives as well as the highest-confidence word, we get much better recall rates during the search phase.
“We represent word alternatives as a graph structure: the lattice. Experiments showed that when it came to multiword queries, indexing and searching this word lattice significantly improved document-retrieval accuracy compared with plain speech-to-text transcripts: a 30- to 60-percent improvement for phrase queries and a more-than-200-percent improvement for queries consisting of multiple words or phrases.”
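The lattice idea can be sketched in a few lines of Python. Everything here — the arc structure, the confidence values, and the phrase matcher — is a hypothetical illustration, not MAVIS’s implementation; the point is that a query can match either the recognizer’s best path or a retained alternative.

```python
from dataclasses import dataclass

@dataclass
class Arc:
    start: float   # seconds into the recording
    end: float
    word: str
    conf: float    # recognizer confidence, 0..1

# A toy lattice for the spoken phrase "Crimean War": the recognizer's
# best path heard "crime in a war", but the alternative is retained
# as an overlapping arc rather than being thrown away.
lattice = [
    Arc(12.0, 12.4, "crime", 0.55), Arc(12.4, 12.6, "in", 0.50),
    Arc(12.6, 12.7, "a", 0.48),     Arc(12.0, 12.7, "crimean", 0.45),
    Arc(12.7, 13.1, "war", 0.90),
]

def find_phrase(lattice, phrase, gap=0.1):
    """Return (start_time, score) hits where the words of `phrase`
    occur on consecutive arcs, using any retained alternative,
    not just the single best transcript."""
    words = phrase.lower().split()
    hits = []
    for first in (a for a in lattice if a.word == words[0]):
        t, score, ok = first.end, first.conf, True
        for w in words[1:]:
            nxt = next((a for a in lattice
                        if a.word == w and 0 <= a.start - t <= gap), None)
            if nxt is None:
                ok = False
                break
            t, score = nxt.end, score * nxt.conf
        if ok:
            hits.append((first.start, score))
    return hits

print(find_phrase(lattice, "Crimean War"))    # matches via the alternative arc
print(find_phrase(lattice, "crime in a war")) # the best path is also searchable
```

Because each arc carries a timestamp, every hit already identifies the moment in the recording to jump to — the same property the state-archive search relies on.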
Another challenge is handling the broad range of potential topics.
“Unfortunately, speech recognizers are pretty dumb and can only recognize words they’ve seen before,” Thambiratnam explains. “That means many useful terms like names and technology jargon probably aren’t going to be known to our speech recognizer. We leverage Bing search to try to solve that, essentially trying to guess up front what words are most relevant for a video and then finding data on the web that we can use to adapt the vocabulary of our speech recognizer so that it does a better job on a particular file.”
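The adaptation step Thambiratnam describes can be sketched roughly as follows. The vocabulary, sample text, and helper function are invented for illustration — in the real system the related text is fetched via Bing search rather than supplied as a string.

```python
import re
from collections import Counter

# Hypothetical base vocabulary of the recognizer (real systems
# hold tens or hundreds of thousands of words).
base_vocab = {"the", "at", "a", "large", "collider", "particle",
              "beam", "energy"}

def adapt_vocabulary(base_vocab, related_web_text, top_n=5):
    """Find frequent words in web text related to a video (e.g. text
    found by searching on the video's title) that the recognizer does
    not yet know, so they can be added before decoding that file."""
    words = re.findall(r"[a-z]+", related_web_text.lower())
    oov = Counter(w for w in words if w not in base_vocab)
    return [w for w, _ in oov.most_common(top_n)]

# Invented snippet standing in for text retrieved from the web.
text = ("The Large Hadron Collider at CERN accelerates protons. "
        "CERN physicists study the Higgs boson. Higgs Higgs boson")
print(adapt_vocabulary(base_vocab, text))
```

Names and jargon such as “Higgs” and “CERN” surface at the top of the list precisely because they are frequent in the related text yet unknown to the recognizer — which is what makes them worth adding.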
Another piece of information critical to the usability of MAVIS is timing information: The system keeps timestamps so that search results include the times in an audio or video stream where the word occurs.
Accurate speech recognition is compute-intensive and therefore an ideal application for a cloud-computing environment. The MAVIS architecture takes advantage of the Windows Azure platform to handle the speech-recognition process. The actual multimedia content can live behind the content owner’s firewall, and the organization can submit thousands of hours of audio and video for indexing by the speech-recognition application running on Windows Azure, without having to invest in upgrading its in-house computing infrastructure.
While MAVIS provides tools that make it easy to submit audio and video content for indexing, just as critical to usability is the format of the results: they come back in a file that can be imported into Microsoft SQL Server for full-text indexing. This enables audio or video content to be searched just like any other textual content.
“Compatibility with SQL Server is very important,” Chitsaz comments, “because it means that searching for spoken words inside audio or video files becomes just like searching for text in an SQL Server database, a process familiar to IT organizations. We are not introducing a new search mechanism. They can maintain the same search infrastructure and processes.”
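The table-plus-text-query pattern Chitsaz describes can be sketched with SQLite standing in for SQL Server (the table name, columns, and session data are all invented for illustration; the real deployment uses SQL Server full-text indexing):

```python
import sqlite3

# Sketch of the idea: MAVIS-style output -- recognized words with
# timestamps -- is loaded into an ordinary database table, so spoken
# content is queried exactly like text.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE transcript (
    session_id TEXT, start_sec REAL, text TEXT)""")
rows = [
    ("house-1978-03-02", 412.0, "the committee discussed the property tax bill"),
    ("house-1978-03-02", 955.5, "public safety funding was deferred"),
    ("house-1981-11-10", 88.2,  "amendments to the property tax exemption"),
]
con.executemany("INSERT INTO transcript VALUES (?, ?, ?)", rows)

# An ordinary text query returns the sessions and the exact moments
# where the phrase was spoken, so a player can jump straight there.
hits = con.execute(
    "SELECT session_id, start_sec FROM transcript "
    "WHERE text LIKE ? ORDER BY session_id", ("%property tax%",)).fetchall()
print(hits)  # [('house-1978-03-02', 412.0), ('house-1981-11-10', 88.2)]
```

Because the transcript lands in a standard table, the archive’s existing search infrastructure, tooling, and IT processes apply unchanged — the point Chitsaz emphasizes.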
A demo of MAVIS at the Microsoft Video Web demonstrates the team’s implementation. The site contains more than 15,000 MSNBC news videos. Searches are fast, and the results enable direct access into a video stream. Users can see textual information, as well as a timeline that shows where a keyword or phrase occurs.
Even for Chitsaz, who is intimately familiar with the technology, the information MAVIS delivers still manages to surprise. During Iceland’s volcanic eruptions in April 2010, he used MAVIS to search Microsoft’s video archives to see what was available on the topic of “volcano.” He found more than he expected: lectures that included volcano imagery, material on sensors for tracking volcanic activity, and interviews with people who had experienced eruptions. When searching government archives, he has found interesting discussions on topics such as taxes, public safety, and the environment that still have a bearing on communities today.
MAVIS, Chitsaz says, is a disruptive technology that will affect the way we consume speech content, much the same way web searches affected the consumption of text content on the Internet.
“Each time I experience the value of MAVIS for myself,” he says, “it occurs to me that the textual information originally associated with the files did not include the term I was searching on. Without MAVIS, I would not have known about the information locked in those files.”