Kinect Audio: Preparedness Pays Off
April 14, 2011 11:00 AM PT

It always helps to be prepared. Just ask Ivan Tashev.

A principal software architect in the Speech group at Microsoft Research Redmond, Tashev played an integral role in developing the audio technology that enabled Kinect for Xbox 360 to become the fastest-selling consumer-electronics device ever, with eight million units sold in its first 60 days on the market.


Kinect represents part of Microsoft’s deep investment in natural user interfaces, which make computing intuitive to use and able to do far more for users. On April 13, Scott Guthrie, Microsoft corporate vice president of the .NET Developer Platform, announced features of the impending Kinect for Windows non-commercial software-development kit during MIX11, a three-day, web-focused conference in Las Vegas. Tashev himself will be speaking that day about his work in a talk entitled “Audio for Kinect: From Idea to ‘Xbox, Play!’”

Such prominence isn’t earned easily. In the case of the audio functionality for Kinect, it took a combination of preparation and patience to do the trick.

“I spent pretty much my entire career in Microsoft Research,” Tashev says, “knowing that, sooner or later, people would be talking to their computers. I was absolutely sure they would not want to wear a headset. So, from my first day with Microsoft Research, I’ve been working on the problem of hands-free sound capturing from a certain distance in normal conditions and having enough clean sound in the output, good enough for telecommunications and for speech recognition.

“I didn’t know which product would be interested in this. These technologies were designed in Microsoft Research, and, in our experiments, they worked on a small set of data, well enough that we wrote a scientific publication.”

Enter Alex Kipman, general manager of Xbox Incubation within Microsoft’s Interactive Entertainment Business. He was driving the development of Kinect, the revolutionary product that enables controller-free command of an Xbox. He encountered Tashev in 2008 during Microsoft Research’s annual TechFest showcase, and several months later, Kipman decided to follow up.

“We came to Microsoft Research,” he recalls, “and asked: ‘Can you help us make a system that can do speech recognition without having to push a button to talk? We’re all about no buttons, so you can’t have a push-to-talk system.’

“And we said: ‘The system needs to be listening to us 100 percent of the time. You can leave this on for days, and it still needs to work.’

“We said: ‘We want a system that can do speech recognition at a distance of four meters. You’re not going to have a captive audience a few feet in front of a microphone. People can be anywhere about four meters’ distance, and they should still be able to talk and be recognized.’

“And then we said: ‘Our environment is all about people having fun. If we do our jobs correctly, every single person is going to be having fun, so there’s a lot of noise from the loudspeakers, and the system still needs to pick out the signal when that person to whom you’ve been listening all day says, “Xbox, play movie.”’”

Many people might have been daunted by such a formidable laundry list, but not Tashev.

Ivan Tashev

“The most difficult part to resolve was overcoming the problem of the microphones hearing the sound from loudspeakers,” he explains. “First, gamers tend to listen to very loud sounds. Second, the Kinect device is closer to the loudspeakers than to the humans speaking in the room. The sound from the loudspeakers is way louder than the normal human voice.”

The algorithm for this is called acoustic echo cancellation, and it’s included in virtually all speakerphones. But in normal speakerphone usage, the loudspeaker sound level is about the same as a human voice. In the Kinect-usage scenario, the loudspeakers are louder, the humans are farther away, and the loudspeaker signal is not a single mono signal but stereo or surround sound.

That meant that Tashev not only needed to suppress loudspeaker echoes an order of magnitude louder than usual, but also had to create a stereo acoustic-echo-cancellation algorithm, a longstanding research problem. And he had to cope with reverberation, which makes speech recognition even more difficult from four meters’ distance, and to capture an enormous dynamic range, one that could tease out, amid blaring loudspeakers, the soft voice of a young child.
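For readers who want a concrete picture of acoustic echo cancellation, here is a minimal sketch of the textbook single-channel approach, a normalized least-mean-squares (NLMS) adaptive filter. It is not the Kinect pipeline, which the article says had to handle stereo and surround signals. The function names, the four-tap `echo_path`, the step size, and the noise-free test signal are all assumptions chosen for illustration.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the loudspeaker echo from the mic signal.

    far_end: samples sent to the loudspeaker (the known reference signal)
    mic:     samples captured by the microphone (echo plus any near-end sound)
    Returns the residual signal after the estimated echo has been removed.
    """
    weights = [0.0] * taps   # running estimate of the room's echo path
    history = [0.0] * taps   # most recent loudspeaker samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                       # what remains after subtraction
        norm = sum(h * h for h in history) + eps    # normalization keeps the step stable
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def demo(n=4000, seed=0):
    """Simulate a loudspeaker-only recording and report echo power before and after."""
    rng = random.Random(seed)
    far_end = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    echo_path = [0.6, 0.3, -0.2, 0.1]   # hypothetical room response, not measured data
    mic = [
        sum(c * far_end[i - k] for k, c in enumerate(echo_path) if i - k >= 0)
        for i in range(n)
    ]
    residual = nlms_echo_cancel(far_end, mic)
    tail = n // 4                        # measure after the filter has had time to adapt
    power_in = sum(s * s for s in mic[-tail:]) / tail
    power_out = sum(s * s for s in residual[-tail:]) / tail
    return power_in, power_out


if __name__ == "__main__":
    before, after = demo()
    print(f"echo power before: {before:.4f}, after: {after:.3e}")
```

In this idealized setup the microphone hears only the echo, so the filter can drive the residual close to zero. With two or more correlated loudspeaker channels, the same approach no longer has a unique solution for the echo paths, which is one reason the stereo case is regarded as a research problem.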

Tashev was inured to such challenges. While a professor at the Technical University of Sofia in his native Bulgaria, he had worked with a student on microphone arrays, which enabled the localization of a human speaker with just a couple of microphones. Upon joining Microsoft Research in 2001, Tashev began work on beam-forming research that helped lead to the Microsoft RoundTable, a videoconferencing device with a 360-degree camera. And his algorithm enabled Windows Vista to offer integrated microphone-array support.

He also had pursued some pure research over the years. He began exploring multichannel acoustic-echo cancellation in 2007, and while it remained uncertain who would want such technology, or why, Tashev remained intrigued.

Ready for Action

Thus, when Kipman contacted Tashev in 2009 to inquire about the demo on surround-sound acoustic-echo cancellation seen during TechFest, the researcher might have been caught by surprise, but he certainly wasn’t unprepared.

By May, Tashev had been embedded in the Xbox team, had been briefed about the new product, and was starting to design the audio pipeline for Kinect.

“When we derived the requirements for the pipeline,” he says, “there was a small meeting, and we found that even if we could take care of all the problems, we still needed an acoustic-echo canceller 10 times better than normal industrial devices have.

“The first reaction was to cut the feature, but Alex said, very firmly: ‘We’re shipping this. We have to make it work.’”

That took a lot of teamwork—and dedication.

“We had the technologies,” Tashev says, “and Xbox has an exceptional engineering team. But even under those conditions, we needed the determination of every member of that team, from developer, tester, and program manager on up to the general manager’s level. Without their hard work and determination, Kinect wouldn’t have happened.”

The speech research posed a significant hurdle for the audio team working feverishly on Kinect.

“Speech is a serious beast,” Tashev acknowledges. “It is still more science and art than engineering. Speech has its strong points, and it has scenarios where it is weaker. If I have to select one of 30,000 songs in my collection, speech is a perfect modality. I can send a speech query like ‘Play me that song about submarines by the Beatles,’ and our existing technology will find, relatively quickly, all songs with ‘submarine’ in the metadata and filter it with ones with ‘Beatles.’ We might end up with three or four candidates.

Voice modality preferred
A combination of speech and gestures gives Kinect a user-friendly interface.

“But then what? If this is a speech-only interface, we’ll have to listen to the computer read us the title of all four songs. That’s annoying—not a good modality for speech.”

On the other hand, in the scenario of selecting from a short list, gestures work perfectly.

“You can just point and select,” Tashev says. “In this simple exercise, speech is good in one part, and gesture is good for another. Combining those, adding sounds and graphics, and we have something very powerful: a multimodal user interface. If properly designed, it can provide an intuitive and natural way of communication between the computer and the human—and does not require any controllers or buttons.”
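The song-selection scenario Tashev describes can be sketched in a few lines: a spoken query narrows a large library to a short list by keyword, and a pointing gesture picks one entry from that list. This is only an illustration under stated assumptions. The `Song` type, the `speech_query` and `gesture_select` helpers, and the tiny made-up library are hypothetical, and a real system would work from speech-recognition results and skeletal tracking rather than plain strings and an index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Song:
    title: str
    artist: str


def matches(song, keyword):
    """True if the keyword appears anywhere in the song's metadata (case-insensitive)."""
    text = f"{song.title} {song.artist}".lower()
    return keyword.lower() in text


def speech_query(library, keywords):
    """Speech step: keep only songs whose metadata contains every recognized keyword."""
    candidates = list(library)
    for keyword in keywords:
        candidates = [song for song in candidates if matches(song, keyword)]
    return candidates


def gesture_select(candidates, pointed_index):
    """Gesture step: the user points at one entry of the short on-screen list."""
    return candidates[pointed_index]


# A made-up library standing in for the 30,000-song collection.
LIBRARY = [
    Song("Yellow Submarine", "The Beatles"),
    Song("Yellow Submarine (Remastered)", "The Beatles"),
    Song("Hey Jude", "The Beatles"),
    Song("Submarine Dreams", "Hypothetical Band"),
]

# Keywords assumed to come from recognizing "that song about submarines by the Beatles".
shortlist = speech_query(LIBRARY, ["submarine", "Beatles"])
choice = gesture_select(shortlist, 1)

if __name__ == "__main__":
    print([song.title for song in shortlist])
    print("selected:", choice.title)
```

Speech does the part it is good at, cutting thousands of items down to a handful, and the gesture resolves the remaining ambiguity without the system having to read each title aloud.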

Thus, when he began work in earnest with the Xbox team, Tashev came complete with a handful of valuable technologies. At that point, it was time to transfer them into an actual product. But things did not necessarily go smoothly. A series of acoustical consultants didn’t believe the target functionality was achievable. But after a series of refinements, the assembled engineers were able to devise an acoustical-analysis program using basic algorithms. It was slow, but it worked.

“Once we can analyze,” Tashev notes, “we can optimize.”

End-to-end optimization: results
Using a computing cluster, architects Ivan Tashev and Wei-ge Chen carefully tuned the acoustical models for Kinect.

They put the program on a large computing cluster and began to vary the parameters of the microphones’ design and placement. After several days of optimization on the cluster, a measurement using the final plastic mold tested at quite close to the desired product specs. By this point, Tashev was acting as the point of contact for almost all Kinect audio issues. In May 2010, the audio-processing pipeline was ready. The next month, the speech recognizer was trained, but the results needed significant improvement.

Tashev went back to Microsoft Research to summon Wei-ge Chen, software architect. They spent four months iterating and testing, working closely with the Microsoft Tellme group, and by Sept. 26, test results on the latest acoustical models met the shipping criteria.

Thirty-four days later, Kinect was shipped to an eager public.

“Microsoft didn’t get to this position by accident,” Tashev smiles. “It happens when you have technologies designed over the years, when you have your teams built—an excellent engineering team from Xbox and very good research from Microsoft Research—and those teams are willing and trained and encouraged to work together. Nothing serious—no breakthrough—happens by accident.”

And, as Tashev celebrated his 10th year with Microsoft Research, and with his contributions having helped Kinect transform consumer electronics, he took a moment to reflect.

Support System

“The algorithms I was supposed to deliver were mine,” he says, “designed by me and people I worked with. But I have been encouraged over the years by my managers, Anoop Gupta, Rico Malvar, and Alex Acero.

“In Xbox, I’d add Alex Kipman, the visionary behind Kinect, the person who said, ‘We want this.’ And, at the end, Ben Kilgore was the person who said: ‘We have a challenge with our audio code. We should ask Microsoft Research to help us ship it.’”

The assistance, Kipman makes it abundantly clear, was most appreciated.

“If you know anything about the space of speech recognition,” he says, “you need a few improbables made possible to make this happen. There was one person who really stood up to the challenge and said: ‘You know what? This stuff is improbable, but we’re going to make it happen.’ It’s Ivan, one of the wicked smart people that made the improbable possible.”

Now, with Kinect having shipped, Tashev and others responsible can rest easy, right?

Hardly.

“This is just the beginning,” Tashev says. “We have a lot of new, amazing technologies in the pipeline that will allow us to improve the user experience and the ability for better communication between the humans and the gaming console. I constantly work with those guys to add more interesting tools for the game developers so they can make more fascinating games.”

Nevertheless, the Kinect experience will remain, for Tashev, unforgettable.

“I’ve had a lot of experience working on real systems,” he says, “but never something on the scale of Kinect. It was a really big, end-to-end system with extremely high requirements. Not many people get an opportunity to put together their work for the last seven years into a product that becomes a huge breakthrough. It’s been a unique opportunity.”