Mihai Budiu distinctly remembers the buzz surrounding Kinect for Xbox 360 when it was unveiled in 2009 as Project Natal. The public announcements and demos about the technology immediately captured his imagination.
“I got very excited about those demos,” Budiu says, “and I kept telling everyone how cool they were. In retrospect, I’ve come to realize that the internal publicity was very beneficial for the project, because it galvanized so much interest. Everyone who saw the demos wanted to be involved.”
Budiu, a researcher at Microsoft Research Silicon Valley, soon got his chance. His colleague Oliver Williams had been involved with Project Natal for some time. When Williams approached Budiu in July 2009 about solving some of the challenges with the project, Budiu jumped at the opportunity.
“I said, ‘Yes, absolutely,’” Budiu says, “and I put everything else I was doing on hold.”
Budiu’s research focuses on large-scale computation platforms. His first such project at Microsoft Research was Dryad, the execution engine underlying the tool that Bing uses to analyze search-engine logs. Dryad has since been extended with DryadLINQ, now renamed LINQ to HPC, a programming layer atop Dryad that lets programmers write code for large computing clusters using commonly available tools and languages such as Microsoft Visual Studio and Microsoft .NET. At first glance, large-scale parallel processing would seem to have little relevance to an Xbox.
“The problem was skeletal tracking,” Budiu explains. “Kinect has to recognize body parts. Humans do this recognition easily, but our brains have been trained since birth to recognize humans of various shapes in different positions. For Kinect, rather than describe to the computer how to recognize a human, we had the system ‘learn’ what human movements look like by feeding it a large set of examples.”
When Budiu joined the project, Kinect’s machine learning needed to scale up to learn from a million sample images within a few hours. Each successive training session enabled the system to recognize variations in body size, clothing, hair, and body position with greater accuracy. The Kinect team wanted to know whether DryadLINQ would give them faster, more robust, and more scalable parallel processing.
“I spoke with Jamie Shotton, the researcher at Microsoft Research Cambridge working on body-part-recognition technology for the image-processing software,” Budiu recalls. “I looked at the code he sent me and knew it would be feasible to implement on DryadLINQ. I managed to sketch out a solution within two hours, and then we spent two weeks adapting the solution to work with all the existing data structures and libraries that the computer-vision and machine-learning experts had developed. Then, in order to meet their goal of processing millions of images within hours, we had to scale up the implementation to run the full data set.”
After seeing this initial proof of concept, the Kinect team decided to deploy machine learning on Dryad and DryadLINQ. They also increased the requirements for the solution. Machine learning improves as developers are able to work with more data, so large-scale computation platforms are essential. The quality of recognition advances with each additional layer of learning, but each new layer the team added essentially doubled the computational requirements. Not only that, but the build team needed to run the machine-learning algorithm multiple times to explore combinations of tuning parameters.
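To get a feel for how quickly those requirements compound, here is a minimal back-of-the-envelope sketch. It assumes, as the paragraph above describes, that each added layer roughly doubles the work and that every combination of tuning parameters needs its own full training run. The function name and the sample figures are illustrative assumptions, not numbers from the Kinect team.

```python
def relative_training_cost(layers: int, parameter_combinations: int = 1) -> int:
    """Estimate total training work relative to a single one-layer run.

    Assumption (hypothetical model): each additional layer doubles the
    work of a run, and each tuning-parameter combination is a separate run.
    """
    if layers < 1 or parameter_combinations < 1:
        raise ValueError("layers and parameter_combinations must be >= 1")
    cost_per_run = 2 ** (layers - 1)  # doubling per added layer
    return cost_per_run * parameter_combinations


if __name__ == "__main__":
    # Illustrative values only: compare a few depths across 6 tuning runs.
    for depth in (1, 5, 10):
        print(depth, relative_training_cost(depth, parameter_combinations=6))
```

Under this simple model, a handful of extra layers combined with a modest parameter sweep multiplies the workload many times over, which is why the size of the cluster mattered so much.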
“We ended up solving a problem 100 times larger than initially envisioned,” Budiu says. “This stretched DryadLINQ’s capabilities to the limit, many times to the point where it would break down. This allowed us to uncover some performance bugs in the underlying platform, which we were able to fix fairly quickly. We felt proud that DryadLINQ turned out to be as versatile as we had planned. The Kinect project was proof that, with the right tools, we could solve a problem that initially did not appear to be an easy fit for parallel computation.”
During the proof-of-concept phase, the training algorithms ran on a cluster of research machines. Now, the Kinect team has its own cluster of servers dedicated to running the training algorithms.
Interestingly, even though the training algorithm receives huge amounts of input data, the actual output that drives the skeletal-recognition process is relatively small and easily loaded into an Xbox.
Williams, who had worked at Microsoft Research Cambridge before moving to Microsoft Research Silicon Valley, is intimately familiar with this work. Since 2002, he has worked on computer vision, focusing on the use of machine-learning techniques for efficient visual tracking.
It was when he returned for a visit to Microsoft Research Cambridge in early 2009 that Williams became interested in Shotton’s skeletal-tracking work. Then, in May 2009, Williams and Shotton were asked to work with the Kinect team on a skeletal-tracking system. To meet tight deadlines, the researchers worked full-time in close collaboration with the Kinect incubation group.
There were key challenges to solve. While advances such as Kinect’s 3-D camera, developed by the Xbox team, made the computer-vision problem somewhat easier, the researchers still needed to develop a system that would work reliably for everyone.
“The road to skeletal tracking is paved with solutions that track one specific human—usually the developer,” Williams says. “We needed to track all body types and sizes, plus deal with more than one body at a time, because Kinect has to distinguish between players and backgrounds, identify which pixels belong to which players, recognize body parts—such as arms, legs, head, and hands—and then output simplified ‘skeletons’ that define the players’ body configurations.”
Getting to a workable solution meant iterating through numerous approaches. Williams gives full credit to the Xbox incubation team for the unflagging enthusiasm and support that enabled them to arrive at a feasible solution on time.
“I got to work very closely with them,” Williams recalls. “The collaboration really brought home the fact that we were not dealing with a single issue. Kinect contains disparate methods and ideas that had to be integrated and made to work as part of a holistic system.”
The team had to implement skeletal tracking in an extremely efficient way so that it would perform in real time on an Xbox. It wasn’t the overall performance envelope of the Xbox that was the problem—it is an extremely well-architected box, Williams says—but the fact that only a small fraction of the machine was available for tracking tasks.
“The real challenge in terms of computational resources,” Williams says, “was that the console not only had to deliver real-time skeletal tracking, but also had to support a modern, fully-fledged game title at the same time. Thankfully, the Xbox team is home to some amazingly talented optimization ninjas, who deserve all the glory for getting everything to run so well.”
Williams found the experience extraordinarily rewarding. There were algorithmic and performance challenges to solve, many of which required novel approaches. It was a matter of testing and validation.
“I came to realize that the value research brings to such scenarios is the ability to triage ideas and combine them well,” Williams says. “That doesn’t mean there wasn’t novelty there, just that to take components in isolation is to miss the more interesting story, which for me is at the system and architecture level.”
Also impressive to Williams was the immense engineering effort required to deliver Kinect as a consumer-ready product with repeatable, reliable performance across diverse operating conditions.
“Through my interactions with both the Xbox incubation group and the platform team,” Williams says, “I have learned a great deal about the commitment, professionalism, and quality standards it takes to make something ready to ship.”
“Kinect was a truly amazing project in that it dramatically pushed the technology envelope of what is known to be achievable,” he says. “Most product teams follow a rigorous schedule, but this project had to craft efficient solutions to a series of difficult problems, almost on the fly. The project was managed to be flexible enough to accommodate some solutions that were unknowns at the beginning. The way all the pieces came together was absolutely amazing. Kinect is by far the most exhilarating project I have ever been involved with. I feel very fortunate to be a part of this effort.”
How has the experience affected the researchers?
“First of all, the research projects I’m engaged in now are much more architectural in scope,” Williams says, “and less about specific technical details—although these never cease to be important. Next, after seeing firsthand the engineering process required to ship a product like Kinect, my approach to building and testing systems, even if it is only for a proof of concept, is now much more grounded, more mindful of what it takes to scale.”
Budiu offers a different perspective.
“For me,” he grins, “one of the best outcomes is that now I can explain to my son what I do for a living!”