Microsoft Research opened its Silicon Valley lab in August 2001, naming as its founding director Roy Levin, a veteran of Xerox’s legendary Palo Alto Research Center and the Digital Equipment Corp.’s Systems Research Center. In the five years since, Microsoft Research Silicon Valley has become an integral contributor to Microsoft products and has established a reputation as a respected participant in what is, perhaps, the world’s most vibrant research ecosystem.
Consider the words of Christopher Payne, Microsoft corporate vice president of Windows Live Search:
“Microsoft Research’s lab in Silicon Valley has greatly accelerated our ability to ship our home-grown algorithmic search engine, first in February of ’05 and more recently with Live Search. The team is world-class and has been great to work with. We see even more great things coming out of the Silicon Valley lab.”
As Microsoft Research Silicon Valley prepares to celebrate its fifth anniversary, Levin took a few minutes to discuss the achievements of the lab’s past—and achievements yet to come.
Q: What were the goals you had for your lab at the outset?
Levin: The lab was started in 2001, and the explicit focus when we started the lab was distributed systems, which was a bit different from the other Microsoft Research labs. They all had a much broader compass, and we quite deliberately focused in one area, although that is, in itself, an area that allows a fair amount of breadth.
From the beginning, I felt it was important to have a spectrum of talent, representing not just different areas of technical expertise, but spanning theory to practice. Our initial hiring was with that in mind, and the core group of people we brought on in the first six months or so, about a dozen people, spanned a number of technical areas, as well as this theory-to-practice dimension.
I also felt it was important, given that we were a new organization, to be able to have something to show for ourselves relatively soon, so the initial projects we undertook were not large, long-range research activities. They were things that could be seen to have value at least within a reasonably short timeframe, a couple of years, and nevertheless were high-quality research projects. They weren’t just to do something the product groups might ship tomorrow.
The initial collection of projects had that character, and we were fortunate in that one of the first things we started working on, e-mail spam, was something that was of considerable interest to the product groups. Much of the work we ended up doing to recognize and eliminate spam caught the fancy of the product groups.
The interesting thing is that, in some cases, the key foundations of that work were laid in the early ’90s, almost a decade before the lab got started, from a quite theoretical point of view at that time, and yet that work had real application a decade later. That shows the importance of investing in theory as part of what you do.
Q: Were there other goals beyond doing something valuable and attainable in a relatively short term?
Levin: There were, of course, the goals that all the labs in Microsoft Research have: advancing the state of the art, transferring technology, and collaborating with academia.
We were also in a unique situation in that while we weren’t the first Microsoft Research lab outside of Redmond, we were in Silicon Valley, on Microsoft’s only other major development campus at that time. From the beginning, we were focused on how we could work with the local product groups, not just the product groups in Redmond. We wanted to be seen as an innovator for Microsoft in Silicon Valley, one of the leading innovation centers in the world.
The outward focus on the community was important, and to that end, I’ve been a member of the organization on the campus of all the business leaders who try to improve Microsoft’s relationship with the community. That was an explicit goal, too.
Although we’re a quite small organization, I think we have a greater visibility than you would expect. It’s beneficial to the company, because we are viewed as being an innovator, we are viewed as being an open organization that tries to work with colleagues around the Bay Area and participate in the technical life of Silicon Valley.
Q: How far has the lab come in achieving those goals?
Levin: I’m never satisfied, but I’m very pleased with where we’ve gotten in five years. We founded the lab in a time when, frankly, industrial research in computer science was contracting everywhere except Microsoft, and there was, inevitably, some skepticism in the professional community about whether industrial research was a viable notion anymore. I thought we had something to prove there, and the only way you prove it is by doing top-rate, top-flight research and showing people you can sustain that, you can bring in people who are going to make it their career to do that.
If you look at the collection of people we’ve managed to hire over five years, that speaks for itself. It certainly speaks to the professional community; many of these people are quite well-known, and they’re recognized as leaders in their technical specialties.
As the rest of Microsoft Research has done, we have published in the best places. We’ve played a major role in program committees and journals, indicating that we are at the forefront of those fields, and we’re doing it in this setting of industrial research. We’re on the map, and that was where I wanted to get in five years. I think we’ve gotten there.
We’ve managed to transfer more technology than I expected. I was pleasantly surprised at how well technology transfer works at Microsoft Research. I can’t take credit for the hard work needed to set the assumptions about how that worked. That was done during the 10 years before I came. The way in which the research organization collaborates with the product groups is, to my way of thinking, the best in the industry by far, better than anything that I’d experienced previously.
I have been pleasantly surprised by the number of transfers that we’ve been able to do, by the receptiveness of the product groups, and by the depth of the collaboration that has been possible.
I’ll give you a concrete example. Not long after we came, there was a lot of emphasis on Microsoft getting into the search business. It was recognized that we needed to really make an investment there. MSN® was using an external search engine, not one of its own creation, and decided it needed to build one. MSN partnered with Microsoft Research and with our lab to build that engine in a very collaborative way. That was something I had not experienced previously. Mostly, you think of technology transfer as a pipeline, and that was not the case here. We put our shoulders to the wheel together from the outset.
As a result, the partnership between Microsoft Research and MSN has been very extensive, one that has, particularly in the search area, become very deep and much stronger.
Q: When you reflect back on the last five years, what are the significant successes in furthering the goals of the lab and Microsoft Research as a whole?
Levin: I already mentioned one of them, the spam-fighting work we did early on. I’ve alluded to another one, our work on the search engine. Let me say a little bit more about that, because it’s more than just building a search engine. The engine itself was, of course, necessary, and one of the key people involved in that was Michael Isard, who did quite a tricky component of the back end of the search engine.
But there was also a lot of work that was done, given that you had an engine, to improve the quality of the results that come out of that engine by setting the relevance of the pages appropriately to the query. That’s an area that is, of course, essential. If your search engine is going to be of use, it’s got to produce relevant results.
One of the problems, of course, is that search engines need to be able to discriminate between pages that are useful and pages that are, in effect, spam. Web-page spam is a serious problem. The search engine needs to be able to automatically classify pages as spam or not, or perhaps some degree in between, and adjust its rankings appropriately.
The technology for doing that was developed by Marc Najork, Mark Manasse, and Dennis Fetterly, who found a systematic way of identifying Web-page spam. It’s been incorporated into the MSN search engine, and it is a significant contributor to producing quality results. That team has been exploring a number of ways to do a better job of assessing the relevance of Web pages to a particular query.
Underlying a lot of that is the notion of similarity. One of the concepts that comes up over and over again is “How similar is this to that?” It might be two images, it might be two pieces of text, it might be a collection of links, it might be the layout of material on a Web page; there are lots of different ways in which pages can be similar. If you’re trying to improve the relevance of results, you probably want to collapse pages that are very similar and not show them all.
Mark Manasse worked in this area for a number of years to develop techniques for taking things from which you extract features. You think of the list of features as being a vector, and you compare two vectors and determine how similar they are. That similarity technique has been applied in many places in search work.
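The feature-vector comparison Levin describes can be illustrated with a minimal sketch. The interview does not say which features or which similarity measure the lab used, so both choices here are assumptions made for illustration: word counts as the features and cosine similarity as the measure. The function names are hypothetical.

```python
import math
from collections import Counter


def features(text):
    """Hypothetical feature extractor: lowercase word counts (an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Return the cosine of the angle between two sparse count vectors.

    The result is 1.0 for vectors pointing the same way and 0.0 when the
    vectors share no features (or when either vector is empty).
    """
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


page_1 = features("cheap flights to Paris book cheap flights now")
page_2 = features("book cheap flights to Paris now")
page_3 = features("storage systems built from commodity disks")

near_duplicate = cosine_similarity(page_1, page_2)
unrelated = cosine_similarity(page_1, page_3)
```

A search engine could, under these assumptions, collapse results whose score exceeds some threshold; a suitable threshold depends on the features chosen and is not given in the interview.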
Another area where we’ve had a lot of research is storage, which, particularly in these days of Web-based services, is almost part of the air we breathe. You can’t do anything without having a significant storage infrastructure. Yet although the disks have become a commodity item, the way in which you put them together and build a reliable infrastructure that is cheap to operate, fails gracefully, and has all the right properties is not at all a commodity activity. It’s still very much a custom design activity. If you try to do those sorts of things with commodity components, you will fail.
That’s an area in which we have done quite a bit of research in trying to understand how to build systems that are scalable both in terms of capacity and manageability, as well as reliability, and how to make those systems practical enough that they can be deployed.
We’ve done several research projects in this area. The principal people involved in these projects are Chandu Thekkath and Lidong Zhou, though there have been several others. The work has been a collection of experimental systems and fairly practical designs, including the design for the next-generation storage for Hotmail®. That certainly validates the practicality of the research.
Another area where we’ve had a lot of influence, and it touches on a different part of distributed systems, is in effective parallel programming. One of the challenges is that when it comes to building single CPUs, Moore’s Law is running out. It’s not practical to continue to build faster and faster single CPUs. What we can do is to use the chip area to build multiple CPUs that sit next to each other, but that’s not the dominant programming paradigm today.
There’s a real change coming to our industry, the first stages of which we already feel: how to make it relatively easy for programmers to write applications that will run well on this kind of parallel hardware.
We’ve done work in a couple of different places there. One area is in the programming-tools domain, where, using the programming techniques that have been around for a couple of decades, it’s very easy to make errors that are hard to find. Tools that help programmers avoid those errors are of real importance today. They’ve always been of interest, but now, as the number of people who are going to be writing these applications is growing rapidly, you need these kinds of tools.
We’ve worked on a project called RaceTrack. The name is a pun; it’s intended to deal with a particular kind of parallel-programming problem called a race, where programs make undisciplined access to memory in parallel and mess up the memory as a result. Detecting those bugs is extremely hard; this tool helps programmers to find them. The tool went from research idea to research prototype to something of interest to the product groups to something that is, in fact, very likely to ship in the next version of Visual Studio®. The principals in that are Yuan Yu and Tom Rodeheffer, who have pushed that tool through to a practical result, and it’s being used internally quite a bit.
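The kind of bug Levin calls a race can be shown with a small sketch. This is not RaceTrack and says nothing about how that tool works; it only illustrates the problem it targets, using Python threads and hypothetical function names. In the unsynchronized version, two threads can read the same value and both write back that value plus one, so an update is lost. Holding a lock around the read-modify-write removes the problem.

```python
import threading


def run_workers(increment, n_threads=4, n_iters=10_000):
    """Run `increment` in several threads against one shared counter."""
    state = {"count": 0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=increment, args=(state, lock, n_iters))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]


def racy_increment(state, lock, n_iters):
    """Undisciplined access: the lock is ignored, so updates may be lost."""
    for _ in range(n_iters):
        current = state["count"]       # read
        state["count"] = current + 1   # write; another thread may have written in between


def safe_increment(state, lock, n_iters):
    """Disciplined access: the read-modify-write happens under the lock."""
    for _ in range(n_iters):
        with lock:
            state["count"] += 1


safe_total = run_workers(safe_increment)
racy_total = run_workers(racy_increment)
```

The locked version always counts every increment. The unsynchronized version may or may not lose updates on a given run, which is exactly why such bugs are hard to find by testing alone.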
Q: Regarding the anti-spam work, what was it your lab did, and who was involved in the project?
Levin: That work is, as I say, rooted in theoretical work done back in the early ’90s by Cynthia Dwork and Moni Naor [of Israel’s Weizmann Institute of Science]. Cynthia is a member of our lab, and she was a principal in this work, but it was also a collaboration with several of our systems people, including Ted Wobber, Andrew Birrell, and others.
The question we were trying to address was how to identify spam. One line of research that was already a subject within Microsoft Research was learning filters. That work was done at Microsoft Research Redmond and involves looking at message content and similarity and saying, “This looks a lot like that, and that is spam, so this is going to be spam.”
That was not the work Cynthia and her colleagues did. Rather, they were looking at a different thing that was essentially founded in economics, a project called Penny Black. Suppose that, like the U.S. postal system, the sender had to pay to send a piece of e-mail. Then, what makes spam work, which is the ability to send vast amounts of it at essentially no cost, would be undermined.
The basic idea was to figure out ways in which one could charge the sender. The charge might be money, it might be computing time, it might be some resource that is in finite supply at the sender or that the sender would have to spend real money to get more of in order to make the payment stick.
What lay behind this whole thing was the notion of a computational puzzle: The sender has to solve a puzzle and to send the answer to the puzzle, along with the e-mail. The answer is very cheap to check by the recipient, so it’s easy to validate that this puzzle has been solved correctly. Think of that as the postage stamp. Then the question is: How hard is it to create a postage stamp? That’s where the interesting mathematics comes in, because you need to be able to do something that is provably expensive or at least sufficiently expensive.
That was the key idea: The recipient would simply reject something that didn’t have a stamp on it. Then you have a system in which the mail goes through only if the sender is willing to pay enough to create the stamp. I think it was a great idea. It probably has application in other domains, too, but spam was obviously a compelling one. The idea of computational puzzles has been added as an option. It’s not built in everywhere, but it’s added as an option in the about-to-ship Exchange 12.
Q: How has your lab been able to engage with academic communities in Silicon Valley and the Bay Area?
Levin: We have some first-rate universities in Silicon Valley, and the mission of Microsoft Research, which involves advancing the state of the art, inherently involves coupling with the academic community, because that is where you validate that you are, in fact, at the cutting edge.
We really want to be connected with those universities. We know many colleagues over many years of collaboration, in many cases predating the lab, at Stanford, at UC Berkeley, at UC Santa Cruz. Some of that work involves graduate students; we have interns from those universities practically every summer. Some involves faculty members themselves, where we have collaborations between individual researchers and faculty. And a number of our researchers teach courses at these universities.
In the area of security, there are fairly close connections between several members of our lab, including Cynthia, and faculty at Stanford. We have a close association with the RAMP project at Berkeley, which is trying to model next-generation computing architecture. Chuck Thacker is working with those folks to be able to emulate some of those architectures efficiently so we can understand in detail what it will be like to program those machines when they’re reduced to silicon.
We’ve had a research collaboration between Mark Manasse and Santa Clara University around storage. We worked for several years with Martin Abadi at the University of California at Santa Cruz, who’s interested in security, among other things. Martin recently joined us; he’s on leave from Santa Cruz. He’s been involved with Microsoft’s Trustworthy Computing academic advisory board for several years, so there’s a close tie there.
Q: Having made this much progress, where do you see the lab going over the next five years?
Levin: I don’t see any reason why we should alter the course that we’re on. The distributed-systems agenda has been good for us. We’ve made a lot of significant innovations in that area, and many of those innovations have turned out to be valuable to the company. I fully expect that we will continue along that path.
I think there are ways to broaden the agenda as we grow. We expect to be investing in the area of computer architecture, which is a natural one, given our distributed-systems focus.
I’m always on the lookout for opportunities. So much of what computing is about these days is collaboration with other fields, and opportunities to do that are particularly attractive to me. I think of computing as serving in the 21st century much of the role that mathematics did several centuries before. It was the handmaiden of the sciences. Now, I think computing is the handmaiden of the sciences. We see computational involvement in pretty much every one of the sciences.
Our group in San Francisco, which is led by Jim Gray, is focused on eScience, which is bringing computing technology to the sciences in places where it hasn’t already been embraced. Computing, of course, has been in the sciences for a long time, but there’s a lot more that could be done, and Jim’s work on the TerraServer has been a great example of that. I expect that work to continue.
If you look to the next five years, I expect to see a broadening of our engagement with particular scientific disciplines in ways that make computation—in particular, computation involving large amounts of data—become a real partnership between Microsoft Research and those individual sciences.
Q: When you look back on the last five years, what has been the Silicon Valley lab’s greatest accomplishment?
Levin: It has to be the people. A research lab is nothing if it isn’t its people. I’m very proud of the team that we’ve managed to assemble. We’ve attracted top-flight researchers in a variety of relevant disciplines with the expertise that we need to be able to create the innovations that we have and that we will. I think when people look to the lab, they see the people and the quality of the work that we produce.