Patience is oftentimes a virtue for computer-science researchers. Sometimes, their explorations don’t work out as expected. Sometimes, promising projects provide solutions for needs not yet apparent. Sometimes, software or hardware must evolve to the point where implementation becomes practical.
On other occasions, though, researchers’ projects dovetail nicely with a problem that clearly requires a quick improvement.
Such is the case for Albert Greenberg, David A. Maltz, and Parveen Patel of Microsoft Research Redmond. The three are co-authors of a paper called VL2: A Scalable and Flexible Data Center Network, which will be delivered during SIGCOMM 2009, the annual flagship conference of the Association for Computing Machinery’s Special Interest Group on Data Communications.
The conference, to be held in Barcelona, Spain, from Aug. 17-21, features six papers from Microsoft Research, representing its Redmond, Asia, and Silicon Valley labs and written with assorted academic colleagues. One of them—White Space Networking with Wi-Fi Like Connectivity, written by Victor Bahl, Ranveer Chandra, and Thomas Moscibroda of Microsoft Research Redmond, along with Rohan Murty and Matt Welsh of Harvard University—was named the best paper of SIGCOMM 2009.
The VL2 paper (co-written by Navendu Jain, Srikanth Kandula, and Sudipta Sengupta, all of Microsoft Research; Parantap Lahiri of Microsoft’s Global Network Services group; and Changhoon Kim, formerly a Microsoft Research intern from Princeton University and now a member of the Windows Azure team) promises to garner special attention for its potential to improve data-center performance.
“We’ve talked a lot to the people who are the users and the creators of Microsoft’s existing data-center network over the last couple of years,” says Greenberg, a principal researcher with the Networking Research Group, “and learned about what they now have and the problems with what they now have. Then we teamed with them to create a much better solution.”
As for when this solution, which exploits the capabilities of new, cost-efficient Ethernet switches, can be put to good use, Greenberg and colleagues are thinking in the short term.
“In networking, there are times when things change,” Greenberg says. “Then there are long periods of no change.”
This, he argues, is one of the windows of opportunity to effect a significant upgrade of existing data-center architecture and, thereby, make a big difference.
Most parts of a data center are designed to scale out by the addition of more cheap components, such as adding servers to solve hard problems like Web search, rather than to scale up by adding power and complexity to a few expensive components. But that’s not the case with the network in a conventional data center, which concentrates traffic in a few hardware components that require frequent upgrades and replacements to keep up with demand.
“Data-center networks use the same structure we use in enterprise networks,” Greenberg says, “and the structure has really big weaknesses. It’s recognized pretty widely around the industry that this has to change.
“It’s a rare opportunity we have to seize, because it will otherwise pass, and we’ll be in one of these periods of stasis for a while. We think what distinguishes our design is that we can get the change done now. We can use existing capabilities in the switches, make some changes on the end systems, create a few more components under our control, and make the change now. We think we can get the improvements that we need immediately.”
In the data-center business, in which multiple services are hosted simultaneously, the key is agility, being able to assign any server to any service. The more agile a data-center network is, the more efficient the utilization of money and resources.
“Data-center servers have low utilization, a well-known fact,” explains Patel, a research software-development engineer, “and part of the reason is the network gets in the way. Once we achieve agility, any server can be assigned to any service, and we should be able to lower the cost of the data center and improve the utilization of resources we have.”
Adds Greenberg: “If we can get that agility, then we lower the cost of the data center tremendously. It gives the customer the illusion of having infinite resources that can grow and shrink whenever they want, and it gives us the ability to deliver that at a low cost. The importance of agility was made clear through interactions with Yousef Khalidi on the Azure team.”
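The cost argument behind agility can be illustrated with a little arithmetic. The sketch below is only an illustration: the service names and hourly demand figures are made up, and it assumes that without agility each service must own enough servers for its own peak, while with agility a shared pool only has to cover the peak of the combined demand.

```python
# Hypothetical hourly demand (servers needed) for three services.
# These numbers are invented for illustration, not measurements.
demand = {
    "search": [80, 60, 30, 20],
    "mail":   [20, 30, 70, 40],
    "batch":  [10, 20, 30, 90],
}


def static_partition_size(demand):
    """Servers needed when each service owns a fixed slice sized for its own peak."""
    return sum(max(series) for series in demand.values())


def shared_pool_size(demand):
    """Servers needed when any server can be assigned to any service:
    the pool only has to cover the peak of the combined demand."""
    combined = [sum(hour) for hour in zip(*demand.values())]
    return max(combined)


static_size = static_partition_size(demand)
pooled_size = shared_pool_size(demand)
print(f"static partitions: {static_size} servers")
print(f"shared pool:       {pooled_size} servers")
```

With these invented figures, fixed partitions need 240 servers while a shared pool needs 150, because the services peak at different hours. The gap disappears if every service peaks at the same time, so the benefit depends on how demand actually varies.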
Today’s data-center architectures make that a challenge. Existing architectures don’t provide enough capacity between the servers they interconnect. Data-center networks do little to keep a flood of traffic to one service from affecting others. And the routing design assigns servers topologically significant IP addresses and divides servers among virtual local-area networks, imposing a huge configuration burden when traffic must be reassigned among services. The human involvement necessary to manage the reconfiguration limits the speed at which this can be accomplished.
Developers working on data-center service applications don’t want to have to work around the limitations of networks. They’d much prefer to assume that all the servers assigned to their service, and only those servers, are connected by a single Ethernet switch.
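The Microsoft researchers propose the use of a Virtual Layer 2 (VL2). The goals are to enable data-center networks to provide uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics, so that each service behaves as though its servers were connected by a single Ethernet switch.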
“There are certain principles for designing a network,” says Maltz, a researcher also with the Networking Research Group, “and our research has been formulating those principles, deciding which are most important, and evaluating whether they can be feasibly achieved. That’s what I think we’ve done.
“One of the principles we found was important was to structure the network to free the application and management inside the data center from having to worry about the details of the network. That gave rise to the idea of wanting to offer this Virtual Layer 2 notion to applications that are going to run inside your data center.”
Maltz cites a couple of other principles.
“Another is the notion of separating names from locations,” he says. “That is part of implementing this Virtual Layer 2 concept. We wanted applications to use any IP addresses or names they want to refer to each other. That should not be tied to the underlying physical layout of the servers. We’re trying to build a cloud infrastructure, and we want to be abstracting away as much of the physical details of the racks and the switches as possible from the application—sort of a divide-and-conquer approach.
“A third principle is how to build a high-speed interconnection that gives uniform high capacity and low latency between all the services. You can use techniques like creating a very dense mesh with multiple paths and using a mechanism called Valiant Load Balancing to direct traffic across the network.”
The latter technique serves to spread traffic over the fabric of a network in random fashion.
“I pick a random intermediate router somewhere in my network,” Maltz says, “and I force my packets to bounce through that. This might seem counterintuitive, but it means you can route any traffic matrix equally well. If you optimize your network for handling any particular traffic pattern, often that means there will be some traffic patterns that the network will do really badly on. Our approach is to deliberately, randomly spread traffic and arrange for this average-case traffic matrix to get excellent handling by optimizing for the general case and using randomization to force traffic patterns to be more general.”
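The random-bounce idea Maltz describes can be sketched in a few lines. This is a toy model, not the VL2 implementation: the switch and rack names are hypothetical, each flow is bounced through one uniformly random intermediate switch, and the code counts how many flows land on each switch under a deliberately skewed traffic pattern.

```python
import random
from collections import Counter


def vlb_path(src, dst, intermediates, rng):
    """Return a path that bounces through one randomly chosen intermediate switch."""
    via = rng.choice(intermediates)
    return [src, via, dst]


def intermediate_loads(flows, intermediates, rng):
    """Count how many flows cross each intermediate switch."""
    loads = Counter({switch: 0 for switch in intermediates})
    for src, dst in flows:
        _, via, _ = vlb_path(src, dst, intermediates, rng)
        loads[via] += 1
    return loads


rng = random.Random(0)  # fixed seed so the run is repeatable
intermediates = [f"int{i}" for i in range(4)]  # hypothetical switch names

# A skewed pattern: every flow is headed for the same destination rack.
flows = [(f"rack{i % 20}", "rack99") for i in range(10_000)]

loads = intermediate_loads(flows, intermediates, rng)
for switch, count in sorted(loads.items()):
    print(switch, count)
```

Even though every flow targets one rack, the load on the intermediate switches comes out close to even, because the choice of intermediate ignores the destination. The sketch says nothing about the final hop into the destination rack, which is still limited by that rack's own links.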
There are other efforts under way to accomplish similar goals. One is from Microsoft Research Asia, to use end servers instead of switches to do the high-speed forwarding. This idea has potential, particularly as multicore technology extends server capabilities.
Another proposal, from the University of California, San Diego, also uses commodity switches. But that project modifies the switches instead of the end systems, so it can’t be built using switches currently available.
“You have to standardize those changes,” Patel explains, “convince switch vendors, as well as standards bodies, to have them standardized. The novelty of our approach is we can do it right now just by making small changes to the server operating system. In the data center, customizing the server operating system is natural and common. Azure’s RedDog OS is a perfect example. Applications are unaware that any of this is happening underneath.”
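The article does not spell out the mechanism, but one way to picture the separation of names from locations, carried out on the end system, is a lookup that the sending server performs before a packet leaves the machine. The sketch below is an assumption-laden illustration: the `Directory` class, the `send` function, and the addresses and switch names are all hypothetical, and the application-visible address stays fixed while only the directory entry changes when a server moves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str        # application-visible address of the sender
    dst: str        # application-visible address of the receiver
    payload: bytes


@dataclass(frozen=True)
class Encapsulated:
    outer_dst: str  # location (e.g. the switch the receiver currently sits behind)
    inner: Packet   # original packet, untouched


class Directory:
    """Hypothetical mapping from application addresses to current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, app_addr, location):
        self._locations[app_addr] = location

    def lookup(self, app_addr):
        return self._locations[app_addr]


def send(packet, directory):
    """What an end-system shim might do: look up the location and wrap the packet."""
    return Encapsulated(outer_dst=directory.lookup(packet.dst), inner=packet)


directory = Directory()
directory.register("10.0.0.5", "tor-17")
pkt = Packet(src="10.0.0.1", dst="10.0.0.5", payload=b"hello")

before = send(pkt, directory)
directory.register("10.0.0.5", "tor-42")  # the server is moved to another rack
after = send(pkt, directory)

print(before.outer_dst, after.outer_dst, after.inner.dst)
```

The application keeps addressing `10.0.0.5` throughout; only the outer location changes. That is the sense in which applications can remain unaware of the physical layout.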
Highly available distributed systems for data-center server management, such as Bing’s Autopilot, have already demonstrated that a huge, resilient computational fabric can be created from economical components. The Microsoft Research team is taking this approach to data-center networking.
A couple of years ago, the researchers met with people working on Windows networking and Microsoft Global Foundation Services (GFS), the group that runs Microsoft’s largest data centers, to discuss issues involving data-center networking.
“We learned,” Greenberg recalls, “about the need for the high capacity, the separation of names and locations to give us agility … all these things that are not available today. We were thinking, ‘Never again will we build something that doesn’t have those capabilities.’ ”
The researchers wrote a workshop position paper, then went to work.
“I think we innovate by doing,” Greenberg says. “We just started building it. We kept coming toward the same goal, but using our better understanding of what our building blocks are. We didn’t have to reinvent things we could get from the industry.”
Of course, working at Microsoft, with access to experts steeped in the vagaries and intricacies of such networks, provided a rich source of feedback.
“It came from talking to people that appreciate the current design,” Greenberg adds. “They know what’s good about it, and they know what they don’t like. We built and demoed this, and we could tell there was excitement around the company. We didn’t just read a lot of papers and decide this was the way to go. We also looked at what we’ve got. Microsoft has both research and operations, so we have incredible advantages toward understanding the real problem and to go after it.”
That’s not to say that there weren’t certain challenges along the way.
“While rethinking the data-center network,” Patel says, “one of the challenges was determining what network technologies in existing switches to rely on and what to rebuild. We built three different prototypes using different technologies in switches to figure out which ones would give us the optimal balance of price, reliability, and performance.”
Such temporary obstacles, Maltz says, are what make research so interesting.
“Every time you set out to build something,” he says, “you have a whole gamut of challenges, from formulating ideas, to setting out detailed designs you can code up and implement, to bringing together all the equipment to run the experiments and make them real.
“There are challenges in each of those aspects, and it’s part of the fun. It’s one of the reasons I am an experimental computer scientist. I actually get to build things and then see them run fast. That’s a lot of fun.”
As Greenberg says, the stakes are high.
“The biggest challenges are still to come,” he says. “As we get this thing on its feet, there are interesting and major Microsoft initiatives that got kicked off by this, and we are dedicated to making it work.”
Given the size of today’s data centers—and the size of the investment Microsoft and others are making to make them operate at peak efficiency—it’s no surprise that much remains to be done. The researchers are partnering with internal groups like Azure, GFS, and Bing, as well as enterprise customers and vendor collaborators. Prototype, Incubation, Production: That’s the mantra.
“To me,” Greenberg says, “success is having an impact on the way Microsoft runs its data centers. We’ve had some degree of academic success in getting our papers out there, and that’s great, but we really want to change the data-center network architecture and see it realized at Microsoft. In joint work with Bing, some of the ideas have been realized already.”
Patel says it’s about making the network provide what’s necessary for data-center servers to operate optimally and avoid diminishing that capacity.
“The data-center network supports the notion of agility,” he says, “and that’s what we want to see, that the properties that are hosted in our data centers are able to programmatically acquire and release resources from a huge common pool, depending on the demand, and the network just works; it never gets in the way. They’re able to achieve their goal, and the network provides the capacity, the performance, and the security that’s necessary.”
And the potential, Maltz notes, could be enormous.
“The biggest data-analysis clusters built feasibly today have about 10,000 servers,” he says. “Even though we might have a data center with 100,000 servers in it, we can’t actually apply those to one data-analysis problem, even if we wanted to. If we’re successful, you could imagine taking all 100,000 servers and have them work on a very large-scale data-analysis problem.
“Any time you get an order-of-magnitude increase in the number of servers you can apply to any particular problem, that qualitatively changes the type of algorithms you can explore, the types of analysis you can do. I would hope this will have payoffs for other fields that depend heavily on data analysis, everything from biology to scientific computing.”
In addition to the six papers Microsoft Research is contributing to SIGCOMM 2009, three of its researchers are serving as session chairs during the conference. Stefan Saroiu of the Networking Research Group is chairing the Datacenter Network Design session, and teammate Ratul Mahajan will act as session chair for Performance Optimization. Thomas Karagiannis of the Cambridge Systems and Networking group at Microsoft Research Cambridge will act as chair for the Network Management session.
Papers to be delivered during SIGCOMM 2009 that include authors from Microsoft Research:
BCube: A High Performance, Server-Centric Network Architecture for Modular Data Centers
Chuanxiong Guo, Microsoft Research Asia; Guohan Lv, Microsoft Research Asia; Dan Li, Microsoft Research Asia; Haitao Wu, Microsoft Research Asia; Xuan Zhang, Tsinghua University; Yunfeng Shi, Peking University; Chen Tian, Huazhong University of Science and Technology; Yongguang Zhang, Microsoft Research Asia; and Songwu Lu, UCLA.
De-Anonymizing the Internet Using Unreliable IDs
Yinglian Xie, Microsoft Research Silicon Valley; Fang Yu, Microsoft Research Silicon Valley; and Martín Abadi, Microsoft Research Silicon Valley and the University of California, Santa Cruz.
Detailed Diagnosis in Enterprise Networks
Srikanth Kandula, Microsoft Research Redmond; Ratul Mahajan, Microsoft Research Redmond; Patrick Verkaik, the University of California, San Diego; Sharad Agarwal, Microsoft Research Redmond; Jitu Padhye, Microsoft Research Redmond; and Paramvir Bahl, Microsoft Research Redmond.
Matchmaking for Online Games and Other Latency-Sensitive P2P Systems
Sharad Agarwal, Microsoft Research Redmond; and Jacob R. Lorch, Microsoft Research Redmond.
VL2: A Scalable and Flexible Data Center Network
Albert Greenberg, Microsoft Research Redmond; Navendu Jain, Microsoft Research Redmond; Srikanth Kandula, Microsoft Research Redmond; Changhoon Kim, Princeton University; Parantap Lahiri, Microsoft; David A. Maltz, Microsoft Research Redmond; Parveen Patel, Microsoft Research Redmond; and Sudipta Sengupta, Microsoft Research Redmond.
White Space Networking with Wi-Fi Like Connectivity
Paramvir Bahl, Microsoft Research Redmond; Ranveer Chandra, Microsoft Research Redmond; Thomas Moscibroda, Microsoft Research Redmond; Rohan Murty, Harvard University; and Matt Welsh, Harvard University.