Peering into Future of Cloud Computing
February 24, 2009 10:00 AM PT

On Feb. 24, Rick Rashid, senior vice president of Microsoft Research, announced a new research organization called Cloud Computing Futures (CCF), focused on reducing the operational costs of data centers and increasing their adaptability and resilience to failure. The group, led by Dan Reed, director of Scalable and Multicore Systems, will strive to lower hardware costs, power consumption, and the environmental impact of such facilities. Reed recently found time to discuss the new venture.

Q: What are you announcing, and why now?

Reed: Cloud Computing Futures is a new initiative in Microsoft Research to improve the efficiency of the scalable computing hardware and software infrastructure needed to deliver cloud services. Data centers and their services have grown in size and importance as Microsoft has shifted to a software-plus-services model in which an increasing number of new applications run in part, or entirely, in the “cloud” and are delivered to clients via the Internet.

Microsoft and competitors such as Yahoo, Amazon, Google, and IBM have been building cloud-computing infrastructure and new software at a rapid pace to serve the large number of potential users. Microsoft’s business now depends on an ever-expanding network of massive data centers: hundreds of thousands of servers, petabytes of data, hundreds of megawatts of power, and billions of dollars in capital and operational expenses. Because these data centers are being built with hardware and software technologies not designed for deployment at such massive scale, many of today’s data centers are expensive to build, costly to operate, and unable to provide all the services needed for emerging applications: resilience, geo-distribution, composability, and graceful recovery.

The goal of the CCF project is to identify, create, and evaluate new, potentially disruptive innovations that can enable new software and application capabilities while also reducing the cost of building and operating cloud services. The CCF project started with a key concept: treat the data center as an integrated system—a holistic entity—and optimize all aspects of hardware and software. As a result of this work, Microsoft will be able to deliver a wider range of new, innovative services more efficiently.

This work builds on deep technical partnerships and collaborations across Microsoft—Microsoft Research, Global Foundation Services, Cloud Infrastructure Services/Azure™, and product teams—and we are working with an array of hardware-technology providers and companies.

Q: From a broad perspective, what is driving the research you are pursuing?

Reed: Two broad factors drive this research. The first is the shift by Microsoft and the software industry to delivering services along with their software. The term “services” encompasses a broad array of Internet delivery options that extend far beyond browser access to remote Web sites. At one end are Web 1.0 applications (Hotmail®, Messenger, search, and online commerce sites) and Web 2.0 applications such as social networking. At the other end is an emerging suite of more sophisticated applications, such as business intelligence and rich games, that improve fundamentally when local clients are connected to services. Such connections enable entirely new features such as a new generation of immersive, interactive games; augmented-reality tools; and real-time data analysis and fusion. To provide services, a company must have a large number of computers housed in one or more data centers.

The second factor driving this research is the way cloud services and their support infrastructures are constructed. Today, they are assembled from vast numbers of PCs, packaged slightly differently, connected by the same networks used to deliver Internet services. Building data centers using standard, off-the-shelf technology was a great choice in the beginning. It let the Internet boom race ahead without the need to develop new types of computers and software systems. But the resulting data centers and software were not designed as integrated systems and are less efficient than they should be. One common analogy is that if we built utility power plants the way we build data centers, we would start by going to Home Depot and buying millions of gasoline-powered generators.

Energy efficiency and green computing are always at the forefront of our research, as being in harmony with our environment is a key tenet for the group—and for Microsoft, as well.

Many researchers have seen an opportunity to make major improvements in the way data centers and cloud services are built, but this type of research and technology transfer is difficult because the efforts often cross many research disciplines. Effective research requires changes to both hardware and software, and the resulting prototypes must be constructed and tested at a scale difficult for small teams. For this reason, the CCF team is taking an integrated approach, drawing insights and lessons from Microsoft’s production services and data-center operations, and partnering with researchers and product teams worldwide.

Q: What’s the career arc that led you to focus on data centers?

Reed: I have spent 25 years in academia, leading research groups in high-performance computing and spearheading creation of the world’s largest unclassified computing infrastructure for scientific research, the National Science Foundation’s TeraGrid. As leader of the National Center for Supercomputing Applications, the birthplace of the Web browser, I also helped elevate commodity clusters as the primary computing platform for computational science. I’ve been an active leader of national science policy, having recently chaired the review of U.S. computing research policy for the federal government. This background in science and technology policy, hardware and software research, and deployment of large-scale computing infrastructure gives me the experience to integrate hardware, software, applications, and policies in industry partnerships to create the next generation of cloud-services infrastructure.

Q: When did this work begin?

Reed: The CCF project began a little over a year ago, when I joined Microsoft after meetings with Craig Mundie and Rick Rashid. Craig and Rick asked me to create a research-and-development team to explore the design of cloud-services infrastructure. We hired Jim Larus as our second member about 10 months ago and have grown rapidly since then.

Q: What do you hope to achieve in the near term, and what are your strategies for getting there?

Reed: The commodity components and handcrafted software currently used to build cloud services introduce costly inefficiency into Microsoft’s business. Designs based on comprehensive optimization of all attributes offer an opportunity to create novel solutions that produce fundamental improvements in efficiency. Our strategies for getting there are:

  • Creating new hardware and software prototypes.
  • Advancing the holistic design philosophy.
  • Innovating with instrumentation and measurement, data acquisition, and analysis.
  • Engaging Microsoft product groups and outward-facing properties.

Our goal is to reduce data-center costs by a factor of four or more while accelerating deployment, increasing adaptability and resilience to failures, and transferring ideas into products and practice. To date, we have focused our attention on four areas, though our agenda spans next-generation storage devices and memories, new processors and processor architectures, system packaging, and software tools:

  • Low-power servers: The computers (“servers”) used to support cloud services are some of the fastest, most power-hungry computers built. The common wisdom has been to use the fastest computers because the workload is potentially huge and purchasing, installing, maintaining, and operating computers is a complex task, so the fewer the machines, the better. But other computers, such as laptops, are far more energy-efficient, as measured in operations per joule, and can complete a unit of work with far less electricity and less cooling. These computers are not as fast as servers, though, and more of them are required to deliver the same service.

CCF has built two server clusters using low-power, Intel Atom chips and is conducting a series of experiments to see how well they support cloud services and how much their use can reduce the power consumed by those services. For example, power-efficient computers have low-power states, such as a laptop’s sleep and hibernate modes, that greatly reduce power consumption. We have built an intelligent control system called Marlowe that examines the workload on a group of computers and decides how many of them should be asleep at any time to reduce power consumption while still meeting the service’s acceptable level of performance.

In addition, we have worked with the Hotmail® team to evaluate the utility of low-power servers for the Hotmail® service. These experiments—the Cooperative Expendable Micro-Slice Servers prototype—have shown that overall power consumption can be reduced compared with standard servers while still delivering the same quality of service.

  • Improved networks: The networks that connect the computers in data centers use the same hardware and software as the rest of the Internet. It is great technology, but many of the design decisions that make it possible to transmit traffic across the globe to a vast, rapidly changing collection of computers are inappropriate for a cloud-service computing infrastructure consisting of a large, but fixed, collection of computers in a single room. Data-center networks are costly and impose many constraints on communications among data-center services, making writing cloud-service software far more difficult.

We have been working with researchers from Microsoft Research on several approaches to data-center networking. The most mature of these is Monsoon, which uses much of the existing networking hardware but replaces the software with a new set of communications protocols far better suited for a data center. This work will not only lead to more efficient networks, but by relaxing the constraints of existing networks, it also will open new possibilities to simplify data-center software and to build more robust platforms.

  • Orleans software platform: The software that runs in the data center is a complicated, distributed system. It must handle a vast number of requests from across the globe, and the computers on which the software runs fail regularly—but the service itself should not fail, even though the software is continually changing as the service evolves and new features are added. Orleans is a new software platform that runs on Microsoft’s Windows® Azure™ system and provides the abstractions, programming languages, and tools that make it easier to build cloud services.
  • Future cloud applications: To test the CCF hardware prototypes and the Orleans software platform, we are exploring future application scenarios that go beyond our current cloud workloads. These scenarios integrate many ideas from across Microsoft in areas such as computer vision, virtual reality, and natural-language processing.

Q: How do you assess the progress you have made thus far?

Reed: We’ve gotten off to a fast start. Our initial efforts validate the approach of combining hardware and software innovation. The Marlowe system would not have been possible without building our own low-power servers, and Monsoon requires both hardware and software innovation to develop a new network. We have working prototypes of both systems, and we will be demonstrating both during TechFest this year. We already are designing their successors with the benefit of our experience building and measuring these systems.

Q: What sorts of challenges are you encountering?

Reed: Our challenges are complexity, scale, and the rapid pace of change. Complexity arises from the deep interdependence of design choices in infrastructure, hardware, service software, and applications. Changes in any one can affect the others. Hence, a major element of our work is measuring the effects of these interdependencies by constructing a series of prototypes, each of which tests one or two new ideas while holding other aspects constant. The best of these ideas then will be combined and evaluated again.

The sheer scale of cloud infrastructure makes testing ideas challenging. Many of the issues only arise at scale, and the prototypes must be large enough to be tested using realistic workloads in current environments.

Finally, the rate of ferment and change in software services means we are chasing a moving target. We must be careful not to design tomorrow’s solutions for yesterday’s applications. Hence, we are tracking the evolution of applications, just as we track the change in hardware and software technologies.

Q: Tell me about your demos in this year’s TechFest.

Reed: The first uses low-power Intel Atom processors originally designed for use in netbooks and other mobile applications. This experiment built a server from these low-power processors to evaluate their effectiveness on typical cloud-service tasks. In addition to requiring far less power (5 watts vs. 50 to 100 watts for a processor typically used in a data center), low-power processors also have quiescent states that consume little power and can be awakened quickly. These states are used in the sleep and hibernate features of laptops and netbooks. Our current Atom-based system draws 28 to 34 watts when running, but in the sleep or hibernate state it draws 3 to 4 watts, roughly a tenfold reduction in the power consumed by idle machines.
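As a rough check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. This is an illustration only, not CCF's tooling: the function names are hypothetical, the use of range midpoints is our simplification, and we assume an awake but idle machine draws about as much as a running one.

```python
# Power figures quoted in the interview, in watts (low, high).
RUNNING_W = (28, 34)   # machine awake and running
ASLEEP_W = (3, 4)      # machine in sleep or hibernate

def midpoint(watt_range):
    """Midpoint of a (low, high) range; a simplification, not a measurement."""
    low, high = watt_range
    return (low + high) / 2

def reduction_bounds(running, asleep):
    """Smallest and largest possible ratio of running power to sleep power."""
    worst = running[0] / asleep[1]   # lowest running draw vs. highest sleep draw
    best = running[1] / asleep[0]    # highest running draw vs. lowest sleep draw
    return worst, best

def cluster_power(n_awake, n_asleep):
    """Estimated cluster draw in watts, using midpoint figures (assumption)."""
    return n_awake * midpoint(RUNNING_W) + n_asleep * midpoint(ASLEEP_W)

if __name__ == "__main__":
    worst, best = reduction_bounds(RUNNING_W, ASLEEP_W)
    print(f"reduction factor between {worst:.1f}x and {best:.1f}x")
    print(f"10 awake: {cluster_power(10, 0)} W; 4 awake + 6 asleep: {cluster_power(4, 6)} W")
```

Under these assumptions the reduction falls between 7x and about 11x, which is consistent with the "roughly ten times" figure, and putting 6 of 10 lightly loaded machines to sleep cuts the estimated draw by more than half.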

The other demonstration highlights the power of an intelligent control system that can determine when to put a processor to sleep and when to awaken it to service the workload. This problem has two interesting challenges. The first is to estimate how many processors are necessary to handle a given workload while still responding to every request in a timely manner. (By analogy, how many checkout clerks should be at the cash registers?) The second is to anticipate the workload in the near future, since it takes 5 to 15 seconds to awaken a processor from sleep and 30 to 45 seconds from hibernation. The system needs to hold some processors in reserve and to anticipate the workload 5 to 45 seconds in the future to ensure that sufficient servers are available.
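The two challenges described above, sizing the pool and looking far enough ahead to cover wake-up latency, can be illustrated with a small Python sketch. It is not the demonstrated system: the function names, the per-server capacity, the target utilization, and the linear extrapolation are all assumptions made for illustration. Only the wake-up times come from the interview.

```python
import math

# Worst-case wake-up latencies quoted in the interview, in seconds.
WAKE_FROM_SLEEP_S = 15
WAKE_FROM_HIBERNATE_S = 45

def servers_needed(request_rate, per_server_capacity, target_utilization=0.5):
    """Fewest awake servers that keep average utilization at or below target.

    per_server_capacity and target_utilization are hypothetical tuning values.
    """
    if request_rate <= 0:
        return 0
    return math.ceil(request_rate / (per_server_capacity * target_utilization))

def forecast_rate(history, lead_seconds, sample_interval):
    """Extrapolate the request rate lead_seconds ahead from the last two samples.

    A deliberately naive linear forecast; a real system would use something richer.
    """
    if not history:
        return 0.0
    if len(history) < 2:
        return float(history[-1])
    slope = (history[-1] - history[-2]) / sample_interval
    return max(0.0, history[-1] + slope * lead_seconds)

if __name__ == "__main__":
    rates = [100, 120]  # requests per second, sampled every 5 seconds (made-up data)
    ahead = forecast_rate(rates, WAKE_FROM_SLEEP_S, sample_interval=5)
    print("forecast:", ahead, "servers to have awake:", servers_needed(ahead, 20))
```

The forecast horizon is set to the wake-up latency on purpose: a server that takes 15 seconds to wake must be triggered by the load expected 15 seconds from now, not the load observed at the moment.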

We have solved this problem with a simple, closed-loop control system. It works by taking regular measurements of the system, such as CPU utilization, response time, and energy consumption; combining this data with the estimated future workload; then adjusting the number of servers in each power state.
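One iteration of such a measure-and-adjust loop might look like the Python sketch below. It is a minimal illustration under stated assumptions, not the system shown at TechFest: the `Pool` type, `control_step`, the 0.5 utilization target, and the one-server reserve are hypothetical, and only CPU utilization is used as input, whereas the description above also mentions response time and energy consumption.

```python
import math
from dataclasses import dataclass

@dataclass
class Pool:
    awake: int    # servers currently running
    asleep: int   # servers in a low-power state

def control_step(pool, cpu_utilization, forecast_growth=1.0, target=0.5, reserve=1):
    """Return the pool sizes to use for the next control interval.

    cpu_utilization: measured average utilization of awake servers (0.0 to 1.0).
    forecast_growth: expected near-future load relative to now (1.0 = unchanged).
    target, reserve: hypothetical tuning knobs for headroom.
    """
    total = pool.awake + pool.asleep
    # Current work expressed in "fully busy servers", scaled by the forecast.
    load = cpu_utilization * pool.awake * forecast_growth
    # Enough servers to run at the target utilization, plus a standby reserve.
    wanted = math.ceil(load / target) + reserve
    # Never drop below one awake server or exceed the machines we own.
    wanted = max(1, min(total, wanted))
    return Pool(awake=wanted, asleep=total - wanted)

if __name__ == "__main__":
    pool = Pool(awake=8, asleep=12)
    for measured in (0.75, 0.25, 0.25):  # made-up utilization readings
        pool = control_step(pool, measured)
        print(pool)
```

Each call closes the loop once: a new measurement comes in, the desired number of awake servers is recomputed, and the difference is made up by waking or sleeping machines before the next measurement.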

Q: Over the long haul, if your work is successful, how will the future look different from today?

Reed: The CCF approach is to construct a series of prototypes, each testing a small number of ideas, but driven by a coherent vision that culminates in an integrated technology suite. The result will be new hardware and software infrastructure that simplifies cloud and software-plus-services application development, lowers data-center capital and operating costs, and shapes both vendor technologies and Microsoft internal practices.

From the consumer’s perspective, if we are successful, software can be delivered in a new, better manner. You’ll still have computers—probably many more than today, in many different forms—but all of your information will be available on all of the computers, without a thought to installing or upgrading the software on any of your devices. The dividing line between what runs on your computer and what runs in the cloud will not be apparent to most users, and it might vary depending on factors such as how much battery power remains in your device or the state of the network.

From the perspective of the cloud-service provider, we expect to see many more cloud-service infrastructures scattered throughout the world, to provide better service to all countries. They will be far more power-efficient than today; the software will be more resilient, adaptive, and reliable and will require far less effort to install, maintain, and repair.