Data Center Genome Project

The DC Genome Project is a joint project between Microsoft Research (MSR) and Microsoft Global Foundation Services (GFS). Its goal is to use data-driven and feedback control approaches to monitor, analyze, and improve the operational efficiency of data centers, to maximize their capacity utilization, and to minimize their environmental impact.

Genomotes

Genomotes are customized wireless sensor nodes for environmental sensing in data centers. They communicate over IEEE 802.15.4 radios. To ease deployment and reduce the number of contending wireless nodes, we adopt a chained master-slave design. The master node is a wireless node that also has a serial interface for communicating with the slave nodes. Each slave node has two serial interfaces: one up the chain and one down the chain.
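To make the chained design concrete, here is a minimal sketch assuming a simple poll-and-forward scheme: the master polls its chain over the serial link, each slave adds its own reading and forwards the poll further down the chain, and the master sends all readings in a single radio packet. The class names, packet layout, and polling scheme are hypothetical illustrations, not the Genomote firmware.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Reading = Tuple[int, float]  # (node_id, sensor value)


@dataclass
class SlaveNode:
    """Hypothetical slave: no radio, one serial link up the chain and one down."""
    node_id: int
    reading: float
    down: Optional["SlaveNode"] = None  # next slave further down the chain

    def collect(self) -> List[Reading]:
        # Report our own reading, then forward the poll down the chain and
        # append whatever the downstream nodes send back up.
        readings = [(self.node_id, self.reading)]
        if self.down is not None:
            readings.extend(self.down.collect())
        return readings


@dataclass
class MasterNode:
    """Hypothetical master: the only radio in the chain; it owns the head of the serial link."""
    node_id: int
    reading: float
    chain: Optional[SlaveNode] = None

    def build_packet(self) -> dict:
        readings = [(self.node_id, self.reading)]
        if self.chain is not None:
            readings.extend(self.chain.collect())
        # One radio packet per poll, however many slaves are chained behind the master.
        return {"src": self.node_id, "readings": readings}


def make_chain(values: List[float], first_id: int) -> Optional[SlaveNode]:
    """Build slaves first_id, first_id + 1, ... linked in order down the chain."""
    head: Optional[SlaveNode] = None
    for offset in reversed(range(len(values))):
        head = SlaveNode(first_id + offset, values[offset], head)
    return head


master = MasterNode(0, 22.5, make_chain([23.1, 24.0, 25.2], first_id=1))
packet = master.build_packet()
print(packet["readings"])
```

Only the master contends for the 802.15.4 channel in this sketch, so adding slaves to a chain adds sensing points without adding wireless contenders.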

RACNet

RACNet is the network that collects data from the wireless Genomotes. Wireless sensor networking faces significant challenges in a data center environment, because the number of nodes in a communication neighborhood can be very large. In our experience, between 50% and 80% of the nodes can hear (and therefore interfere with) each other. RACNet uses multiple communication channels and a token-passing mechanism to avoid congestion in the network, and it achieves a data yield of more than 99.5% in production deployments.
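The sketch below illustrates why combining channels with token passing avoids congestion, under simplifying assumptions: nodes are split round-robin across channels, and on each channel a single token circulates in a fixed ring so that only its holder transmits in a given slot. The channel assignment, slot model, and function names are hypothetical; this is not RACNet's implementation.

```python
from collections import defaultdict
from typing import Dict, List


def assign_channels(node_ids: List[int], num_channels: int) -> Dict[int, List[int]]:
    """Spread nodes evenly over the available channels (hypothetical round-robin assignment)."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for i, node in enumerate(node_ids):
        groups[i % num_channels].append(node)
    return dict(groups)


def token_schedule(groups: Dict[int, List[int]], num_slots: int) -> List[Dict[int, int]]:
    """For each time slot, return {channel: node holding that channel's token}.

    The token visits the nodes of each channel in a fixed ring order, so exactly one
    node per channel transmits in a slot and same-channel nodes never collide, even
    when they can all hear each other.
    """
    return [
        {channel: ring[slot % len(ring)] for channel, ring in groups.items()}
        for slot in range(num_slots)
    ]


groups = assign_channels(list(range(10)), num_channels=3)
sched = token_schedule(groups, num_slots=4)
for slot, holders in enumerate(sched):
    print(slot, holders)
```

With 10 nodes on 3 channels, every node gets a turn within 4 slots, whereas a contention-based scheme would let mutually audible nodes on the same channel collide.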

Cypress Data Management

One direct consequence of taking a data-driven approach to data center management is the need to handle the massive amount of data generated by sensors (including soft sensors such as application performance counters) and other information sources. Cypress is a compressive data management framework for time series streams. It decomposes time series into multiple compressed feature streams, called trickles. Trickles can be further grouped to exploit spatial correlation for additional compression. Common queries such as select, trend, histogram, and correlation can be answered directly from the compressed trickles rather than by reconstructing the raw data.
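As a simplified illustration of answering queries from trickles, the sketch below splits a series into a coarse trend trickle (block means) and a sparse spike trickle (large residuals), then answers an average query from the trend trickle alone. The two-trickle split, block size, threshold, and synthetic temperature data are assumptions chosen for illustration; this is not Cypress's actual decomposition.

```python
import numpy as np


def decompose(series: np.ndarray, block: int, spike_threshold: float):
    """Split a series into two hypothetical trickles.

    trend: mean of each non-overlapping block of `block` samples (low-frequency part)
    spikes: sparse (index, residual) pairs where |series - trend| exceeds spike_threshold
    """
    n = len(series) - len(series) % block  # drop a ragged tail for simplicity
    trend = series[:n].reshape(-1, block).mean(axis=1)
    residual = series[:n] - np.repeat(trend, block)
    idx = np.flatnonzero(np.abs(residual) > spike_threshold)
    spikes = list(zip(idx.tolist(), residual[idx].tolist()))
    return trend, spikes


def window_mean(trend: np.ndarray, block: int, start: int, stop: int) -> float:
    """Mean over samples [start, stop), answered from the trend trickle alone.

    Exact for block-aligned windows, because residuals inside each block sum to zero.
    """
    if start % block or stop % block:
        raise ValueError("window must be aligned to block boundaries")
    return float(trend[start // block : stop // block].mean())


rng = np.random.default_rng(0)
t = np.arange(1440)  # one day of hypothetical 1-minute temperature samples
temps = 25 + 2 * np.sin(2 * np.pi * t / 1440) + rng.normal(0, 0.1, t.size)
temps[700] += 5.0  # inject one spike
trend, spikes = decompose(temps, block=60, spike_threshold=1.0)
print(len(trend), len(spikes))
print(window_mean(trend, 60, 0, 360), temps[:360].mean())
```

The trend trickle holds 24 values instead of 1440, yet block-aligned averages computed from it match the raw data up to floating-point error, while the spike that the trend smooths away is kept in the sparse trickle.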

Server Provisioning

Using the data collected from servers and their environments, we aim to improve data center operational efficiency through static and dynamic server provisioning. RackPacker is a data-driven static provisioning approach that takes advantage of stationary and statistical variations in workload to improve the utilization of provisioned power. AutoShift is a dynamic provisioning approach that migrates workload onto the minimum number of servers and turns off unneeded servers. We use a seasonal time series regression technique for load prediction and dynamically skew the load toward active servers (cf. our NSDI 2008 publication).
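A minimal sketch of prediction-driven dynamic provisioning, assuming a simple seasonal regression (the load at the same slot one period earlier plus the previous slot's load) and a fixed per-server capacity with a safety headroom. The regressors, the 5-minute slot, the headroom, the synthetic load, and all function names are hypothetical; the model in the NSDI 2008 publication may differ.

```python
import math

import numpy as np


def fit_seasonal_ar(load: np.ndarray, period: int) -> np.ndarray:
    """Least-squares fit of load[t] ~ a*load[t-period] + b*load[t-1] + c."""
    t = np.arange(period, len(load))
    X = np.column_stack([load[t - period], load[t - 1], np.ones(t.size)])
    coef, *_ = np.linalg.lstsq(X, load[t], rcond=None)
    return coef


def predict_next(load: np.ndarray, period: int, coef: np.ndarray) -> float:
    """One-step-ahead forecast for the slot right after the end of `load`."""
    n = len(load)
    return float(coef @ np.array([load[n - period], load[n - 1], 1.0]))


def servers_needed(predicted_load: float, per_server_capacity: float,
                   headroom: float = 0.2) -> int:
    """Smallest server count that carries the predicted load plus a safety margin."""
    return max(1, math.ceil(predicted_load * (1 + headroom) / per_server_capacity))


period = 24 * 12  # hypothetical 5-minute slots with daily seasonality
slots = np.arange(7 * period)
load = 1000 + 600 * np.sin(2 * np.pi * slots / period)  # synthetic request rate
coef = fit_seasonal_ar(load, period)
nxt = predict_next(load, period, coef)
print(round(nxt), servers_needed(nxt, per_server_capacity=90.0))
```

On this clean synthetic trace the fitted model leans almost entirely on the seasonal term; real traces are noisier, and the headroom parameter is one simple way to absorb prediction error.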

Joint Resource Control

The computing (cyber) systems and the physical systems in a data center have their own distinct dynamics. A user request must be served within milliseconds, while some facility components have a lifetime of more than 15 years. Coordinating resource control across these nine orders of magnitude in time scale is a great challenge. We envision a holistic control framework in which information and constraints are shared across the physical and computing boundaries to maximize potential energy savings. For example, load balancers can be designed to give more load to the servers that are easiest to cool. Workload (and thus power) spikes can be clipped to protect the UPS in a power-oversubscribed environment. A critical component of this vision is the joint modeling of the various dynamics (continuous time, discrete events, queueing, etc.) and a framework for analyzing their interactions.
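To illustrate those two examples in code, here is a sketch under a deliberately crude model: each server draws a fixed idle power plus a linear per-unit-load term, demand above what the UPS can power is clipped (deferred or shed), and the admitted load is assigned greedily to the servers that are cheapest to cool. The power constants, cooling costs, and function names are hypothetical; this is one possible policy, not the project's control framework.

```python
from typing import Dict, Tuple

IDLE_W = 150.0     # hypothetical idle power per server (watts)
W_PER_UNIT = 2.0   # hypothetical extra watts per unit of load


def clip_to_power_cap(demand: float, n_servers: int, ups_limit_w: float) -> Tuple[float, float]:
    """Admit only as much load as the UPS can power; return (admitted, clipped)."""
    max_load = max(0.0, (ups_limit_w - n_servers * IDLE_W) / W_PER_UNIT)
    admitted = min(demand, max_load)
    return admitted, demand - admitted


def cooling_aware_split(load: float, cooling_cost: Dict[str, float],
                        capacity: Dict[str, float]) -> Dict[str, float]:
    """Fill servers in order of increasing cooling cost, each up to its capacity."""
    allocation = {server: 0.0 for server in cooling_cost}
    remaining = load
    for server in sorted(cooling_cost, key=cooling_cost.get):
        if remaining <= 0:
            break
        take = min(capacity[server], remaining)
        allocation[server] = take
        remaining -= take
    if remaining > 0:
        raise ValueError("load exceeds total server capacity")
    return allocation


cooling_cost = {"s1": 0.8, "s2": 1.0, "s3": 1.4}  # hypothetical relative cooling cost
capacity = {"s1": 100.0, "s2": 100.0, "s3": 100.0}
admitted, clipped = clip_to_power_cap(demand=280.0, n_servers=3, ups_limit_w=950.0)
alloc = cooling_aware_split(admitted, cooling_cost, capacity)
print(admitted, clipped, alloc)
```

Here a 280-unit spike is clipped to 250 units so that the three servers stay within a 950 W UPS budget (3 × 150 W idle plus 250 × 2 W), and the server that is most expensive to cool receives only the remainder.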

Collaborators:

  • Microsoft Global Foundation Services: Mike Manos, Daniel Costello, Amaya Souarez, Patrick Yantz, Jeff O'Reilly, Kelly Roark, Sean James, Christian Belady, Phil Suver, Charl Kunzmann
  • Johns Hopkins University: Andreas Terzis
  • Harbin Institute of Technology (China): Qiang Wang
  • Intern Students: Gong Chen, Wenbo He, Mike Liang, Lakshmi Ganesh, Galen Reeves, Sorabh Gandhi