To invent & research technologies that make Microsoft’s networks, services and devices indispensable to the world
We strive to find a balance between long-term research and product impact. You may know our research from the papers we publish and the talks we give; here we share a few examples of the broad impact we have had on Microsoft products.
Our Big Hits
- Microsoft’s Wide Area Software Defined Network (implements a centralized traffic engineering system that has led to an improvement of the inter-DC WAN bandwidth utilization from 40% to 90%+, thus saving us millions of dollars annually)
- XBOX One Wireless Controller Protocol (a high-throughput, low-latency, energy-efficient Microsoft proprietary protocol between the XBOX One console and controllers. The controller has won accolades from the mainstream press as the best in the gaming market)
- Windows Azure Full-Bisection Bandwidth Datacenter Network (hailed as one of the most significant recent advances in computer science, our design led to an 80x improvement in dollars/Mbit/sec over previous designs. It is now the architecture of choice for all of Microsoft's Datacenters. It enabled technologies like the highly-scalable Windows Azure Flat Network Storage)
- Windows Azure Software Load Balancer (reduced cost by a factor of 15 [$60K versus $1M] by removing dependence on expensive hardware load balancers and improved cloud manageability. This fully configurable load balancer is used by both Azure and Bing)
- Windows Firmware TPM (enabled Microsoft to offer the widely used BitLocker and DirectAccess features and a new security feature, Virtual Smart Cards, in Windows RT and Windows Phone 8)
- XBOX Live Service Graphs (reduced performance diagnosis time in large-scale enterprise & Data Center networks from days to minutes, helping meet customer SLAs. XBox Live is the first Microsoft cloud service to use this network performance diagnosis technology)
- Windows Network Virtualization (enabled Windows to provide seamless connectivity between Microsoft's Data Centers and customers’ on-premises networks. Our design heavily influenced the Hyper-V network virtualization feature that ships in Windows Server 2012)
- Windows Virtual Wi-Fi (enabled Windows features like range extension, concurrent corporate and guest access, and Internet gateway using a single Wi-Fi card. Before becoming a product, our prototype was downloaded several hundred thousand times, making it one of the top three most popular MSR software downloads)
- TCP for Data Center Networking (improved the performance of Data Center networks without incurring the cost of expensive hardware switches. It is implemented in our core networking stack and deployed in our Data Center properties)
Cloud and Enterprise Division (Azure, Servers, Visual Studio,...)
Microsoft’s Wide Area Network - Architecture & Management Software (2013-14)
Increases the inter-DC WAN utilization from 40% to 90%+
AutoPilot’s Network-state Management Service (2014)
Dramatically simplifies network management app development and operations while maintaining network-wide SLA
- DC network management applications are complex and sophisticated, usually requiring years to design, develop and deploy. Running multiple such applications is challenging as they may conflict with one another and their collective actions can impair network operation. The Autopilot Statesman service that we developed simplifies application development by shielding apps from low-level interactions with devices. By offering a novel network state model, Statesman enables apps to operate independently while maintaining network-wide safety. Our technology has been deployed in all Autopilot-managed datacenters.
A multi-tenant coordination cloud service that uses open source ZooKeeper
- We worked closely with the AutoPilot team to build a multi-tenant layer on top of ZooKeeper that can be deployed and monitored by their software. The coordination service underneath runs multiple ensembles which can execute arbitrary requests from authenticated tenants. Although the ensembles are shared, to each user it seems as if they are running a dedicated ensemble. In future releases tenants will receive concrete compute resource tokens with full performance isolation.
End-to-end measurement and analysis tools that run automatically and vastly improve the accuracy of WAN fault localization.
- Localizing performance faults on WAN is difficult due to the plethora of routers and paths between DCs. We developed a system that accurately localizes faults to a specific router interface among thousands of candidate routers and paths. Compared to the state-of-the-art SNMP-based system, our system, called NetInsight, reduces the number of false positives by two orders of magnitude.
Network Virtualization for Hybrid Clouds (2010-12)
Enabled Windows to provide seamless connectivity between Microsoft's Data Centers and customers’ on-premises networks
Visual Studio Energy Modeler & Profiler (2012)
- Poorly written apps are one of the primary reasons for high energy drain on mobile devices. One reason for energy-inefficient apps is that app developers do not have sufficient tools to determine the energy impact of their apps. As part of the Wattson research project, we designed a Visual Studio plug-in that gives application developers visibility into their application’s energy consumption. Our paper, Empowering Developers to Estimate App Energy Consumption, published in ACM MobiSys 2012, describes the details of the system. This work formed the basis for the Energy Profiler that is part of the Visual Studio SDK for Windows Phone 8.
Delivers significant power and monetary savings for enterprise customers by enabling seamless remote access to sleeping desktop machines
- Enterprises can save significant amounts of power by letting idle desktop machines go to sleep (S3). This behavior has been the default setting in Windows desktop for many years. However, users and system administrators often override it because they may need to access the machines remotely. Current Wake-on-LAN techniques are cumbersome and do not always work on complex networks. We designed and built a wakeup service ("GreenUp") that works transparently: any time the user tries to remotely access a sleeping machine, it seamlessly wakes it up. This encourages users to save power. GreenUp scales up to large, complex corporate networks by using a novel distributed leader election algorithm.
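For context, conventional Wake-on-LAN relies on a "magic packet": six 0xFF bytes followed by the target's MAC address repeated 16 times, usually sent as a UDP broadcast. The sketch below shows only that baseline mechanism, not GreenUp itself. The function names, the default broadcast address and the port are illustrative assumptions. Because such broadcasts normally stay within one subnet, a sender outside the sleeping machine's subnet cannot easily reach it, which is one reason the technique struggles on complex networks.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN magic packet for a MAC address."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes, then the target MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str,
                      broadcast: str = "255.255.255.255",  # assumed address
                      port: int = 9) -> None:              # assumed port
    """Broadcast the magic packet on the local subnet over UDP."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Calling send_magic_packet("00:11:22:33:44:55") from a machine on the same subnet would attempt to wake the adapter with that (made-up) address.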
Fully Configurable Windows Azure Software Load Balancer (2011)
Reduced costs by a factor of 15 by removing dependence on hardware load balancers and improved cloud manageability as well
Full-Bisection Bandwidth Datacenter Networks (2009-10)
Servers in a datacenter are no longer limited by the network that connects them
TCP Analyzer (2010)
Enabled Microsoft Network Monitor to provide deeper insights into the workings of the Internet's Transmission Control Protocol
- We designed and built a plugin (called an “expert”) for Microsoft Network Monitor (NetMon) that helps analyze TCP traces. It uses several sophisticated heuristics to answer the key question “what limited the throughput of this TCP connection?” Apart from answering this question, the plugin also allows the user to visualize the connection in a number of different ways. Our plugin has been downloaded thousands of times, and is one of the most popular NetMon “experts”.
Operating Systems Engineering (Windows, Phone, Windows Embedded)
Data Sense Bandwidth Attribution Technology (2012)
- We built a technology that tracks cellular and Wi-Fi data consumption for individual apps and OS components, and displays it in an intuitive UI. A challenge we had to overcome was to accurately attribute data consumption across the numerous APIs and OS services that mobile apps use, and to do so in a lightweight manner. See the original technology demo video (Aug. 2011).
- Typing intelligence: We enabled Windows Phone to scale its typing intelligence solutions (hit-target resizing, spell correction, candidates-on-demand, etc.) to more than 50 languages, including new languages such as Latin Hindi.
- WordFlow Keyboard User Adaptation: We helped with a feature that allows the Windows Phone keyboard to adapt to the user's language and offer the user's own words as completions and next-word predictions.
- Keyboard Input Architecture: We helped revise the input architecture and created a new edit buffer to facilitate new features such as user adaptation, multilingual editing within the same message, and seamless multi-modal integration.
- We delivered the TPM driver and firmware TPM simulator. The development team used our simulator to develop & test important security features even before the vendors provided them the actual devices. A better description of our contribution is provided under "Windows".
- We delivered a customized version of our application analytics tool for performance testing and failure analysis of the top WP marketplace applications on various hardware and software SKUs. The development team runs this tool routinely on third-party apps, and they estimate that it has reduced the time spent on app failure analysis by a factor of 2 to 4. The first paper (AppInsight: Mobile App Performance) that describes our system appeared in OSDI 2012.
Increases battery lifetime in Windows 8 Tablets and Surface computers.
- Unlike laptops, the new class of mobile devices, such as tablets and Surface computers, needs to stay connected even when the screen is turned off. Keeping the Wi-Fi always on consumes significant energy. We designed a set of techniques that allows the Wi-Fi device to keep its connection even when the screen is turned off and the processor (and SoC) is in a low power state. We accomplished this by reducing the Wi-Fi power consumption to a few mW in standby state. Our techniques shipped in Windows 8.
Enabled Windows 8 to provide predictable networking to high-value cloud services
- We helped design and evaluate a mechanism to adaptively control the network usage of a Virtual Machine (VM), analogous to equivalent controls that existed for CPU and memory. Our design includes a feedback loop that ensures VMs receive network bandwidth that is proportional to their share and that spare bandwidth is allocated among VMs that need it. Our VM Rate Shaper shipped in Windows 8.
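The allocation goal described above (bandwidth proportional to each VM's share, with spare bandwidth handed to VMs that need it) corresponds to weighted max-min fairness. The sketch below computes that steady-state target directly. It is not the VM Rate Shaper's feedback loop, and the function name, the positive weights and the known per-VM demands are all assumptions made for illustration.

```python
def allocate_bandwidth(capacity, weights, demands):
    """Weighted max-min fair allocation of link capacity among VMs.

    weights: VM name -> positive share; demands: VM name -> wanted bandwidth.
    """
    alloc = {vm: 0.0 for vm in weights}
    active = {vm for vm in weights if demands[vm] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[vm] for vm in active)
        satisfied = set()
        spent = 0.0
        for vm in active:
            share = remaining * weights[vm] / total_weight
            need = demands[vm] - alloc[vm]
            if need <= share:
                # This VM wants no more than its share: grant it in full.
                alloc[vm] += need
                spent += need
                satisfied.add(vm)
        if not satisfied:
            # Every remaining VM is bottlenecked: split by weight and stop.
            for vm in active:
                alloc[vm] += remaining * weights[vm] / total_weight
            break
        # Capacity left unused by satisfied VMs is re-shared next round.
        remaining -= spent
        active -= satisfied
    return alloc
```

With a capacity of 100, weights {A: 1, B: 1, C: 2} and demands {A: 10, B: 50, C: 100}, VM A receives its full demand of 10 and the 90 left over is split between B and C in the ratio 1:2.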
Support for Security Features in Windows ARM (2011-12)
Enabled widely used security features (BitLocker, DirectAccess, Virtual SmartCards) on Windows RT and Windows Phone
Antenna Placement on Windows Tablet (2011-12)
Enabled best-in-class Wi-Fi network connectivity & performance
- We helped design the antenna placement on tablet devices. Since users hold tablets differently than laptops, existing antenna placement techniques (on the laptop’s screen) are not optimal for tablets. The placement of a user’s hand around the antenna might reduce the signal, and so can the orientation in which the tablet is held. We studied these phenomena in detail, both in the wild and in antenna chambers, and made recommendations to the Windows 8 team, which were incorporated in the final design of Windows 8 tablets.
Improved network performance in Data Centers with inexpensive switches
- We designed a new variant of TCP, called Datacenter TCP (DCTCP), to address congestion control issues in datacenter networks. DCTCP leverages Explicit Congestion Notification (ECN) and a simple multi-bit feedback mechanism at the host to reduce application latencies by overcoming network impairments such as queue buildup, buffer pressure, and incast. DCTCP was designed in close collaboration with the Windows Networking Team and it ships in the Windows 8 networking stack. The initial paper (Datacenter TCP) was published in SIGCOMM 2010.
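The multi-bit feedback can be illustrated with the sender-side update from the published DCTCP design: the sender keeps a running estimate alpha of the fraction of ECN-marked packets and cuts its window in proportion to it, rather than halving on every mark. The sketch below is a simplified model, not the Windows implementation. The class name, the gain g = 1/16, the one-segment-per-window growth and the per-window bookkeeping are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DctcpSender:
    """Toy model of a DCTCP sender's congestion window (in segments)."""
    cwnd: float = 10.0
    alpha: float = 0.0      # running estimate of the marked fraction
    g: float = 1.0 / 16     # estimation gain (assumed value)

    def on_window_acked(self, acked: int, marked: int) -> None:
        """Update state after one window: `marked` of `acked` carried ECN marks."""
        frac = marked / acked if acked else 0.0
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # Cut in proportion to congestion extent: cwnd <- cwnd * (1 - alpha / 2)
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            # No marks in this window: grow by one segment (simplified).
            self.cwnd += 1.0
```

When only a few packets are marked, alpha stays small and the window shrinks slightly; when every packet is marked for a sustained period, alpha approaches 1 and the cut approaches the conventional halving.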
Enabled Windows to connect to multiple WLANs simultaneously and offer range extension, concurrent corporate and guest connection, and Internet gateway features
- We designed a technique to virtualize wireless LAN (WLAN) cards. With it users can concurrently connect to multiple Wi-Fi networks using a single WLAN card, thus enabling several novel scenarios. The original paper (MultiNet: Connecting to Multiple IEEE 802.11 Networks Using a Single Wireless Card) was published in INFOCOM 2004. Our mini-port driver was downloaded by over a hundred thousand developers and was one of Microsoft Research’s most popular software downloads. Virtual Wi-Fi first shipped in Windows 7.
Enabled Windows to offer a better media streaming experience over Wi-Fi
- We developed a technique ("Probe-Gap") to estimate the capacity and the available bandwidth of network paths based on end-point measurements. The problem was particularly difficult for cable modems and Wi-Fi networks because they do not have point-to-point links. For example, they employ mechanisms such as token bucket rate regulation, non-FIFO scheduling, and multiple rates. The initial paper (Bandwidth Estimation in Broadband Access Networks) describing the problem was published in IMC 2004. Probe-Gap first shipped in Windows XP.
Elevated Wireless LAN connectivity to a premier consumer networking technology in Windows
- We designed the (first set of) NDIS WLAN OIDs for Windows 2000 and beyond. Prior to our contribution, Windows exposed a wireless LAN network adapter as an Ethernet network adapter. We enhanced the programming interface exposed by the Network Device Interface Specification (NDIS) and WinSock, which then enabled novel wireless-aware and mobility-aware applications.
Applications and Services Engineering (Bing, Skype, Office, Outlook,...)
NetPilot reduces the recovery time for common Data Center network failures from a few hours to tens of minutes
- Handling network failures is one of the most challenging tasks for Data Center operators. Unlike the conventional failure diagnosis and repair process, which requires significant human intervention, our NetPilot technology mitigates failures by deactivating or restarting the suspect network devices without needing to know the exact root causes. By enabling automatic failure mitigation, NetPilot dramatically reduces the recovery time for common network failures. Our initial paper (NetPilot: Automating Datacenter Network Failure Mitigation) describing this system was published in SIGCOMM 2012, and the technology shipped as part of the Bing Metallica Release in June 2012.
Faster load times lead to better user experience
- We performed a comprehensive analysis of the page load time in Bing to help uncover and explain strange effects such as Page Load Time (PLT) increase during off-peak hours and the impact of browser population and query type. These insights were used to develop a more precise and detailed alerting tool for PLT degradation. We documented some of our learnings in a SIGCOMM 2013 paper (A provider-side view of Web Search Response Time).
Our congestion prediction technology enables mitigation strategies that lead to better application performance
- In distributed file systems, when one storage node is congested, both read and write traffic can be steered to other replicas and other nodes with empty space. If the onset of such congestion is detected quickly, one can avoid needless queuing lags and improve overall store throughput. We helped design a predictor that uses current load and historical performance to predict the congestion status of storage nodes in Cosmos early. As a side benefit, this also serves as a measure of application-perceived capacity of the distributed storage layer and a monitor of current usage and hotspots. Our technology has been shipping in Cosmos clusters in Bing since December 2011.
This technology significantly reduced the response times of large jobs in our Data Centers
- Performant execution of data-parallel jobs needs good execution plans. Certain properties of the code, the data, and the interaction between them are crucial to generate these plans. Yet, these properties are difficult to estimate due to the highly distributed nature of these frameworks. We built the first reoptimizer for data-parallel jobs. It collects certain code and data properties by piggybacking on job execution and adapts execution plans by feeding these properties to a query optimizer. Our technology shipped in Bing's Cosmos clusters in December 2011, and it has significantly improved the response times of production jobs.
- Laggard tasks significantly prolong the completion time of data-parallel jobs. The causes for such outliers include run-time contention for processor, memory and other resources, disk failures, varying bandwidth and congestion along network paths, and imbalance in task workload. We built a system that monitors tasks and culls outliers by restarting tasks, placing tasks in a network-aware manner, and protecting the outputs of valuable tasks. The result was a significant improvement in job completion time. Our technology has been in production use across all of the Cosmos clusters in Bing since May 2010.
- To answer some of the basic questions about real workloads we built NetTrace, a network tracing service for large data center clusters. This service collects low-level (socket-level) networking logs and uploads them to Cosmos. Processing the data yields a much better understanding of the traffic patterns of operational workloads and also helps diagnose whether the network or the application is to blame for poor performance. NetTrace has shipped as an Autopilot service in Bing since 2009. We also shipped an analysis suite, and Bing continues to invest in NetTrace: in their June 2012 release, they expanded the types of data captured and lowered the resource consumption of the logger, and they are in the process of rolling it out as an always-on service.
- We conducted a series of experiments to measure the DNS query resolution time for Bing. Based on these measurements we came up with a set of improvements to our DNS query chain. Working closely with Bing, we deployed these improvements and in the process reduced the median DNS query time by more than half. More importantly, the 95th percentile was cut in half.
Enabled MSN web properties to better handle spikes in load (flash crowds)
- The MSN Publishing Platform serves billions of web pages a month. As they grew, scalability bottlenecks started to show up in their previous architecture. The Scalable and Consistent Caching (SCC) technology allowed them to solve these bottlenecks while maintaining the strict consistency semantics that content publishers expect, such as adding breaking news to a web page and all viewers seeing the updated content.
Enabled Live Mesh (now SkyDrive) backend cloud services to scale resiliently
- The Live Mesh data center services need to partition user data across a large number of servers. We designed and built the Partitioning and Recovery Service (PRS), which became their mechanism for doing this. The PRS made the development of the server code easier by providing a number of novel properties, such as strong consistency for soft state and guaranteed notifications to trigger state republishing. Microsoft's Live Mesh product won CNET's best technology innovation/achievement award.
The technology behind Microsoft Forefront risk analysis and mitigation planning feature
- We developed a technique that evaluates the risk to an organization based on patterns of user privilege and access. Attackers use accounts to compromise machines and use machines to compromise accounts. In the absence of explicit management to mitigate this risk, the growth in jumping from one machine to another via a compromised account is exponential. In a test, corroborated by our graph-based analysis, we found that over 70% of the machines investigated yielded at least one account that granted control over 100 other machines on the next hop. Our system performs static analysis and generates pre- and post-incident reports for planning risk mitigation strategies. We shipped our technology in the Access and Security Division’s (ASD) Forefront Product Suite.
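The machine-to-account-to-machine hopping described above can be modeled as reachability in a graph. The sketch below is a minimal illustration of that idea, not the Forefront analysis. The two input maps (which accounts are exposed on each machine, and which machines each account controls), the function names and the sample data are all hypothetical.

```python
from collections import deque


def next_hop(machine, accounts_on, controls):
    """Machines controllable via any account exposed on `machine`."""
    reach = set()
    for account in accounts_on.get(machine, ()):
        reach |= controls.get(account, set())
    reach.discard(machine)
    return reach


def blast_radius(start, accounts_on, controls):
    """All machines reachable from `start` by repeated account hopping (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        machine = queue.popleft()
        for nxt in next_hop(machine, accounts_on, controls):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


# Hypothetical sample data.
accounts_on = {"ws1": {"alice"}, "srv1": {"svc"}, "srv2": set()}
controls = {"alice": {"ws1", "srv1"}, "svc": {"srv1", "srv2"}}
```

Here next_hop("ws1", accounts_on, controls) contains only srv1, while blast_radius("ws1", accounts_on, controls) also includes srv2, which becomes reachable through the service account exposed on srv1.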
Devices and Studios (XBox, XBox Live, Hardware, Surface...)
XBOX One Wireless Controller Protocol (2013-14)
The wireless protocol between the XBOX One controllers and the console
Service Graphs for Large-Scale Network Diagnostics (2012)
Helps meet customer service level agreements (SLAs) by quickly identifying faltering components, reducing downtime from days to minutes