Distributed Systems (Silicon Valley)

Our research activities in the distributed system area range from protocols and algorithms to decentralized architectures and services. We address fundamental issues such as scalability, fault-tolerance, security, manageability, and mobility. We also explore the basic building blocks for distributed systems including networks and operating systems.

Current Projects
  • Pileus
    This research project in MSR SVC aims to answer the following question: Can we allow programmers to write cloud applications as though they are accessing centralized, strongly consistent data while at the same time allowing them to specify their consistency/availability/performance (CAP) requirements in terms of service-level agreements (SLAs) that are enforced by the cloud storage system at runtime?
    CORFU (Clusters of Raw Flash Units) is a cluster of network-attached flash exposed as a global shared log. CORFU has two primary goals. As a shared log, it exploits flash storage to alter the trade-off between performance and consistency, supporting applications such as databases at wire speed. As a distributed SSD, it slashes power consumption and infrastructure cost by eliminating storage servers.
  • Dandelion
    The goal of the Dandelion project is to provide simple programming abstractions and runtime supports for programming heterogeneous systems. Dandelion supports a uniform sequential programming model across a diverse array of execution contexts, including CPU, GPU, FPGA, and the cloud.
  • TimeStream: Large-Scale Real-Time Stream Processing in the Cloud
    TimeStream is a distributed system designed specifically for low-latency continuous processing of big streaming data on a large cluster of commodity machines. The unique characteristics of this emerging application domain have led to a significantly different design from the popular MapReduce-style batch data processing. In particular, we advocate a powerful new abstraction called resilient substitution that caters to the specific needs in this new computation model.
  • MadLINQ: Large-Scale Distributed Matrix Computation for the Cloud
    The computation core of many data-intensive applications can be best expressed as matrix computations. The MadLINQ project addresses the following two important research problems: the need for a highly scalable, efficient and fault-tolerant matrix computation system that is also easy to program, and the seamless integration of such specialized execution engines in a general purpose data-parallel computing system.
  • Naiad
    The Naiad project is an investigation of data-parallel dataflow computation, like Dryad and DryadLINQ, but with a focus on low-latency streaming and cyclic computations. Naiad introduces a new computational model, timely dataflow, which combines low-latency asynchronous message flow with lightweight coordination when required. These primitives allow the efficient implementation of many dataflow patterns, from bulk and streaming computation to iterative graph processing and machine learning.
  • Online-Service Security and Intelligence
    We take a data-driven approach to enhancing the security and other aspects of large-scale online services, including for instance email services, search engines, and advertising systems. We explore network-host properties (e.g., the use of proxy servers and dynamically assigned IP addresses), service-level topologies, and user social connectivity. We correlate all this fine-grained information with application-specific traces, for attack defense and for improving services.
  • Optimus
    Optimus is a framework for dynamically rewriting an execution plan graph in distributed data-parallel computing at runtime. It enables optimizations that require knowledge of the semantics of the computation, such as language customizations for domain-specific computations including matrix algebra. We address several problems arising in distributed execution including data skew, dynamic data re-partitioning, unbounded iterative computations, and fault tolerance.
  • Pasture
    Mobile user experiences are enriched by applications that support disconnected operations to provide better mobility, availability, and response time. However, offline data access is at odds with security when the user is not trusted, especially in the case of mobile devices, which must be assumed to be under the full control of the user. Pasture leverages commodity trusted hardware to provide secure offline data access by untrusted users.
  • Plexus: Interactive Multiplayer Games on Windows Phones and Tablets
    As smart phones and tablets are becoming popular gaming devices, there is a need to support real-time, interactive games more effectively. Plexus is a platform to support interactive multiplayer games on Windows phones. It leverages direct phone-to-phone connections and latency-masking techniques to provide smooth, power-efficient interactive game play.

Microsoft Product Engagement

Here are just a few examples of the way that our work has impacted Microsoft products:

  • Michael Isard worked with MSN Search on the design and implementation of their production search engine in areas including fault tolerance and automatic management; scalable query processing; and distributed data-mining.
  • Chandu Thekkath worked closely with the Hotmail product team in implementing scalable, easy to manage, and cost effective solutions for storing and retrieving e-mail.
  • Doug Terry worked with the WinFS team on the design of a novel and efficient replication protocol that is now the core of the Microsoft Sync Framework.