Eclipse

Overview

Reliable distributed systems are typically designed to be fault tolerant. Fault tolerance mechanisms provably ensure system correctness, but only with respect to a system model that specifies the type and extent of failures. Most of the time, the system operates in a normal state: either no components are faulty, or the failures of some components are tolerated. Unfortunately, systems can sometimes suffer excessive failures that go beyond what the system model allows. In these cases, fault tolerance mechanisms enter an abnormal state, are unable to mask failures, and cause "reliable" systems to fail.

In the Eclipse project, we are investigating a different approach to designing reliable distributed systems. We believe that massive failures are inevitable, albeit infrequent, in real-world, large-scale deployments, and that distributed system design principles must embrace this reality. We believe a system should degrade its service only when there are massive failures, should work efficiently with strict guarantees when there are few failures, should be able to transition between these operating regimes, and should be able to recognize which regime it is currently operating in. We advocate a new paradigm for building systems that augments fault tolerance with the properties of graceful degradation, self-awareness, and self-restoration.
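
The idea of distinct operating regimes can be illustrated with a small sketch. This is not the Eclipse design, only a minimal illustration under stated assumptions: the names `Regime`, `RegimeMonitor`, and `max_tolerated_failures` are hypothetical, and the system model is reduced to a single bound on the number of failed components.

```python
from dataclasses import dataclass
from enum import Enum


class Regime(Enum):
    NORMAL = "normal"      # failures stay within the system model
    DEGRADED = "degraded"  # failures exceed the system model


@dataclass
class RegimeMonitor:
    # Hypothetical parameter: the most failures the system model allows.
    max_tolerated_failures: int
    regime: Regime = Regime.NORMAL

    def classify(self, failed_components: int) -> Regime:
        """Map an observed failure count to an operating regime."""
        if failed_components <= self.max_tolerated_failures:
            return Regime.NORMAL
        return Regime.DEGRADED

    def observe(self, failed_components: int) -> bool:
        """Record the current regime; return True if a transition occurred."""
        new_regime = self.classify(failed_components)
        changed = new_regime is not self.regime
        self.regime = new_regime
        return changed
```

In this sketch, a monitor created with `RegimeMonitor(max_tolerated_failures=1)` stays in `Regime.NORMAL` while at most one component has failed, moves to `Regime.DEGRADED` when more fail, and returns to `Regime.NORMAL` once the failure count drops back within the bound. A real system would also need a trustworthy way to count failures, which this sketch assumes away.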

Project Members

 

Publications

  • A position paper outlining our initial ideas.
  • PowerPoint slides describing the evolution of Eclipse.

Associated Groups
 


©2008 Microsoft Corporation. All rights reserved.