The Eclipse project investigates how to build distributed systems that degrade gracefully in the presence of massive failures.
Overview
Reliable distributed systems are typically designed to be fault tolerant. Fault tolerance mechanisms provably ensure system correctness, but only with respect to a system model that specifies the type and extent of failures. Most of the time, the system is in a normal state: either no components are faulty, or the failures of some components are being tolerated. Unfortunately, systems sometimes suffer excessive failures that go beyond what the system model allows. In these cases, fault tolerance mechanisms enter an abnormal state, are unable to mask failures, and cause "reliable" systems to fail.
In the Eclipse project, we are investigating a different approach to designing reliable distributed systems. We believe that massive failures are inevitable, albeit infrequent, in real-world, large-scale deployments, and that distributed system design principles must embrace this reality. We believe a system should degrade its service only when there are massive failures, must work efficiently with strict guarantees when there are few failures, must be able to transition between these two operating regimes, and must be able to recognize which regime it is currently operating in. We advocate a new paradigm for building systems that augments fault tolerance with the properties of graceful degradation, self-awareness, and self-restoration.
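The idea of recognizing the current operating regime and moving between regimes can be illustrated with a small sketch. It is not the project's implementation: the names `Regime`, `classify_regime`, and `RegimeMonitor` are hypothetical, and the system model is assumed to reduce to a single bound on the number of failed components that can be tolerated.

```python
from enum import Enum


class Regime(Enum):
    NORMAL = "normal"      # failures are within the system model
    DEGRADED = "degraded"  # failures exceed the system model


def classify_regime(failed_components: int, tolerated_failures: int) -> Regime:
    """Return the regime implied by the current failure count.

    Assumption: the system model is a single bound on tolerable failures.
    """
    if failed_components < 0 or tolerated_failures < 0:
        raise ValueError("counts must be non-negative")
    if failed_components <= tolerated_failures:
        return Regime.NORMAL
    return Regime.DEGRADED


class RegimeMonitor:
    """Tracks the current regime and reports transitions (hypothetical helper)."""

    def __init__(self, tolerated_failures: int) -> None:
        if tolerated_failures < 0:
            raise ValueError("tolerated_failures must be non-negative")
        self.tolerated_failures = tolerated_failures
        self.regime = Regime.NORMAL

    def observe(self, failed_components: int) -> bool:
        """Update the regime from a new failure count.

        Returns True if the regime changed, False otherwise.
        """
        new_regime = classify_regime(failed_components, self.tolerated_failures)
        changed = new_regime is not self.regime
        self.regime = new_regime
        return changed


monitor = RegimeMonitor(tolerated_failures=1)
monitor.observe(0)  # stays in the normal regime
monitor.observe(3)  # massive failure: switch to degraded service
monitor.observe(1)  # failures back within the model: restore normal service
```

In this sketch, a transition into `Regime.DEGRADED` would be the point at which a system relaxes its guarantees rather than failing outright, and the transition back to `Regime.NORMAL` corresponds to self-restoration.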
