Modern critical computer applications often require continuous and correct operation despite the failure of critical system components. In a distributed system, fault-tolerance can be achieved by creating multiple copies of the functionality and placing them at different processes. The core constitutes a distributed protocol run among the processes whose goal is to provide the end user with the illusion of sequentially accessing a single correct copy. Not surprisingly, the efficiency of the distributed protocol used has a severe impact on the application performance. This thesis investigates the cost associated with implementing fundamental abstractions constituting the core of service replication in asynchronous distributed systems, namely (a) consensus and (b) the read/write register. The main question addressed by this thesis is how efficient implementations of these abstractions can be. The focus of the thesis lies on time complexity (or latency) as the main effciency metric, expressed as the number of communication steps carried out by the algorithm before it terminates. Besides latency, important cost factors are the resilience of an algorithm (i.e. the fraction of failures tolerated) and its message complexity (the number of messages exchanged). Consensus is perhaps the most fundamental problem in distributed computing. In the consensus problem, processes propose values and unanimously agree on one of the proposed values. In a purely asynchronous system, in which there is no upper bound on message transmission delays, consensus is impossible if a single process may crash. In practice however, systems are not asynchronous. They are timely in the common case and exhibit asynchronous behavior only occasionally. This observation has led to the concept of unreliable failure detectors to capture the synchrony conditions sufficient to solve consensus. This thesis studies the consensus problem in asynchronous systems in which processes may fail by crashing, enriched with unreliable failure detectors. It determines how quickly consensus can be solved in the common case, characterized by stable executions in which all failures have reliably been detected, settling important questions about consensus time complexity. Besides consensus, the read/write register abstraction is essential to sharing information in distributed systems, also referred to as distributed storage for its importance as a building-block in practical distributed storage and le systems. We study fault-tolerant read/write register implementations in which the data shared by a set of clients is replicated on a set of storage base objects. We consider robust storage implementations characterized by (a) wait-freedom (i.e. the fact the read/write operations invoked by correct clients always return) and (b) strong consistency guarantees despite a threshold of object failures. We allow for the most general type of object failure, Byzantine, without assuming authenticated data to limit the adversary. In this model, we determine the worst-case time complexity of accessing such a robust storage, closing several fundamental complexity gaps.
|Institution||Technischen Universität Darmstadt|