Wei Chen, Xuezheng Liu, Yunni Xia, and Lidong Zhou
A failure detection service is perfect if it eventually detects all failures and every detection correctly identifies a failure that has already occurred. Such a perfect failure detection service serves as a basic building block for many reliable distributed systems, for example in primary/backup replication protocols and distributed lock services. In this paper, we present a comprehensive study on applying quorum systems to the perfect failure detection service in order to enhance the fault tolerance of the service. We provide the precise system model and specification for a quorum-based failure detection service. We prove that stable storage is necessary if the server processes may crash and recover in the middle of the service. We present two novel algorithms that implement the failure detection service and have complementary characteristics. We further develop a set of quality-of-service (QoS) metrics for quorum-based perfect failure detection services, and apply probabilistic analysis to quantify the QoS metrics of the two algorithms.