Byzantine fault isolation in the Farsite distributed file system

Proceedings of the 5th International Workshop on Peer-to-Peer Systems (IPTPS)

In a peer-to-peer system of interacting Byzantine-fault-tolerant replicated-state-machine groups, as system scale increases, so does the probability that a group will manifest a fault. If no steps are taken to prevent faults from spreading among groups, a single fault can result in total system failure. To address this problem, we introduce Byzantine Fault Isolation (BFI), a technique that enables a distributed system to operate with application-defined partial correctness when some of its constituent groups are faulty. We quantify BFI’s benefit and describe its use in Farsite, a peer-to-peer file system designed to scale to 100,000 machines.
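The scaling claim in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes each machine is faulty independently with probability `p`, that every group has `3f + 1` replicas and is considered faulty once more than `f` of them are faulty, and that groups fail independently. The function names are hypothetical.

```python
from math import comb


def group_fault_probability(p: float, f: int) -> float:
    """Probability that a group of n = 3f + 1 replicas has more than f faulty members.

    Assumption (not from the paper): machines fail independently with probability p.
    """
    n = 3 * f + 1
    # Sum the binomial tail: exactly k faulty replicas, for k = f + 1 .. n.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(f + 1, n + 1))


def any_group_faulty(q: float, groups: int) -> float:
    """Probability that at least one of `groups` independent groups is faulty,
    given that each group is faulty with probability q."""
    return 1.0 - (1.0 - q) ** groups


if __name__ == "__main__":
    q = group_fault_probability(p=0.001, f=1)  # illustrative values only
    for groups in (10, 1_000, 100_000):
        print(groups, any_group_faulty(q, groups))
```

Under these assumptions, the per-group fault probability stays fixed while the probability that some group in the system is faulty grows with the number of groups, which is the situation fault isolation is meant to address.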