Subtleties in Tolerating Correlated Failures in Wide-area Storage Systems

Suman Nath; Haifeng Yu; Phillip B. Gibbons; Srinivasan Seshan

NSDI'06: Proceedings of the 3rd conference on Networked Systems Design & Implementation | January 2006

Published by USENIX Association

High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today’s wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using several real-world failure traces, we qualitatively answer four important questions regarding how to design systems to tolerate such failures. Based on our results, we identify a set of design principles that system builders can use to tolerate correlated failures. We show how these lessons can be effectively used by incorporating them into IRISSTORE, a distributed read-write storage layer that provides high availability. Our results using IRISSTORE on the PlanetLab over an 8-month period demonstrate its ability to withstand large correlated failures and meet preconfigured availability targets.