Subtleties in Tolerating Correlated Failures in Wide-area Storage Systems
- Suman Nath
- Haifeng Yu
- Phillip B. Gibbons
- Srinivasan Seshan
NSDI'06: Proceedings of the 3rd conference on Networked Systems Design & Implementation
Published by USENIX Association
High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today’s wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using several real-world failure traces, we qualitatively answer four important questions regarding how to design systems to tolerate such failures. Based on our results, we identify a set of design principles that system builders can use to tolerate correlated failures. We show how these lessons can be effectively used by incorporating them into IRISSTORE, a distributed read-write storage layer that provides high availability. Our results using IRISSTORE on the PlanetLab over an 8-month period demonstrate its ability to withstand large correlated failures and meet preconfigured availability targets.