Subtleties in Tolerating Correlated Failures in Wide-area Storage Systems

  • ,
  • Haifeng Yu
  • Phillip B. Gibbons
  • Srinivasan Seshan

NSDI'06: Proceedings of the 3rd conference on Networked Systems Design & Implementation

Published by USENIX Association

High availability is widely accepted as an explicit requirement for distributed storage systems. Tolerating correlated failures is a key issue in achieving high availability in today’s wide-area environments. This paper systematically revisits previously proposed techniques for addressing correlated failures. Using several real-world failure traces, we qualitatively answer four important questions regarding how to design systems to tolerate such failures. Based on our results, we identify a set of design principles that system builders can use to tolerate correlated failures. We show how these lessons can be effectively used by incorporating them into IRISSTORE, a distributed read-write storage layer that provides high availability. Our results using IRISSTORE on the PlanetLab over an 8-month period demonstrate its ability to withstand large correlated failures and meet preconfigured availability targets.