Moises Goldszmidt, Mihai Budiu, Yue Zhang, and Michael Pechuk
17 September 2009
To be economically feasible and to offer high levels of availability and performance, large-scale distributed systems depend on the automation of repair services. While there has been considerable work on mechanisms for such automated services, a framework for evaluating and optimizing the policies governing such mechanisms has been lacking. In this paper we propose one such framework and report on our initial experience in applying it to analyze and optimize the operation of a geo-distributed cloud storage system at Microsoft.
In The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware