Improving Existing Fault Recovery Policies

Guy Shani and Christopher Meek

Abstract

Automated recovery from failures is a key component in the management of large data centers. Such systems typically employ a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we explain how to use data gathered from the interactions of the hand-made controller with the system to create an optimized controller. We suggest learning an indefinite-horizon Partially Observable Markov Decision Process (POMDP), a model for decision making under uncertainty, and solving it using a point-based algorithm. We describe the complete process: data gathering, model learning, model checking procedures, and computing a policy. While our paper focuses on a specific domain, our method is applicable to other systems that use a hand-coded, imperfect controller.
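
As a rough illustration of the kind of model the abstract refers to, the sketch below shows the standard Bayesian belief update that any POMDP controller relies on. It is not taken from the paper: the two-state server model (healthy, faulty), the single "probe" action, and all probabilities are hypothetical values chosen only to make the example concrete.

```python
def belief_update(belief, action, observation, T, O):
    """Standard POMDP belief update.

    b'(s') is proportional to O[a][s'][o] * sum_s T[a][s][s'] * b(s).
    T[a][s][s2] is the transition probability, O[a][s2][o] the
    observation probability; both are plain nested lists here.
    """
    n = len(belief)
    unnormalized = [
        O[action][s2][observation]
        * sum(T[action][s][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in unnormalized]


# Hypothetical toy model: states 0 = healthy, 1 = faulty.
# Action 0 = "probe" (assumed not to change the server state).
T = [[[1.0, 0.0],
      [0.0, 1.0]]]
# Observations 0 = "ok", 1 = "alert" (assumed noisy monitor).
O = [[[0.9, 0.1],
      [0.2, 0.8]]]

prior = [0.5, 0.5]
posterior = belief_update(prior, action=0, observation=1, T=T, O=O)
```

With these assumed numbers, seeing an alert shifts the belief strongly toward the faulty state. A point-based solver would compute a policy over a sampled set of such belief vectors.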

Details

Publication type: Proceedings
Published in: Advances in Neural Information Processing Systems
Publisher: MIT Press