Structure from Failure

Ralf Herbrich, Thore Graepel, and Brendan Murphy

Abstract

We investigate the problem of learning the dependencies among servers in large networks based on failure patterns in their up-time behaviour. We model up-times in terms of exponential distributions whose inverse lifetime parameters λ may vary with the state of other servers. Based on a conjugate Gamma prior over inverse lifetimes, we identify the most likely network configuration given that any node has at most one parent. The method can be viewed as a special case of learning a continuous time Bayesian network. Our approach enables us to easily incorporate existing expert prior knowledge. Furthermore, our method enjoys advantages over a state-of-the-art rule-based approach. We validate the approach on synthetic data and apply it to five years of data for a set of over 500 servers at a server farm of a major Microsoft web site.
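
As a rough illustration of the Gamma-exponential conjugacy the abstract relies on, the following Python sketch updates a Gamma prior over an inverse lifetime λ and compares the marginal likelihood of a "no parent" model against a "one parent" model. It is not the authors' implementation: the function names, the prior values, and the toy durations are all hypothetical, and it simplifies by treating every up-time as fully observed within a single parent state.

```python
from math import lgamma, log

def gamma_posterior(alpha, beta, durations):
    """Posterior Gamma(shape, rate) over lambda after observing
    exponentially distributed durations, given a Gamma(alpha, beta) prior."""
    return alpha + len(durations), beta + sum(durations)

def log_evidence(alpha, beta, durations):
    """Log marginal likelihood of the durations with lambda integrated out:
    beta^alpha / Gamma(alpha) * Gamma(alpha + n) / (beta + T)^(alpha + n)."""
    n = len(durations)
    total = sum(durations)
    return (alpha * log(beta) - lgamma(alpha)
            + lgamma(alpha + n) - (alpha + n) * log(beta + total))

# Hypothetical up-times (arbitrary time units) of one server, split by the
# state of a candidate parent server. These numbers are made up.
up_while_parent_up = [90.0, 110.0, 100.0, 95.0]
up_while_parent_down = [1.0, 2.0, 1.5, 0.5]

# Assumed prior; expert knowledge would enter through these two values.
alpha0, beta0 = 1.0, 1.0

# "No parent": a single lambda explains all up-times.
independent = log_evidence(alpha0, beta0,
                           up_while_parent_up + up_while_parent_down)

# "One parent": a separate lambda for each state of the candidate parent.
dependent = (log_evidence(alpha0, beta0, up_while_parent_up)
             + log_evidence(alpha0, beta0, up_while_parent_down))

# Positive values favour the dependency on the candidate parent.
log_bayes_factor = dependent - independent
```

Under these assumptions, scoring each candidate parent this way and keeping the highest-scoring one per node would be one simple way to respect the at-most-one-parent restriction; the paper itself should be consulted for the actual procedure.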

Details

Publication type: Inproceedings
Published in: Second Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML07)
Publisher: USENIX