Model-Based Clustering for Online Crisis Identification in Distributed Computing

  • D. B. Woodard
  • Moises Goldszmidt

MSR-TR-2009-131

Distributed computing systems can suffer occasional catastrophic violations of performance goals; because these systems are so complex, manually diagnosing the cause of each crisis is prohibitively costly. Automatically recognizing the recurrence of a problem can lead to diagnosis of its cause and/or informed intervention. We frame this as an online clustering problem in which the labels (causes) of some of the previous crises may be known. We give an effective solution using model-based clustering based on a Dirichlet process mixture, in which the evolution of each crisis is modeled as a multivariate time series. We perform fully Bayesian inference on the clusters and give a method for efficient online computation. Such inferences allow for online expected-cost-minimizing decision making in the distributed computing context. We apply our methods to Microsoft’s Exchange Hosted Services.
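
To make the two ideas in the abstract concrete (assigning a new crisis to a known or new cluster, then choosing an intervention by expected cost), the following is a minimal illustrative sketch, not the authors' method. It assumes a Chinese restaurant process prior (the clustering prior induced by a Dirichlet process mixture) with a hypothetical concentration parameter `alpha`, and it assumes the log-likelihood of the new crisis under each existing cluster and under a fresh cluster has already been computed by some time-series model. The function names, the cost matrix, and all numbers are assumptions made for illustration only.

```python
import numpy as np

def crp_posterior(cluster_sizes, log_likelihoods, alpha):
    """Posterior over (existing clusters..., new cluster) for one new crisis.

    cluster_sizes   : number of past crises in each of the K existing clusters
    log_likelihoods : K+1 values; log-likelihood of the new crisis under each
                      existing cluster, with the last entry for a new cluster
    alpha           : hypothetical Dirichlet process concentration parameter
    """
    sizes = np.asarray(cluster_sizes, dtype=float)
    log_lik = np.asarray(log_likelihoods, dtype=float)
    if log_lik.shape[0] != sizes.shape[0] + 1:
        raise ValueError("need one log-likelihood per cluster plus one for a new cluster")
    # Unnormalized CRP prior: existing clusters weighted by size, new cluster by alpha.
    prior = np.append(sizes, alpha)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()          # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

def min_expected_cost_action(posterior, cost):
    """Pick the action with the lowest expected cost.

    cost[a, k] is the (assumed) cost of taking action a when the crisis
    truly belongs to cluster k (last column: a previously unseen cause).
    """
    expected = np.asarray(cost, dtype=float) @ posterior
    return int(np.argmin(expected)), expected

# Illustrative inputs: two known crisis types seen 3 and 1 times, equal likelihoods.
posterior = crp_posterior([3, 1], [0.0, 0.0, 0.0], alpha=1.0)
# Rows: fix for type 0, fix for type 1, generic fallback (all hypothetical).
cost = [[0, 10, 10],
        [10, 0, 10],
        [5, 5, 5]]
action, expected = min_expected_cost_action(posterior, cost)
```

With equal likelihoods the posterior reduces to the prior weights, so the larger cluster dominates and its specific fix has the lowest expected cost. In the report's setting the likelihood terms would come from the multivariate time-series model of each crisis, and the inference is fully Bayesian rather than a single plug-in calculation like this one.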