Dawn B. Woodard and Moises Goldszmidt
Distributed computing systems can suffer from occasional catastrophic violation of performance goals; due to the complexity of these systems, manual diagnosis of the cause of the crisis is prohibitive. Recognizing the recurrence of a problem automatically can lead to cause diagnosis and / or informed intervention. We frame this as an online clustering problem, where the labels (causes) of some of the previous crises may be known. We give an effective solution using model-based clustering based on a Dirichlet process mixture; the evolution of each crisis is modeled as a multivariate time series. We perform fully Bayesian inference on clusters, giving a method for efficient on- line computation. Such inferences allow for online expected-cost-minimizing decision making in the distributed computing context. We apply our methods to Microsoft’s Exchange Hosted Services.