Model-Based Clustering for Online Crisis Identification in Distributed Computing

Distributed computing systems can suffer from occasional catastrophic violations of performance goals; due to the complexity of these systems, manual diagnosis of the cause of a crisis is prohibitive. Automatically recognizing the recurrence of a problem can lead to cause diagnosis and/or informed intervention. We frame this as an online clustering problem, where the labels (causes) of some of the previous crises may be known. We give an effective solution using model-based clustering based on a Dirichlet process mixture; the evolution of each crisis is modeled as a multivariate time series. We perform fully Bayesian inference on clusters, giving a method for efficient online computation. Such inferences allow for online expected-cost-minimizing decision making in the distributed computing context. We apply our methods to Microsoft’s Exchange Hosted Services.
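
To make the pipeline described above concrete, here is a minimal Python sketch of one online step. It is not the report's implementation. It assumes a generic Chinese-restaurant-process prior (the standard prior implied by a Dirichlet process mixture) and takes the per-cluster log-likelihoods of the new crisis as given, whereas the report derives them from a multivariate time-series model. The function names, the concentration parameter `alpha`, the cost matrix, and all numbers are hypothetical.

```python
import numpy as np

def cluster_posterior(counts, alpha, loglik):
    """Posterior over cluster assignments for one new crisis.

    counts : number of past crises in each existing cluster (length K)
    alpha  : DP concentration parameter (weight of "a new, unseen cluster")
    loglik : log-likelihood of the new crisis under each existing cluster,
             followed by its log prior-predictive under a new cluster
             (length K + 1); assumed to be computed elsewhere
    """
    counts = np.asarray(counts, dtype=float)
    loglik = np.asarray(loglik, dtype=float)
    prior = np.append(counts, alpha)      # unnormalized CRP weights
    log_post = np.log(prior) + loglik
    log_post -= log_post.max()            # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def min_expected_cost_action(posterior, cost):
    """Pick the action with the lowest posterior expected cost.

    cost[a, k] is the (hypothetical) cost of taking action a when the
    crisis truly belongs to cluster k.
    """
    expected = np.asarray(cost, dtype=float) @ np.asarray(posterior, dtype=float)
    return int(np.argmin(expected)), expected

# Hypothetical example: two known crisis types plus "new type".
posterior = cluster_posterior(counts=[3, 1], alpha=1.0, loglik=[0.0, 0.0, 0.0])
cost = [[0, 10, 10],   # apply the known fix for type 0
        [10, 0, 10],   # apply the known fix for type 1
        [4, 4, 1]]     # escalate to a human operator
best_action, expected = min_expected_cost_action(posterior, cost)
```

With uninformative likelihoods the posterior reduces to the prior weights, and the sketch prefers escalation because no single known fix is probable enough to justify the cost of being wrong. A fully Bayesian treatment, as in the report, would also integrate over cluster parameters and past assignments rather than condition on fixed counts.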


Details

Type: TechReport
Number: MSR-TR-2009-131