Fingerprinting the datacenter: Automated classification of performance crises

Moises Goldszmidt

MSR-TR-2009-122 | July 2009

When a performance crisis occurs in a datacenter, rapid recovery requires quickly recognizing whether a similar incident occurred before, in which case a known remedy may apply, or whether the problem is new, in which case new troubleshooting is necessary. To address this issue we propose a new and efficient representation of the datacenter’s state, a fingerprint, that scales linearly with the number of performance metrics considered and is not affected by the number of machines. These fingerprints are generated online and then used as unique identifiers of the different types of performance crises, so that we can effectively recognize previous occurrences and retrieve repair actions. We evaluate our approach on a production datacenter with hundreds of machines running a 24×7 enterprise-class user-facing application, verifying each identification result with the operators of the datacenter and troubleshooting tickets. Our approach has 80% identification accuracy in the operations-online setting with time to detection below 10 minutes (our operators stated that identification even 30 minutes into a crisis is desirable), and offline identification accuracy in the high 90% range. To the best of our knowledge this is the first time such an approach has been applied to a large-scale production installation with such rigorous verification. We compare our approach with various alternatives for constructing a fingerprint, including an adaptation of the signatures work to the datacenter setting, and show that it is superior.