Fingerprinting the datacenter: Automated classification of performance crises

When a performance crisis occurs in a datacenter, rapid recovery

requires quickly recognizing whether a similar incident occurred

before, in which case a known remedy may apply, or whether the problem

is new, in which case new troubleshooting is necessary. To address

this issue we propose a new and efficient representation of the

datacenter's state, a \emph{fingerprint}, that scales linearly with

the number of performance metrics considered and it is not affected by

the number of machines. These fingerprints are generated online and

then used as unique identifiers of the different types of performance

crises so that we can effectively recognize previous occurrences and

retrieve repair actions. We evaluate our approach on a production

datacenter with hundreds of machines running a 24x7 enterprise-class

user-facing application, verifying each identification result with the

operators of the datacenter and trouble-shooting tickets. Our approach

has $80\%$ identification accuracy in the operations-online setting

with time to detection below 10 minutes (our operators stated that

even 30 minutes into the crises is desirable), and offline

identification on the order of high $90\%$. To the best of our

knowledge this is the first time such an approach has been applied to

a large-scale production installation with such rigorous

verification. We compare our approach and show it is superior to various

alternatives to the construction of a fingerprint including an

adaptation to the datacenter setting of the signatures work


PDF file


> Publications > Fingerprinting the datacenter: Automated classification of performance crises