Peter Bodik, Moises Goldszmidt, Armando Fox, and Hans Andersen
5 July 2009
When a performance crisis occurs in a datacenter, rapid recovery
requires quickly recognizing whether a similar incident occurred
before, in which case a known remedy may apply, or whether the problem
is new, in which case new troubleshooting is necessary. To address
this issue, we propose a new and efficient representation of the
datacenter's state, a \emph{fingerprint}, whose size scales linearly with
the number of performance metrics considered and is independent of
the number of machines. These fingerprints are generated online and
then used as unique identifiers of the different types of performance
crises so that we can effectively recognize previous occurrences and
retrieve repair actions. We evaluate our approach on a production
datacenter with hundreds of machines running a 24x7 enterprise-class
user-facing application, verifying each identification result with the
operators of the datacenter and against troubleshooting tickets. Our approach
achieves $80\%$ identification accuracy in the operations-online setting
with time to detection below 10 minutes (our operators stated that
identification even 30 minutes into a crisis is desirable), and offline
identification accuracy in the high $90\%$ range. To the best of our
knowledge this is the first time such an approach has been applied to
a large-scale production installation with such rigorous
verification. We compare our approach with several alternative ways of
constructing a fingerprint, including an adaptation of the signatures
work in~\cite{sosp05} to the datacenter setting, and show that it is
superior.
| Type | TechReport |
| Number | MSR-TR-2009-122 |