Armando Fox, Emre Kıcıman, David Patterson, Michael Jordan, and Randy Katz
October 2004
Complex distributed Internet services form the basis not
only of e-commerce but increasingly of mission-critical network-based
applications. What is new is that the workload and internal
architecture of three-tier enterprise applications present
the opportunity for a new approach to keeping them running
in the face of many common recoverable failures. The core
of the approach is anomaly detection and localization based
on statistical machine learning techniques. Unlike previous
approaches, we propose anomaly detection and pattern mining
not only for operational statistics such as mean response
time, but also for structural behaviors of the system: what
parts of the system, in what combinations, are being exercised
in response to different kinds of external stimuli. In
addition, rather than building baseline models a priori, we
extract them by observing the behavior of the system over
a short period of time during normal operation. We explain
the necessary underlying assumptions and why they can be
realized by systems research, report on some early successes
using the approach, describe benefits of the approach that
make it competitive as a path toward self-managing systems,
and outline some research challenges. Our hope is
that this approach will enable "new science" in the design of
self-managing systems by allowing the rapid and widespread
application of statistical learning theory (SLT) techniques to
problems of system dependability.
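
As a rough illustration of the idea of learning a structural baseline from observed normal operation, rather than specifying it a priori, the following minimal sketch counts component-to-component transitions in normal request paths and scores a new path by how many of its transitions were rare or unseen. It is not the authors' method: the component names, the functions `learn_baseline` and `anomaly_score`, and the `rare` threshold are all hypothetical and chosen only for the example.

```python
from collections import Counter

def learn_baseline(normal_paths):
    """Estimate the relative frequency of each caller->callee transition
    seen while the system is assumed to be operating normally."""
    counts = Counter()
    for path in normal_paths:
        counts.update(zip(path, path[1:]))
    total = sum(counts.values())
    return {edge: n / total for edge, n in counts.items()}

def anomaly_score(path, baseline, rare=0.05):
    """Return the fraction of transitions in `path` whose baseline
    frequency is below `rare` (an illustrative threshold, not a tuned one)."""
    edges = list(zip(path, path[1:]))
    if not edges:
        return 0.0
    odd = sum(1 for e in edges if baseline.get(e, 0.0) < rare)
    return odd / len(edges)

# Hypothetical request paths through a three-tier application.
normal = [
    ["web", "app", "db"],
    ["web", "app", "cache"],
    ["web", "app", "db"],
    ["web", "app", "db"],
]
baseline = learn_baseline(normal)

ok_score = anomaly_score(["web", "app", "db"], baseline)
bad_score = anomaly_score(["web", "db", "app"], baseline)
print(ok_score, bad_score)
```

A path that uses only familiar transitions scores low, while one that exercises components in a combination never observed during the baseline period scores high. A real detector would likely use a proper statistical test and far more observations than this toy data.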
In 2004 Workshop on Self-Managed Systems (WOSS'04) in conjunction with ACM SIGSOFT FSE-12
| Type | Proceedings |