Combining Statistical Monitoring and Predictable Recovery for Self-Management

Complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. What is new is that the workload and internal architecture of three-tier enterprise applications present the opportunity for a new approach to keeping them running in the face of many common recoverable failures. The core of the approach is anomaly detection and localization based on statistical machine learning techniques. Unlike previous approaches, we propose anomaly detection and pattern mining not only for operational statistics such as mean response time, but also for structural behaviors of the system: what parts of the system, in what combinations, are being exercised in response to different kinds of external stimuli. In addition, rather than building baseline models a priori, we extract them by observing the behavior of the system over a short period of time during normal operation. We explain the necessary underlying assumptions and why they can be realized by systems research, report on some early successes using the approach, describe benefits of the approach that make it competitive as a path toward self-managing systems, and outline some research challenges. Our hope is that this approach will enable "new science" in the design of self-managing systems by allowing the rapid and widespread application of statistical learning theory (SLT) techniques to problems of system dependability.


In 2004 Workshop on Self-Managed Systems (WOSS'04), in conjunction with ACM SIGSOFT FSE-12

