Using Statistical Monitoring to Detect Failures in Internet Services

Today, we are increasingly building large and complex systems whose workings we do not understand, and this lack of understanding translates into systems that are hard to manage and have low availability. The problem is a disconnect between our high-level goals for a system and the low-level visibility into it, and control over it, that we actually have. To keep a system running, operators must wade through the minutiae of its low-level architecture and implementation. This is not unlike driving a car while looking through a magnifying glass: the driver is both overwhelmed by the details immediately in front of him and unable to focus on more important items on the horizon.

A concrete example of this problem is fault detection in Internet services. Current surveys find that over 60% of otherwise well-managed Internet services exhibit user-visible errors or failures. According to conversations with industry practitioners, a significant portion of the time to recover from these failures (up to 75%) is spent simply realizing that a service has failed. The challenge is that these Internet services are complex, poorly understood systems, and the correct operation of the application is defined only at the human layer (“I know a problem when I see it”).

In this talk, I present my work on statistical monitoring, which combines systems research with statistical analysis and machine learning tools to transform low-level behaviors that are easy to observe into high-level indicators of failure. Unlike other techniques for detecting high-level failures, this approach does not require a priori application-specific information; it thus needs little maintenance as a service evolves over time. I will discuss results from testbed experiments, in which detection “miss rates” are reduced by 30-70%, as well as early experiences analyzing failures at a large Internet service.

Speaker Details

Emre Kiciman is a graduating Ph.D. student in the Computer Science Department at Stanford University. He received his M.S. in Computer Science from Stanford in 2002, and his B.S. in Electrical Engineering and Computer Science from the University of California at Berkeley. Emre’s research interests lie broadly in the area of system dependability and the application of machine learning techniques to improve system manageability. He can be reached via e-mail at emrek@cs.stanford.edu.

Date:
Speakers: Emre Kiciman
Affiliation: Stanford