Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Victor Bahl
17 August 2009
By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. The diagnostic system should be able to diagnose application-specific abnormalities, such as error codes. It should also be able to identify culprits at a fine granularity, such as a process or a configuration element. We build a system, called NetMedic, that enables detailed diagnosis by harnessing the rich information exposed by modern operating systems and applications. The primary challenge in building this system is sifting through this plethora of information to infer when one component in the network (e.g., a process) might be impacting another. Our solution is based on a simple and intuitive technique that uses the joint behavior of two components in the past to estimate the likelihood that they are impacting one another in the present. We show that our deployed prototype is very effective at diagnosing faults that we inject in a live environment. The faulty component is correctly identified as the top culprit in 80% of the cases, and it is almost always in the list of top five culprits.
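The idea of using past joint behavior to estimate present impact can be illustrated with a small sketch. This is not the NetMedic implementation. It assumes that each component's state is a single number sampled at the same time steps, and the names `similarity`, `impact_likelihood`, `scale`, and `k` are hypothetical choices made for illustration. The sketch finds the past moments when the source component looked most like it does now, then checks whether the destination component at those moments also looked like it does now.

```python
def similarity(x, y, scale):
    """Similarity in [0, 1] between two scalar states; 1 means identical.

    `scale` is an assumed normalizer: differences of `scale` or more count as 0.
    """
    return max(0.0, 1.0 - abs(x - y) / scale)


def impact_likelihood(src_history, dst_history, src_now, dst_now, scale=1.0, k=3):
    """Rough estimate of how likely the source is impacting the destination now.

    src_history and dst_history are equal-length lists of past states,
    aligned by time step. This is an illustrative heuristic, not the paper's
    exact formula.
    """
    if len(src_history) != len(dst_history) or not src_history:
        raise ValueError("histories must be non-empty and of equal length")

    # Rank past time steps by how closely the source's state matched its current state.
    ranked = sorted(
        range(len(src_history)),
        key=lambda t: similarity(src_history[t], src_now, scale),
        reverse=True,
    )
    top = ranked[:k]

    # If the destination at those times resembled its current state, the
    # source's state plausibly explains the destination's state.
    return sum(similarity(dst_history[t], dst_now, scale) for t in top) / len(top)


if __name__ == "__main__":
    # Destination tracked the source in the past, and does so now.
    print(impact_likelihood([0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], 1, 1))
    # Destination behaved the opposite way in the past.
    print(impact_likelihood([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0], 1, 1))
```

A real system would compare many state variables per component and would need to cope with short or unrepresentative histories; the scalar state and the fixed `k` here are simplifications.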
Publisher: Association for Computing Machinery, Inc.
Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or email@example.com. The definitive version of this paper can be found at ACM’s Digital Library --http://www.acm.org/dl/.