Detailed Diagnosis in Enterprise Networks

SIGCOMM

Published by Association for Computing Machinery, Inc.

By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. The diagnostic system should be able to diagnose application-specific abnormalities, such as error codes. It should also be able to identify culprits at a fine granularity, such as a process or a configuration element. We build a system, called NetMedic, that enables detailed diagnosis by harnessing the rich information exposed by modern operating systems and applications. The primary challenge in building this system is sifting through this plethora of information to infer when one component in the network (e.g., a process) might be impacting another. Our solution is based on a simple and intuitive technique that uses the joint behavior of two components in the past to estimate the likelihood of them impacting one another in the present. We show that our deployed prototype is very effective at diagnosing faults that we inject in a live environment. The faulty component is correctly identified as the top culprit in 80% of the cases, and it is almost always among the top five culprits.
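The idea of using past joint behavior to estimate present impact can be illustrated with a small sketch. This is not NetMedic's implementation; it is a minimal illustration under stated assumptions: each component's state is a short numeric vector sampled at regular times, similarity is derived from Euclidean distance, and the function name `impact_likelihood` and the parameter `k` are hypothetical. The sketch finds the past moments when the source component looked most like it does now, then checks whether the destination component at those moments also looked like it does now.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length state vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def impact_likelihood(src_history, dst_history, src_now, dst_now, k=3):
    """Illustrative score in (0, 1]; higher suggests the source may be
    impacting the destination. Hypothetical helper, not from the paper.

    src_history and dst_history are lists of state vectors, where index t
    in both lists refers to the same past time step.
    """
    if not src_history or len(src_history) != len(dst_history):
        raise ValueError("histories must be non-empty and time-aligned")

    # Rank past time steps by how closely the source's state then
    # resembles the source's state now.
    ranked = sorted(
        range(len(src_history)),
        key=lambda t: distance(src_history[t], src_now),
    )
    nearest = ranked[:k]

    # At those time steps, measure how closely the destination's state
    # resembled its current state (1.0 means identical).
    similarities = [
        1.0 / (1.0 + distance(dst_history[t], dst_now)) for t in nearest
    ]
    return sum(similarities) / len(similarities)
```

A score near 1 means that whenever the source was in a state like its current one, the destination was also in a state like its current one, which is consistent with (but does not prove) the source impacting the destination. A real system would also need to weight state variables, handle sparse history, and combine scores across a dependency graph; none of that is attempted here.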