Radhika Niranjan Mysore, Ratul Mahajan, Amin Vahdat, and George Varghese
1 July 2013
Researchers have proposed many algorithms for localizing faults in networked systems, but it is unclear which algorithm is best suited for a given network; the performance of these algorithms differs markedly for different networks. We develop a framework that can explain these differences by anatomizing the algorithms into their basic choices and analyzing these choices with respect to six defining characteristics of real networks. Our analysis also reveals that no existing algorithm simultaneously provides good localization accuracy and low computational overhead. Based on our insights, we develop a new algorithm called Gestalt. To perform well across a range of networks, Gestalt combines the good choices of existing algorithms and with a new method to explore the space of possible faults in a way that is both low overhead and robust to noise. We apply it to three real, diverse networks: an email network, a peer-to-peer messaging system, and an ISP network. In each case, Gestalt has either significantly higher localization accuracy or an order of magnitude faster running time. For example, when applied to Lync, Gestalt localizes faults with the same accuracy as Sherlock, while reducing fault localization time from days to 23s on a single system.