Measuring and Troubleshooting Large Operational Multipath Networks with Gray Box Testing

  • Hongyi Zeng
  • Ratul Mahajan
  • Nick McKeown
  • George Varghese
  • Lihua Yuan
  • Ming Zhang

MSR-TR-2015-55

Troubleshooting large operational networks is extremely difficult due to the extensive use of multipath routing. We present NetSonar, a system that localizes performance problems in such networks. It uses planned tomography, whose input comes from a novel test technique that maximizes component coverage while minimizing probing overhead. Earlier techniques are either white box (assuming complete knowledge of the network’s forwarding state) or black box (assuming no knowledge). We argue that the former is infeasible in large networks and the latter is inefficient. We instead use a gray box technique that needs only coarse forwarding information (e.g., multipath configuration without knowledge of router-internal hash functions). NetSonar deals with nondeterminism in multipath by computing probabilistic path covers, and localizes faults accurately with minimal test overhead via diagnosable link covers. We describe our experience deploying NetSonar in a global inter-datacenter network. In a one-month period, NetSonar triggered 66 alerts, of which 56 were independently verified.
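To give a feel for what a probabilistic path cover involves, the sketch below estimates how many probes are needed to exercise every path in a multipath group when the router's hash function is unknown. It is not NetSonar's algorithm, which the abstract does not specify. It assumes that each probe (for example, one with a randomized header field) is hashed independently and uniformly onto one of the equal-cost paths, and the function names `coverage_probability` and `probes_needed` are hypothetical.

```python
from math import comb


def coverage_probability(num_paths: int, num_probes: int) -> float:
    """Probability that num_probes probes hit every one of num_paths paths.

    Assumption (not from the paper): each probe lands on a path chosen
    independently and uniformly at random. The value follows from
    inclusion-exclusion over the set of paths that are missed.
    """
    if num_paths < 1 or num_probes < 1:
        raise ValueError("num_paths and num_probes must be at least 1")
    k = num_paths
    return sum(
        (-1) ** i * comb(k, i) * ((k - i) / k) ** num_probes
        for i in range(k + 1)
    )


def probes_needed(num_paths: int, target: float) -> int:
    """Smallest probe count whose coverage probability reaches target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    n = num_paths  # fewer probes than paths can never cover all paths
    while coverage_probability(num_paths, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for paths in (2, 4, 8):
        print(paths, "paths:", probes_needed(paths, 0.99), "probes for 99% coverage")
```

Under these assumptions, even a two-way split needs 8 probes to cover both paths with 99% probability, which illustrates why choosing probes carefully matters for keeping overhead low. The alternating sum is numerically adequate for the small group sizes typical of multipath configurations; very large groups would call for a more stable formulation.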