Fault Management Using the CONMan Abstraction

Published by IEEE

Publication

Fault management in networks is difficult. We argue that a major contributor to the difficulty of debugging network faults is the sheer volume of semantically anemic details exposed by protocols. Unlike past approaches that try to cope with the deluge of information exposed, in this paper we explore how to reduce and structure the management information exposed by data-plane protocols and devices to make them more amenable to fault management. To this effect, we delineate two conditions that the management interface of data-plane protocols should satisfy: it should provide a structured description of protocol reality and it should support what we call a “conservation of bytes” invariant. Based on this, we propose an architecture wherein dataplane protocols expose management information satisfying these conditions. This allows management applications to detect, localize and (possibly) resolve faults in a structured fashion. We discuss the detection of a representative set of real-world faults to illustrate our approach. We implemented these fault management features into three protocols and built a management application that uses the features to debug faults. Apart from serving as a proof of concept, this exercise indicates that our proposal does indeed simplify debugging of a large fraction of network faults.