Scalability vs. Fault Tolerance in Aspen Trees

  • Meg Walraed-Sullivan ,
  • Keith Marzullo ,
  • Amin Vahdat

MSR-TR-2013-21 |

Fault recovery is a key issue in modern data centers. In a fat tree topology, a single link failure can disconnect a set of end hosts from the rest of the network until updated routing information propagates to every network element in the topology. The time for re-convergence can be substantial, leaving hosts disconnected for long periods of time and significantly reducing the overall availability of the data center. Moreover, the message overhead of sending updated routing information to the entire topology may be unacceptable at scale. We present techniques to modify hierarchical data center topologies to enable switches to react to failures locally, thus reducing both the convergence time and the control overhead of failure recovery. We find that for a given network size, decreasing the convergence time for a topology results in a decrease to the topology’s scalability (e.g. the number of hosts supported). On the other hand, for a fixed host count, a reduction in convergence time comes at the cost of additional network elements and links. We explore the tradeoffs between fault tolerance, scalability and network size, and we propose a range of modified multi-rooted tree topologies that provide significantly reduced convergence time while retaining much of the traditionally-defined fat tree’s scalability.