Probe Scheduling for Efficient Detection of Silent Failures

  • Edith Cohen ,
  • Avinatan Hassidim ,
  • Haim Kaplan ,
  • Yishay Mansour ,
  • Danny Raz ,
  • Yoav Tzur

MSR-TR-2014-113 |

This is a full version with proofs of a Performance 2014 paper

Most discovery systems for silent failures work in two phases: a continuous monitoring phase that detects presence of failures through probe packets and a localization phase that pinpoints the faultyelement(s). We focus on the monitoring phase, where the goal is to balance the probing overhead with the cost associated with longer failure detection times.

We formulate a general model for the underlying fundamental subset-test scheduling problem. We unify the treatment of schedulers and cost objectives and make several contributions: We propose Memoryless schedules — a natural subclass of stochastic schedules which is simple and suitable for distributed deployment. We show that the optimal memoryless schedulers can be efficiently computed by convex programs (for SUM objectives, which minimize average detection time) or linear programs (for MAX objectives, which minimize worst-case detection time), and surprisingly perhaps, are guaranteed to have expected detection times that are not too far off the (NP hard) stochastic optima. We study Deterministic schedules, which provide a guaranteed bound on the maximum (rather than expected) cost of undetected faults, but like general stochastic schedules, are NP hard to optimize. We develop novel efficient deterministic schedulers with provable approximation ratios.

Finally, we conduct an experimental study, simulating our schedulers on real networks topologies, demonstrates a significant performance gains of the new memoryless and deterministic schedulers
over previous approaches.