Scheduling Message Processing for Reducing Rollback Propagation

  • Yi-Min Wang
  • W. Kent Fuchs

Published by Institute of Electrical and Electronics Engineers, Inc.

Traditional checkpointing and rollback recovery techniques for parallel systems have typically assumed the communication pattern is specified by program behavior. In this paper we exploit the property that the communication pattern can often be changed at runtime without affecting program correctness. A scheduling algorithm for message processing and its implementation for reducing rollback propagation are described. The algorithm incorporates a user-transparent prioritized scheme based upon the run-time communication and checkpointing history. Communication trace-driven simulation for several parallel programs written in the Chare Kernel language demonstrates that the probability of rollback propagation can be reduced at the cost of slight additional performance degradation.
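As a rough illustration of the kind of prioritized message scheduling the abstract describes, the following minimal sketch reorders pending messages using checkpoint history. It is not the authors' algorithm: the priority rule (prefer messages whose sender has already checkpointed after sending them, so that processing them should not add a new rollback dependency), the class `MessageScheduler`, and all field and method names are assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Message:
    sender: str
    send_interval: int  # sender's checkpoint interval index at send time (assumed bookkeeping)
    payload: str


class MessageScheduler:
    """Hypothetical scheduler: picks pending messages in a rollback-aware order."""

    def __init__(self):
        self.current_interval = {}  # process name -> index of its current checkpoint interval
        self.pending = []           # list of (arrival_number, Message)
        self._arrival = count()

    def record_checkpoint(self, process):
        # A new checkpoint closes the process's current interval and opens the next one.
        self.current_interval[process] = self.current_interval.get(process, 0) + 1

    def enqueue(self, msg):
        self.pending.append((next(self._arrival), msg))

    def _creates_dependency(self, msg):
        # Assumed rule: a message sent in the sender's still-open interval could be
        # undone if the sender rolls back, so processing it adds a dependency.
        return msg.send_interval >= self.current_interval.get(msg.sender, 0)

    def next_message(self):
        if not self.pending:
            return None
        # Priority is evaluated at dequeue time because checkpoint history keeps changing.
        # Dependency-free messages come first; ties are broken by arrival order.
        best = min(self.pending, key=lambda e: (self._creates_dependency(e[1]), e[0]))
        self.pending.remove(best)
        return best[1]


# Usage: "b0" arrives second, but B checkpoints after sending it, so it is processed first.
scheduler = MessageScheduler()
scheduler.enqueue(Message("A", 0, "a0"))
scheduler.enqueue(Message("B", 0, "b0"))
scheduler.record_checkpoint("B")
order = [scheduler.next_message().payload for _ in range(2)]
print(order)  # ['b0', 'a0']
```

The sketch relies on the property stated in the abstract, that processing order can often be changed at runtime without affecting correctness. A linear scan is used instead of a heap because each message's priority can change whenever a new checkpoint is recorded.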