A Checkpoint Protocol for an Entry Consistent Shared Memory System

  • Nuno Neves ,
  • Miguel Castro ,
  • Paulo Guedes

Symposium on Principles of Distributed Computing (PODC'94) |

Published by ACM

Workstation clusters are becoming an interesting alternative to dedicated multiprocessors. In this environment, the probability of a failure, during an application’s execution, increases with the execution time and the number of workstations used. If no provision is made for handling failures, it is unlikely that long running applications will terminate successfully. One solution to this problem is process checkpointing. This paper presents a checkpoint protocol for a multithreaded distributed shared memory system based on the entry consistency memory model. The protocol allows transparent recovery from single node failures and, in some cases, from multiple node failures. A simple mechanism is used to determine if the system can be brought to a consistent state in the event of multiple machine crashes. The protocol keeps a distributed log of shared data accesses in the volatile memory of the processes, taking advantage of the independent failure characteristics of workstation clusters. Periodically, or whenever the log reaches a highwater mark, each process checkpointsitsstate, independently from the others. The protocol needs no extra messages during the failure-free period, since all checkpoint control information is piggybacked on the memory coherence protocol messages.