
CLUSTER
2004
IEEE

Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

Fault tolerance is a major concern for critical high-performance applications using the MPI library. Several protocols provide automatic and transparent fault detection and recovery for message-passing systems, with differing impact on application performance and differing capacity to tolerate a high fault rate. In a recent paper, we demonstrated that the main differences between pessimistic sender-based message logging and coordinated checkpointing are 1) the communication latency and 2) the performance penalty in case of faults. Pessimistic message logging increases latency because of additional blocking control messages. When faults occur at a high rate, coordinated checkpointing incurs a higher performance penalty than message logging because it places more stress on the checkpoint server. In this paper we extend this study to improved versions of the message logging and coordinated checkpointing protocols, which respectively reduce the latency overhead of pessimistic message logging a...
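To make the latency argument concrete, here is a minimal toy sketch of pessimistic sender-based message logging. It is an illustration under stated assumptions, not the protocol implementation from the paper: the names `EventLogger`, `Process`, `replay_for`, and `control_messages` are hypothetical, and the blocking round trip to stable storage is modeled as a plain synchronous call.

```python
from dataclasses import dataclass, field


@dataclass
class EventLogger:
    """Hypothetical stable store for reception events."""
    events: list = field(default_factory=list)

    def log(self, event):
        # A real system would pay a network round trip here;
        # this toy version just appends and acknowledges.
        self.events.append(event)
        return True


@dataclass
class Process:
    rank: int
    logger: EventLogger
    sent_log: list = field(default_factory=list)   # payload copies kept by the sender
    delivered: list = field(default_factory=list)  # payloads handed to the application
    control_messages: int = 0                      # event notifications plus acks

    def send(self, dest, payload):
        # Sender-based: the sender keeps the payload so it can replay it later.
        self.sent_log.append((dest.rank, payload))
        dest.receive(self.rank, payload)

    def receive(self, src_rank, payload):
        # Pessimistic: the reception event must be safely logged
        # (and acknowledged) before the process moves on.
        event = (src_rank, self.rank, len(self.delivered))
        acked = self.logger.log(event)
        self.control_messages += 2  # one event message, one acknowledgement
        if not acked:
            raise RuntimeError("event logger did not acknowledge")
        self.delivered.append(payload)

    def replay_for(self, dest_rank):
        # After a failure of dest_rank, the sender can resend its logged payloads.
        return [p for d, p in self.sent_log if d == dest_rank]
```

In this sketch every delivered message costs two extra control messages on the critical path, which is the kind of latency overhead the abstract attributes to pessimistic logging. In exchange, a restarted process can be fed its messages again from `replay_for` without rolling back the other processes.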
Added 20 Aug 2010
Updated 20 Aug 2010
Type Conference
Year 2004
Where CLUSTER
Authors Pierre Lemarinier, Aurelien Bouteiller, Thomas Hérault, Géraud Krawezik, Franck Cappello