Sciweavers

IPPS
2007
IEEE

Implementing and Evaluating Automatic Checkpointing

13 years 11 months ago
Implementing and Evaluating Automatic Checkpointing
As the size and popularity of computer clusters go on growing, fault tolerance is becoming a crucial factor to ensure high performance and reliability for applications. To provide this facility, a checkpoint mechanism is used to recover a failed parallel application rolling it back to an execution moment prior to occurrence of the failure. In this work we present a mechanism for managing checkpoint operations during the failures automatically. This mechanism records periodically the application’s context, identifies failed nodes and restarts MPI processes on the remaining nodes, allowing the continuity of the application and taking advantage of the computing accomplished previously. We describe a lot of changes inside source of the LAM/MPI. Experiments with an application for recognizing DNA similarity showed that despite the overhead caused by periodic checkpoints, the benefits can reach about 50% on a small cluster.
Antonio S. Martins, Ronaldo Augusto Lara Gon&ccedi
Added 03 Jun 2010
Updated 03 Jun 2010
Type Conference
Year 2007
Where IPPS
Authors Antonio S. Martins, Ronaldo Augusto Lara Gonçalves
Comments (0)