
CCGRID
2006
IEEE

Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation

With the increasing number of processors in modern HPC (High Performance Computing) systems, two problems have emerged: scalability and fault tolerance. In our previous work, we extended the MPI specification to handle fault tolerance by specifying a systematic framework for the recovery methods, communicator, message modes, etc. that define the behavior of MPI when an error occurs. These extensions not only specify how the implementation of the MPI library and RTE (Run Time Environment) handle failures at the system level, but also provide ordinary HPC application developers with a range of recovery choices that differ in performance and cost. In this paper, we continue the work on extending MPI’s capabilities in this direction. First, we propose an MPI operation level checkpoint/rollback library to recover the user’s data. More importantly, we argue that the future generation programming model of a fault tolerant MPI application should be r...
Yuan Tang, Graham E. Fagg, Jack Dongarra
Added 10 Jun 2010
Updated 10 Jun 2010
Type Conference
Year 2006
Where CCGRID
Authors Yuan Tang, Graham E. Fagg, Jack Dongarra