Sciweavers

PVM
2005
Springer

Scalable Fault Tolerant MPI: Extending the Recovery Algorithm

Fault Tolerant MPI (FT-MPI) [6] was designed to give applications several ways to handle process failures beyond simple checkpoint/restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system state recovery algorithm designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation, although robust, were very conservative, and this affected scalability both on very large clusters and on distributed systems. This paper details the FT-MPI recovery algorithm and our initial experiments with new recovery algorithms that aim to be both scalable and latency tolerant. Our conclusions show that using topology-aware collective communication together with distributed consensus algorithms produces the best results.
Added 28 Jun 2010
Updated 28 Jun 2010
Type Conference
Year 2005
Where PVM
Authors Graham E. Fagg, Thara Angskun, George Bosilca, Jelena Pjesivac-Grbovic, Jack Dongarra