MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI

Abstract: High-performance computing platforms such as clusters, grids, and desktop grids are becoming larger and subject to more frequent failures. MPI is one of the most widely used message-passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing, and comparing several automatic fault tolerance protocols for MPI applications. We present an extensive related work section highlighting the originality of our approach and the proposed protocols. We then present four fault-tolerant protocols implemented in a new generic framework for fault-tolerant protocol comparison, covering a large spectrum of known approaches, from coordinated checkpointing to uncoordinated checkpointing combined with causal message logging. We measure the performance of these protocols on a microbenchmark and compare them on the NAS benchmark, using an original fault tolerance test. Finally, we outline the lessons learned from this in depth ...
Added 12 Dec 2010
Updated 12 Dec 2010
Type Journal
Year 2006
Authors Aurelien Bouteiller, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, Franck Cappello