Sciweavers

GPC
2007
Springer

Fault Management in P2P-MPI

We present recent developments in fault management for P2P-MPI, a grid middleware, covering both fault tolerance for applications and fault detection. P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Applications are monitored by a distributed set of external modules called failure detectors. The contribution of this paper is an analysis of the advantages and drawbacks of such detectors in a real implementation, and their integration into P2P-MPI. We pay particular attention to the reliability of the failure detection service and to the failure detection speed. We propose a variant of the binary round-robin protocol whose failure detection is, in all cases, more reliable than the application execution itself. Experiments on applications of up to 256 processes, carried out on Grid’5000, show that the actual detection times closely match the predictions.
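The abstract does not describe the proposed variant, so the following is only a minimal sketch of the baseline binary round-robin (BRR) gossip schedule as it is commonly formulated in the gossip-style failure detection literature: in round r, node s sends its heartbeat table to node (s + 2^(r-1)) mod n, and a cycle lasts ceil(log2 n) rounds. The function names (`brr_destination`, `rounds_per_cycle`, `simulate_spread`) are hypothetical and are not taken from P2P-MPI.

```python
def brr_destination(sender: int, round_no: int, n: int) -> int:
    """Peer contacted by `sender` in round `round_no` (1-based) among n nodes.

    Assumed baseline BRR rule: d = (s + 2^(r-1)) mod n.
    """
    return (sender + 2 ** (round_no - 1)) % n


def rounds_per_cycle(n: int) -> int:
    """Number of rounds in one BRR cycle: ceil(log2(n)), at least 1."""
    return max(1, (n - 1).bit_length())


def simulate_spread(n: int, origin: int = 0) -> set[int]:
    """Return the nodes that hold `origin`'s heartbeat after one full cycle.

    This is a failure-free simulation: every informed node forwards
    what it knows to its BRR destination in each round.
    """
    informed = {origin}
    for r in range(1, rounds_per_cycle(n) + 1):
        informed |= {brr_destination(s, r, n) for s in informed}
    return informed


if __name__ == "__main__":
    n = 256  # largest process count mentioned in the abstract
    reached = simulate_spread(n)
    print(f"{len(reached)} of {n} nodes informed "
          f"after {rounds_per_cycle(n)} rounds")
```

Under this assumed schedule, and with no failures or message loss, the set of informed nodes doubles each round, which is why the cycle length grows only logarithmically with the number of processes. The reliability and detection-time analysis reported in the paper concerns the authors' variant, which this sketch does not attempt to reproduce.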
Stéphane Genaud, Choopan Rattanapoka
Added 07 Jun 2010
Updated 07 Jun 2010
Type Conference