
CORR
2008
Springer

Algorithmic Based Fault Tolerance Applied to High Performance Computing

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our
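The checksum idea behind Algorithmic Based Fault Tolerance can be illustrated with a minimal, sequential sketch. It is not the authors' parallel subroutine: it assumes small integer matrices (so checksum comparisons are exact; floating-point data would need a tolerance), a single corrupted entry in the data part of the result, and hypothetical helper names (`abft_matmul`, `detect_and_correct`) chosen only for this illustration.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    inner = len(b)
    return [[sum(row[k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))] for row in a]


def abft_matmul(a, b):
    """Multiply checksum-augmented operands (illustrative helper).

    A gets an extra row holding its column sums, B gets an extra column
    holding its row sums. Their product is C with an extra checksum row
    and an extra checksum column.
    """
    a_c = [row[:] for row in a] + [[sum(col) for col in zip(*a)]]
    b_r = [row[:] + [sum(row)] for row in b]
    return matmul(a_c, b_r)


def detect_and_correct(cf):
    """Locate and repair one corrupted data entry in a checksum matrix.

    Returns None if all checksums agree, or the (row, col) index that
    was repaired in place. Raises ValueError for patterns this simple
    sketch cannot handle (e.g. several errors, or a corrupted checksum).
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the entry from the row checksum and the other entries.
        others = sum(cf[i][:m]) - cf[i][j]
        cf[i][j] = cf[i][m] - others
        return (i, j)
    raise ValueError("error pattern not correctable by this sketch")


if __name__ == "__main__":
    cf = abft_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    cf[1][0] ^= 4                      # simulate a bit-flip in one entry
    print(detect_and_correct(cf))      # index of the repaired entry
    print([row[:2] for row in cf[:2]])  # recovered product
```

A mismatch in exactly one row checksum and one column checksum pinpoints the faulty entry, which is why a single error can be both detected and corrected without recomputing the product.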
Added 09 Dec 2010
Updated 09 Dec 2010
Type Journal
Year 2008
Where CORR
Authors George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou