
IPPS
2006
IEEE

Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

As the desire of scientists to perform ever larger computations drives the size of today’s high performance computers from hundreds, to thousands, and even tens of thousands of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces considerable overhead, especially when applications modify a large amount of memory between checkpoints. This paper presents an algorithm-based checkpoint-free fault tolerance approach in which, instead of taking checkpoints periodically, a coded, globally consistent state of the critical application data is maintained in memory by modifying applications to operate on encoded data. Although this approach is not as generally applicable as the typical checkpoint/rollback-recovery approach, in parallel linear algebra computations where it usually works, because no periodic checkpoint or rollback-recovery is involved in thi...
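The idea of operating on encoded data can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain sum checksum held by one extra process, simulates processes as list entries, and uses the hypothetical helper names `encode`, `axpy`, and `recover`. Because the update is linear, applying it to the checksum blocks keeps the encoding consistent without any checkpoint, and a lost block can be rebuilt from the survivors.

```python
# Illustrative sketch only (assumed sum-checksum encoding, simulated processes).

def encode(blocks):
    """Checksum block: element-wise sum of all data blocks."""
    return [sum(col) for col in zip(*blocks)]

def axpy(alpha, x, y):
    """Local linear update on one block: returns alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def recover(blocks, checksum, lost):
    """Rebuild the block at index `lost` from the checksum and surviving blocks."""
    result = list(checksum)
    for i, block in enumerate(blocks):
        if i != lost:
            result = [r - v for r, v in zip(result, block)]
    return result

# Three data "processes", each holding one block of x and one block of y.
x = [[1, 2], [3, 4], [5, 6]]
y = [[10, 20], [30, 40], [50, 60]]

# One extra "checksum process" holds the encoded blocks.
cx, cy = encode(x), encode(y)

# Every process, including the checksum process, applies the same update.
alpha = 2
y_new = [axpy(alpha, xb, yb) for xb, yb in zip(x, y)]
cy_new = axpy(alpha, cx, cy)

# The checksum relationship still holds after the update.
assert encode(y_new) == cy_new

# Simulate losing process 1 and rebuild its block of y from the others.
lost = 1
rebuilt = recover(y_new, cy_new, lost)
assert rebuilt == y_new[lost]
```

The sketch tolerates a single failure; surviving more than one simultaneous failure would need additional, independently weighted checksum blocks, which are not shown here.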
Zizhong Chen, Jack Dongarra
Added 12 Jun 2010
Updated 12 Jun 2010
Type Conference
Year 2006
Where IPPS
Authors Zizhong Chen, Jack Dongarra