Algorithm-based recovery for iterative methods without checkpointing

In today’s high-performance computing practice, fail-stop failures are often tolerated by checkpointing. Although checkpointing is a very general technique that applies to a wide range of applications, it can introduce considerable overhead, especially when computations reach petascale and beyond. In this paper, we show that, for many iterative methods, if the parallel data partitioning scheme satisfies certain conditions, the iterative methods themselves maintain enough inherent redundant information to accurately recover the lost data without checkpointing. We analyze the block row data partitioning scheme for sparse matrices and derive a sufficient condition for recovering the critical data without checkpointing. When this sufficient condition is satisfied, neither checkpointing nor rollback is necessary for recovery. Furthermore, the fault tolerance time overhead is zero if no failures actually occur during program execution. Overhead is introduced ...
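To illustrate the kind of inherent redundancy the abstract describes, here is a minimal Python sketch. It is not the paper's algorithm or its stated condition. It assumes that, under block row partitioning, computing y = A x makes each process receive the remote entries of x that its rows reference, and that a failed process's block of x can be rebuilt only if every entry of that block was received by at least one surviving process. The names `ghost_indices` and `recover_block`, the dense list-of-lists matrix, and the single-machine simulation of processes are all hypothetical choices made for illustration.

```python
def ghost_indices(A, block_size):
    """For each simulated process, return the set of remote x-indices its rows reference.

    Process p owns rows (and x entries) p*block_size .. (p+1)*block_size - 1.
    Assumes len(A) is a multiple of block_size.
    """
    n = len(A)
    nprocs = n // block_size
    ghosts = [set() for _ in range(nprocs)]
    for i in range(n):
        p = i // block_size
        for j in range(n):
            # A nonzero outside the diagonal block means x[j] must be received.
            if A[i][j] != 0 and j // block_size != p:
                ghosts[p].add(j)
    return ghosts


def recover_block(A, x, block_size, failed):
    """Try to rebuild the failed process's block of x from surviving ghost copies.

    Returns {index: value} on success, or None if some entry has no surviving copy.
    """
    ghosts = ghost_indices(A, block_size)
    # Simulate the copies each process holds after the communication step.
    ghost_values = [{j: x[j] for j in g} for g in ghosts]
    recovered = {}
    for j in range(failed * block_size, (failed + 1) * block_size):
        holders = [p for p, g in enumerate(ghosts) if p != failed and j in g]
        if not holders:
            return None  # x[j] was never sent to another process
        recovered[j] = ghost_values[holders[0]][j]
    return recovered


# Tridiagonal test matrix (an illustrative choice, not taken from the paper).
A = [
    [2, -1, 0, 0],
    [-1, 2, -1, 0],
    [0, -1, 2, -1],
    [0, 0, -1, 2],
]
x = [1.0, 2.0, 3.0, 4.0]

one_row_per_process = recover_block(A, x, block_size=1, failed=1)
two_rows_per_process = recover_block(A, x, block_size=2, failed=0)
```

With one row per process, the lost entry x[1] survives as a ghost copy on its neighbours, so recovery succeeds. With two rows per process, x[0] is referenced only inside its own block, no other process holds a copy, and the sketch reports that recovery is not possible. This shows why the result depends on how the data is partitioned.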
Zizhong Chen
Added 20 Aug 2011
Updated 20 Aug 2011
Type Journal
Year 2011
Where HPDC
Authors Zizhong Chen