Sciweavers

ICS
2011
Tsinghua U.

High performance linpack benchmark: a fault tolerant implementation without checkpointing

The probability that a failure will occur before a computation finishes increases with the number of processors used in a high performance computing application. For long-running applications on a large number of processors, fault tolerance is essential to avoid losing all completed work after a failure. Checkpointing has long been used to tolerate failures, but it often introduces considerable overhead, especially when applications modify a large amount of memory between checkpoints and the number of processors is large. In this paper, we propose an algorithm-based recovery scheme for the High Performance Linpack benchmark (which modifies a large amount of memory in each iteration) to tolerate fail-stop failures without checkpointing. Huang and Abraham proved that a checksum added to a matrix is maintained after the matrix is factored. We demonstrate that, for the right-looking LU factorization alg...
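The checksum property attributed to Huang and Abraham can be illustrated with a small sketch. This is not the paper's implementation: it assumes a tiny dense matrix on a single process, elimination without pivoting, and exact rational arithmetic, whereas HPL uses partial pivoting on a distributed matrix. The function names (`with_row_checksum`, `eliminate_with_checksum`, `recover_entry`) are hypothetical and chosen only for this example. A checksum column holding each row's sum is appended to the matrix, the elimination updates that column like any other, and the sketch records whether each row still sums to its checksum after every step.

```python
from fractions import Fraction


def with_row_checksum(a):
    """Append to each row the sum of its entries (a row-checksum column)."""
    return [[Fraction(x) for x in row] + [Fraction(sum(row))] for row in a]


def checksum_holds(m):
    """True if every row's last entry equals the sum of its other entries."""
    return all(row[-1] == sum(row[:-1]) for row in m)


def eliminate_with_checksum(a):
    """Right-looking Gaussian elimination without pivoting on [A | A*e].

    Returns the reduced matrix [U | U*e] and a list recording whether the
    checksum relation held initially and after each elimination step.
    """
    m = with_row_checksum(a)
    n = len(m)
    held = [checksum_holds(m)]
    for k in range(n - 1):
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]  # assumption: pivot is nonzero
            # Update the whole trailing row, checksum column included.
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]
        held.append(checksum_holds(m))
    return m, held


def recover_entry(row, lost_index):
    """Rebuild one lost entry of a row from the survivors and the checksum."""
    survivors = sum(x for j, x in enumerate(row[:-1]) if j != lost_index)
    return row[-1] - survivors


a = [[4, 3, 2], [8, 7, 9], [12, 13, 20]]
reduced, held = eliminate_with_checksum(a)
print(held)                          # checksum status after each step
print(recover_entry(reduced[1], 2))  # entry rebuilt from the checksum
```

Because the checksum stays consistent with the partially factored matrix at each step in this simplified setting, a single lost entry in a row can be rebuilt by subtracting the surviving entries from the checksum, with no saved checkpoint to roll back to.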
Added 29 Aug 2011
Updated 29 Aug 2011
Type Journal
Year 2011
Where ICS
Authors Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, Zizhong Chen