Sciweavers

TPDS
2008

Algorithm-Based Fault Tolerance for Fail-Stop Failures

Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. Previous algorithm-based fault tolerance research has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation remains an open question. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in ...
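The end-of-computation checksum relationship described in the abstract can be illustrated with a small sketch. It follows the classic row/column checksum encoding from earlier algorithm-based fault tolerance work, not the paper's ScaLAPACK implementation. The function names (`matmul`, `column_checksum`, `row_checksum`, `locate_and_correct`) are hypothetical, and the sketch assumes integer entries (so comparisons are exact) and at most one corrupted entry in the data part of the result.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_and_correct(cf):
    """Check a full checksum matrix and repair a single bad data entry.

    Assumption: at most one entry in the data part (not in the checksum
    row or column) is wrong. Returns the (row, col) that was repaired,
    or None if every checksum already holds.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is outside the single-error assumption")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the bad entry from the row checksum and the other row entries.
    cf[i][j] = cf[i][m] - sum(cf[i][k] for k in range(m) if k != j)
    return (i, j)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Multiplying the encoded inputs yields a result that carries both checksums.
cf = matmul(column_checksum(a), row_checksum(b))

cf[1][0] = 99                      # simulate a miscalculation
repaired = locate_and_correct(cf)  # intersecting bad row and bad column
```

Here the mismatching row and column checksums intersect at the corrupted entry, which is then recomputed from the row checksum. The check runs only on the final result, which is the setting the earlier research covers.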
Zizhong Chen, Jack Dongarra
Added 15 Dec 2010
Updated 15 Dec 2010
Type Journal
Year 2008
Where TPDS
Authors Zizhong Chen, Jack Dongarra