Sciweavers

IPPS
2007
IEEE

Self Adaptive Application Level Fault Tolerance for Parallel and Distributed Computing

13 years 10 months ago
Self Adaptive Application Level Fault Tolerance for Parallel and Distributed Computing
Most application level fault tolerance schemes in literature are non-adaptive in the sense that the fault tolerance schemes incorporated in applications are usually designed without incorporating information from system environments such as the amount of available memory and the local or network I/O bandwidth. However, from an application point of view, it is often desirable for fault tolerant high performance applications to be able to achieve high performance under whatever system environment it executes with as low fault tolerance overhead as possibile. In this paper, we demonstrate that, in order to achieve high reliability with as low performance penalty as possible, fault tolerant schemes in applications need to be able to adapt themselves to different system environments. We propose a framework under which different fault tolerant schemes can be incorporated in applications using an adaptive method. Under this framework, applications are able to choose near optimal fault tolera...
Zizhong Chen, Ming Yang, Guillermo A. Francia III,
Added 03 Jun 2010
Updated 03 Jun 2010
Type Conference
Year 2007
Where IPPS
Authors Zizhong Chen, Ming Yang, Guillermo A. Francia III, Jack Dongarra
Comments (0)