Sciweavers

ICCS
2007
Springer

Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Replication

Abstract. As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques reduce the amount of work lost due to changing system availability but can introduce significant run-time overhead. This overhead depends largely on the length of the checkpointing interval and the chosen number of replicas, respectively. This paper presents a dynamic scheduling algorithm that switches between periodic checkpointing and replication to exploit the advantages of both techniques and to reduce the overhead. Furthermore, several heuristics are discussed that perform on-line adaptive tuning of the checkpointing period based on historical information on resource behavior. Simulation-based comparison of the proposed combined algorithm versus traditional strategies bas...
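
As a rough illustration of what tuning the checkpointing period from historical resource behavior might look like, the sketch below estimates the mean time between failures from past failure timestamps and feeds it into Young's approximation for the checkpoint interval, sqrt(2 * C * MTBF). This is not the paper's heuristic, which the abstract does not specify. The function names, the default interval, and the idle-resource rule for switching between replication and checkpointing are all assumptions made for illustration.

```python
import math


def estimate_mtbf(failure_times):
    """Mean gap (seconds) between consecutive observed failure timestamps.

    Returns None when fewer than two failures have been observed.
    """
    if len(failure_times) < 2:
        return None
    gaps = [later - earlier
            for earlier, later in zip(failure_times, failure_times[1:])]
    return sum(gaps) / len(gaps)


def checkpoint_interval(checkpoint_cost, mtbf, default=600.0):
    """Young's approximation for the checkpoint interval: sqrt(2 * C * MTBF).

    checkpoint_cost is the time (seconds) needed to write one checkpoint.
    The default of 600 s is an arbitrary fallback, not a value from the paper.
    """
    if mtbf is None or mtbf <= 0:
        return default
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def choose_strategy(idle_resources, queued_tasks):
    """Hypothetical switching rule (an assumption, not the paper's algorithm).

    Replicate while spare resources outnumber waiting tasks, because replicas
    then cost little; otherwise fall back to periodic checkpointing.
    """
    return "replication" if idle_resources > queued_tasks else "checkpointing"


if __name__ == "__main__":
    history = [0.0, 100.0, 300.0]            # observed failure timestamps
    mtbf = estimate_mtbf(history)            # mean of the gaps 100 and 200
    print(checkpoint_interval(3.0, mtbf))    # interval for a 3 s checkpoint
    print(choose_strategy(idle_resources=5, queued_tasks=2))
```

Re-running the estimate whenever a new failure is observed shortens the interval on unstable resources and lengthens it on stable ones, which is the general effect an on-line adaptive scheme aims for.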
Added 08 Jun 2010
Updated 08 Jun 2010
Type Conference
Year 2007
Where ICCS
Authors Maria Chtepen, Filip H. A. Claeys, Bart Dhoedt, Filip De Turck, Peter A. Vanrolleghem, Piet Demeester