Sciweavers

ICDCS
2012
IEEE

Combining Partial Redundancy and Checkpointing for HPC

Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at a large scale spend more than 50% of their total time saving checkpoints, restarting and redoing lost work. Redundancy is another fault tolerance techniqu...
Added 29 Sep 2012
Updated 29 Sep 2012
Type Journal
Year 2012
Where ICDCS
Authors James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt B. Ferreira, Christian Engelmann