Combining Partial Redundancy and Checkpointing for HPC



Today’s largest High Performance Computing (HPC) systems exceed one petaflops (10^15 floating-point operations per second), and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at a large scale spend more than 50% of their total time saving checkpoints, restarting and redoing lost work. Redundancy is another fault tolerance techniqu...

James Elliott, Kishor Kharbas, David Fiala, Frank


Distributed And Parallel Computing | Fault Tolerance Techniques | ICDCS 2012 | Partial Redundancy | Wallclock Time


Post Info

Added	29 Sep 2012
Updated	29 Sep 2012
Type	Journal
Year	2012
Where	ICDCS
Authors	James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt B. Ferreira, Christian Engelmann
