IPPS
2006
IEEE

Evaluating cooperative checkpointing for supercomputing systems

Cooperative checkpointing, in which the system dynamically skips checkpoints requested by applications at runtime, can exploit system-level information to improve performance and reliability in the face of failures. We evaluate the applicability of cooperative checkpointing to large-scale systems through simulation studies considering real workloads, failure logs, and different network topologies. We consider two cooperative checkpointing algorithms: work-based cooperative checkpointing uses a heuristic based on the amount of unsaved work, and risk-based cooperative checkpointing leverages failure event prediction. Our results demonstrate that, compared to periodic checkpointing, risk-based checkpointing with event prediction accuracy as low as 10% is able to significantly improve system utilization and reduce average bounded slowdown by a factor of 9, without losing any additional work to failures. Similarly, work-based checkpointing conferred tremendous performance benefits in the f...
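The abstract does not spell out either heuristic, so the following is only a minimal sketch of how a system-side gatekeeper might grant or skip an application's checkpoint request. The decision rules, the `CheckpointRequest` type, and all function and parameter names are illustrative assumptions, not the authors' algorithms. The `bounded_slowdown` helper uses the definition common in the job scheduling literature; the 10-second threshold is likewise an assumed default.

```python
from dataclasses import dataclass


@dataclass
class CheckpointRequest:
    """A checkpoint requested by an application (hypothetical type)."""
    unsaved_work_s: float     # work done since the last completed checkpoint, in seconds
    checkpoint_cost_s: float  # estimated overhead of performing this checkpoint, in seconds


def work_based_grant(req: CheckpointRequest) -> bool:
    # Assumed rule: perform the checkpoint only when the work at risk
    # is at least as large as the cost of saving it; otherwise skip.
    return req.unsaved_work_s >= req.checkpoint_cost_s


def risk_based_grant(req: CheckpointRequest, failure_probability: float) -> bool:
    # Assumed rule: weight the work at risk by the predicted probability
    # of a failure before the next checkpoint opportunity.
    expected_loss_s = failure_probability * req.unsaved_work_s
    return expected_loss_s >= req.checkpoint_cost_s


def bounded_slowdown(wait_s: float, run_s: float, threshold_s: float = 10.0) -> float:
    # Slowdown with very short jobs clamped to threshold_s so they do not
    # dominate the average; the result is never below 1.
    return max((wait_s + run_s) / max(run_s, threshold_s), 1.0)


if __name__ == "__main__":
    req = CheckpointRequest(unsaved_work_s=3600.0, checkpoint_cost_s=720.0)
    print(work_based_grant(req))
    print(risk_based_grant(req, failure_probability=0.1))
    print(bounded_slowdown(wait_s=90.0, run_s=10.0))
```

Under these assumed rules, a request with an hour of unsaved work and a 12-minute checkpoint cost is granted by the work-based rule but skipped by the risk-based rule when the predicted failure probability is only 0.1.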
Adam J. Oliner, Ramendra K. Sahoo
Added 12 Jun 2010
Updated 12 Jun 2010
Type Conference
Year 2006
Where IPPS
Authors Adam J. Oliner, Ramendra K. Sahoo