
SBACPAD
2015
IEEE

A Fault-Tolerance Protocol for Parallel Applications with Communication Imbalance

Abstract: The predicted failure rates of future supercomputers threaten the groundbreaking research that large machines are expected to foster. Resilient extreme-scale applications are therefore an absolute necessity for using the new generation of supercomputers effectively. Rollback-recovery techniques have traditionally been used in HPC to provide resilience. Among those techniques, message logging offers the appealing features of saving energy, accelerating recovery, and incurring a low performance penalty. Its increased memory consumption is, however, an important downside. This paper introduces memory-constrained message logging (MCML), a general framework for decreasing the memory footprint of message-logging protocols. In particular, we demonstrate the effectiveness of MCML in keeping message logging feasible for applications with substantial communication imbalance. Applications of this type appear in many scientific fields. We present experimental results with several parallel codes...
Esteban Meneses, Laxmikant V. Kalé
Added 17 Apr 2016
Updated 17 Apr 2016
Type Journal
Year 2015
Where SBACPAD
Authors Esteban Meneses, Laxmikant V. Kalé