SC
2015
ACM

BAD-check: bulk asynchronous distributed checkpointing

Leadership-scale scientific simulations running as tens of thousands of tightly-coupled MPI processes are vulnerable to interruption due to a single process or node failure. Due to the dependence of each state calculation on the successful completion of each of the prior state calculations, checkpoint-restart is the most widely used technique to achieve fault tolerance. To write a consistent view of distributed state as a checkpoint, applications typically synchronize and pause while writing data to persistent media. In this paper we present a transactional protocol that enables asynchronous distributed creation of checkpoint data sets, and describe the conditions under which it is beneficial. With simulations, we demonstrate that scientific applications exhibiting computational variance without frequent synchronization can use our protocol to either reduce run time by up to 27% or reduce required storage system capability by up to 40%.
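To illustrate why removing the synchronization point can help when compute times vary, the following is a minimal toy model, not the protocol from the paper. It assumes that ranks never synchronize between checkpoints, that every rank takes the same fixed time to write its piece, and that storage contention can be ignored. The function names (`sync_runtime`, `async_runtime`, `commit_times`) and the input layout are hypothetical and chosen only for this sketch.

```python
def sync_runtime(compute, write_time):
    """Synchronous model: all ranks barrier before each checkpoint,
    so every interval costs the slowest rank's compute time plus the write."""
    intervals = len(compute[0])
    return sum(max(rank[i] for rank in compute) + write_time
               for i in range(intervals))


def async_runtime(compute, write_time):
    """Asynchronous model (assumption): each rank writes its piece as soon
    as it reaches the checkpoint step, without waiting for the others."""
    return max(sum(rank) + write_time * len(rank) for rank in compute)


def commit_times(compute, write_time):
    """Time at which each checkpoint could be committed as a whole:
    the moment the last rank finishes writing its piece of that checkpoint."""
    intervals = len(compute[0])
    clocks = [0] * len(compute)  # per-rank elapsed time
    commits = []
    for i in range(intervals):
        for r, rank in enumerate(compute):
            clocks[r] += rank[i] + write_time
        commits.append(max(clocks))
    return commits


# compute[r][i] = compute time of rank r in interval i (arbitrary units)
compute = [[1, 3],
           [3, 1]]
print(sync_runtime(compute, write_time=1))   # barrier model
print(async_runtime(compute, write_time=1))  # no-barrier model
print(commit_times(compute, write_time=1))   # when each checkpoint is complete
```

In this model the asynchronous run time is never larger than the synchronous one, because the maximum of per-rank sums cannot exceed the sum of per-interval maxima. The gap grows with per-interval variance and vanishes when all ranks take the same time in every interval.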
Added 17 Apr 2016
Updated 17 Apr 2016
Type Journal
Year 2015
Where SC
Authors John Bent, Brad Settlemyer, Haiyun Bao, Sorin Faibish, Jeremy Sauer, Jingwang Zhang