Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems

Clusters and applications continue to grow in size while their mean time between failures (MTBF) shrinks, so Checkpoint/Restart is becoming increasingly important for large-scale parallel jobs. However, the performance of the Checkpoint/Restart mechanism does not scale well with increasing job size because of constraints within the file system. Multi-core architectures aggravate the situation: a larger number of processes run on the same node and try to checkpoint simultaneously. This increases the number of file writes at checkpoint time, which degrades performance. As a result, deployment of Checkpoint/Restart mechanisms for large-scale parallel applications is limited. In this work, we explore the Checkpoint/Restart mechanism in MVAPICH2, which uses BLCR as the checkpointing library. Our profiling of the checkpoints for the NAS parallel benchmarks revealed a large number of small file writes interspers...

Xiangyong Ouyang, Karthik Gopalakrishnan, Dhabaleswar K. Panda

Checkpoint/Restart Mechanism | Distributed and Parallel Computing | ICPP 2009 | Large Scale Parallel | File Writes

Post Info

Added	23 May 2010
Updated	23 May 2010
Type	Conference
Year	2009
Where	ICPP
Authors	Xiangyong Ouyang, Karthik Gopalakrishnan, Dhabaleswar K. Panda
