Sciweavers

SC
2009
ACM

Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

The scalability of future massively parallel processing (MPP) systems is severely challenged by high failure rates. Current hard disk drive (HDD) checkpointing incurs an overhead of 25% or more at the petascale. Because checkpoint frequency correlates directly with node count, novel techniques that can take more frequent checkpoints with minimal overhead are critical to implementing a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism. After a thorough analysis of MPP system failure rates and failure sources, we propose three variants of PCRAM-based hybrid checkpointing schemes (DIMM+HDD, DIMM+DIMM, and 3D+3D), which reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with overhead less than 4%...
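The link between node count, checkpoint frequency, and overhead can be illustrated with a rough sketch. It uses Young's first-order approximation for the checkpoint interval, which is a standard textbook model and not necessarily the analysis used in the paper. The function names and all numeric inputs below are hypothetical, chosen only for illustration, and the model assumes independent node failures and a checkpoint time much shorter than the system MTBF.

```python
import math


def system_mtbf(node_mtbf_hours: float, node_count: int) -> float:
    """System MTBF, assuming independent and identical node failures."""
    return node_mtbf_hours / node_count


def young_interval(checkpoint_hours: float, mtbf_hours: float) -> float:
    """Young's approximation of the checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def overhead_fraction(checkpoint_hours: float, mtbf_hours: float) -> float:
    """First-order overhead at Young's interval.

    Sum of time spent writing checkpoints (delta / tau) and expected
    rework after a failure (tau / (2 * M)).
    """
    tau = young_interval(checkpoint_hours, mtbf_hours)
    return checkpoint_hours / tau + tau / (2.0 * mtbf_hours)


if __name__ == "__main__":
    # Hypothetical inputs: node MTBF and checkpoint time are assumptions.
    node_mtbf = 100_000.0   # hours per node (illustrative)
    checkpoint_time = 0.5   # hours to write one checkpoint (illustrative)
    for nodes in (1_000, 4_000, 16_000):
        mtbf = system_mtbf(node_mtbf, nodes)
        tau = young_interval(checkpoint_time, mtbf)
        ovh = overhead_fraction(checkpoint_time, mtbf)
        print(f"nodes={nodes} interval={tau:.2f}h overhead={ovh:.1%}")
```

Under this model, quadrupling the node count halves the checkpoint interval and doubles the overhead, so a shorter checkpoint time (for example, from faster local storage) is the main lever for keeping the overhead low.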
Added 19 May 2010
Updated 19 May 2010
Type Conference
Year 2009
Where SC
Authors Xiangyu Dong, Naveen Muralimanohar, Norman P. Jouppi, Richard Kaufmann, Yuan Xie