REMEM: REmote MEMory as Checkpointing Storage

Checkpointing is a widely used mechanism for supporting fault tolerance, but it is notorious for its high-cost disk access. Memory-based checkpointing has been studied extensively in research but has had little success in practice because of its complexity and potential reliability concerns. In this study we present the design and implementation of REMEM, a REmote MEMory checkpointing system that extends checkpointing storage from disk to remote memory. A unique feature of REMEM is that it can be integrated seamlessly into existing disk-based checkpointing systems. A user can flexibly switch between REMEM and disk as checkpointing storage to balance efficiency and reliability. The implementation of REMEM on Open MPI is also introduced. The experimental results confirm that REMEM and the proposed adaptive checkpointing storage selection are promising in performance, reliability, and scalability.
Hui Jin, Xian-He Sun, Yong Chen, Tao Ke
Added 10 Feb 2011
Updated 10 Feb 2011
Type Journal
Year 2010