Sciweavers

63 search results - page 2 / 13
» Adaptive incremental checkpointing for massively parallel sy...
LCPC
2007
Springer
13 years 11 months ago
Compiler-Enhanced Incremental Checkpointing
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety o...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
IPPS
2007
IEEE
13 years 11 months ago
DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications....
Joseph F. Ruscio, Michael A. Heffner, Srinidhi Var...
SC
2009
ACM
14 years 3 days ago
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
The scalability of future massively parallel processing (MPP) systems is being severely challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in ov...
Xiangyu Dong, Naveen Muralimanohar, Norman P. Joup...
ICDCS
2008
IEEE
13 years 11 months ago
stdchk: A Checkpoint Storage System for Desktop Grid Computing
Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that...
Samer Al-Kiswany, Matei Ripeanu, Sudharshan S. Vaz...
CLOUDCOM
2010
Springer
13 years 3 months ago
REMEM: REmote MEMory as Checkpointing Storage
Checkpointing is a widely used mechanism for supporting fault tolerance, but notorious in its high-cost disk access. The idea of memory-based checkpointing has been extensively stu...
Hui Jin, Xian-He Sun, Yong Chen, Tao Ke