Sciweavers

HPDC
2011
IEEE
12 years 8 months ago
Algorithm-based recovery for iterative methods without checkpointing
In today’s high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied...
Zizhong Chen
ICPPW
2009
IEEE
13 years 2 months ago
Analyzing Checkpointing Trends for Applications on the IBM Blue Gene/P System
Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime ...
Harish Gapanati Naik, Rinku Gupta, Pete Beckman
ICPADS
2010
IEEE
13 years 2 months ago
Hybrid Checkpointing for MPI Jobs in HPC Environments
As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images ev...
Chao Wang, Frank Mueller, Christian Engelmann, Ste...
CLOUDCOM
2010
Springer
13 years 2 months ago
REMEM: REmote MEMory as Checkpointing Storage
Checkpointing is a widely used mechanism for supporting fault tolerance, but it is notorious for its high-cost disk access. The idea of memory-based checkpointing has been extensively stu...
Hui Jin, Xian-He Sun, Yong Chen, Tao Ke
TMC
2010
13 years 2 months ago
Decentralized QoS-Aware Checkpointing Arrangement in Mobile Grid Computing
This paper deals with decentralized, QoS-aware middleware for checkpointing arrangement in Mobile Grid (MoG) computing systems. Checkpointing is more crucial in MoG systems than...
Paul J. Darby III, Nian-Feng Tzeng
TPDS
1998
13 years 4 months ago
On Coordinated Checkpointing in Distributed Systems
Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, ...
Guohong Cao, Mukesh Singhal
SIGOPS
2002
13 years 4 months ago
Comments on "transparent user-level process checkpoint and restore for migration" by Bozyigit and Wasiq
The simple checkpointing and migration system for UNIX processes as described in the article of Bozyigit and Wasiq [1] can be improved in two ways: First by a technique to checkpo...
Felix Rauch, Thomas Stricker
CLUSTER
2004
IEEE
13 years 4 months ago
MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-pas...
Rajanikanth Batchu, Yoginder S. Dandass, Anthony S...
JPDC
2007
13 years 4 months ago
Self-stabilizing algorithm for checkpointing in a distributed system
If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detec...
Partha Sarathi Mandal, Krishnendu Mukhopadhyaya
JPDC
2006
13 years 4 months ago
Performance analysis of different checkpointing and recovery schemes using stochastic model
Several schemes for checkpointing and rollback recovery have been reported in the literature. In this paper, we analyze some of these schemes under a stochastic model. We have der...
Partha Sarathi Mandal, Krishnendu Mukhopadhyaya