Sciweavers

1256 search results - page 7 / 252
» On Coordinated Checkpointing in Distributed Systems
IPPS
2006
IEEE
15 years 8 months ago
Recent advances in checkpoint/recovery systems
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this, many developers have implemented it, by hand, into their applications. One of ...
Greg Bronevetsky, Rohit Fernandes, Daniel Marques,...
HPDC
2007
IEEE
15 years 8 months ago
Peer-to-peer checkpointing arrangement for mobile grid computing systems
This paper deals with a novel, distributed, QoS-aware, peer-to-peer checkpointing arrangement component for mobile Grid (MoG) computing systems middleware. Checkpointing is more cr...
Paul J. Darby III, Nian-Feng Tzeng
ICDCS
2012
IEEE
13 years 4 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...
CLOUDCOM
2010
Springer
14 years 12 months ago
REMEM: REmote MEMory as Checkpointing Storage
Checkpointing is a widely used mechanism for supporting fault tolerance, but notorious in its high-cost disk access. The idea of memory-based checkpointing has been extensively stu...
Hui Jin, Xian-He Sun, Yong Chen, Tao Ke
CLUSTER
2004
IEEE
15 years 1 month ago
MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-pas...
Rajanikanth Batchu, Yoginder S. Dandass, Anthony S...