Sciweavers

149 search results - page 1 / 30
IPPS
1999
IEEE
13 years 8 months ago
The Performance of Coordinated and Independent Checkpointing
Checkpointing is a very effective technique to tolerate the occurrence of failures in distributed and parallel applications. The existing algorithms in the literature are basicall...
Luís Moura Silva, João Gabriel Silva
CLUSTER
2003
IEEE
13 years 9 months ago
Coordinated Checkpoint versus Message Log for Fault Tolerant MPI
Large clusters, high availability clusters and Grid deployments often suffer from network, node or operating system faults and thus require the use of fault tolerant programmin...
Aurelien Bouteiller, Pierre Lemarinier, Gér...
SC
2000
ACM
13 years 8 months ago
Scalable Fault-Tolerant Distributed Shared Memory
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a ...
Florin Sultan, Thu D. Nguyen, Liviu Iftode
HICSS
2007
IEEE
13 years 10 months ago
Building a Coordination Framework to Support Behavior-Based Adaptive Checkpointing for Open Distributed Embedded Systems
Checkpointing is a commonly used approach to provide fault-tolerance and improve system dependability. However, using a constant and preconfigured checkpointing frequency may comp...
Nianen Chen, Shangping Ren
CLUSTER
2005
IEEE
13 years 10 months ago
Transparent Checkpoint-Restart of Distributed Applications on Commodity Clusters
We have created ZapC, a novel system for transparent coordinated checkpoint-restart of distributed network applications on commodity clusters. ZapC provides a thin virtualization ...
Oren Laadan, Dan B. Phung, Jason Nieh