Sciweavers

9 search results - page 2 / 2
» Coordinated Checkpoint versus Message Log for Fault Tolerant...
Sort
View
IJHPCA
2006
117views more  IJHPCA 2006»
13 years 4 months ago
MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI
Abstract-- High performance computing platforms like Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message...
Aurelien Bouteiller, Thomas Hérault, G&eacu...
IPPS
2007
IEEE
13 years 11 months ago
A Fault Tolerance Protocol with Fast Fault Recovery
Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, rollback all ...
Sayantan Chakravorty, Laxmikant V. Kalé
SC
2000
ACM
13 years 9 months ago
Scalable Fault-Tolerant Distributed Shared Memory
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a ...
Florin Sultan, Thu D. Nguyen, Liviu Iftode
HPDC
2009
IEEE
13 years 11 months ago
Interconnect agnostic checkpoint/restart in open MPI
Long running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Joshua Hursey, Timothy Mattox, Andrew Lumsdaine