Sciweavers

IPPS
2007
IEEE
14 years 3 months ago
Implementing and Evaluating Automatic Checkpointing
As the size and popularity of computer clusters go on growing, fault tolerance is becoming a crucial factor to ensure high performance and reliability for applications. To provide...
Antonio S. Martins, Ronaldo Augusto Lara Gonç...
CLUSTER
2004
IEEE
14 years 1 months ago
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
As high performance clusters continue to grow in size, the mean time between failure shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challengi...
Gengbin Zheng, Lixia Shi, Laxmikant V. Kalé
HPDC
2009
IEEE
14 years 4 months ago
Interconnect agnostic checkpoint/restart in Open MPI
Long running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Joshua Hursey, Timothy Mattox, Andrew Lumsdaine
IPPS
2005
IEEE
14 years 2 months ago
Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance
Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hiberna...
José Carlos Sancho, Fabrizio Petrini, Kei D...
IPPS
2007
IEEE
14 years 3 months ago
A Fault Tolerance Protocol with Fast Fault Recovery
Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, roll back all ...
Sayantan Chakravorty, Laxmikant V. Kalé