Sciweavers

IJHPCA
2006
MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI
Abstract-- High performance computing platforms like Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message...
Aurelien Bouteiller, Thomas Hérault, Gé...
ISCA
2002
IEEE
SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At...
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, ...
CONCURRENCY
2010
Redesigning the message logging model for high performance
Over the past decade, the number of processors in high performance facilities has grown to hundreds of thousands. As a direct consequence, while the computational power follows th...
Aurelien Bouteiller, George Bosilca, Jack Dongarra