Sciweavers

CCGRID
2008
IEEE
13 years 6 months ago
Fault Tolerance in Cluster Federations with O2P-CF
Fault tolerance is one of the key issues for large-scale applications executed on high-performance computing systems. In a cluster federation, clusters are gathered to provide hug...
Thomas Ropars, Christine Morin
CLUSTER
2004
IEEE
13 years 8 months ago
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
Fault tolerance is a very important concern for critical high-performance applications using the MPI library. Several protocols provide automatic and transparent fault detection a...
Pierre Lemarinier, Aurelien Bouteiller, Thomas H...
EDCC
2005
Springer
13 years 10 months ago
Performance Evaluation of Consistent Recovery Protocols Using MPICH-GF
This paper presents an implementation of several consistent protocols at the abstract device level and their performance comparison. We have performed experiments using three NAS P...
Namyoon Woo, Hyungsoo Jung, Dongin Shin, Hyuck Han...