Sciweavers

2 search results - page 1 / 1
» A Scalable Asynchronous Replication-Based Strategy for Fault...
Sort
View
HIPC
2007
Springer
13 years 10 months ago
A Scalable Asynchronous Replication-Based Strategy for Fault Tolerant MPI Applications
As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, ...
John Paul Walters, Vipin Chaudhary
CLUSTER
2004
IEEE
13 years 8 months ago
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
As high performance clusters continue to grow in size, the mean time between failure shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challengi...
Gengbin Zheng, Lixia Shi, Laxmikant V. Kalé