To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementati...
Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox,...
— Large Clusters, high availability clusters and Grid deployments often suffer from network, node or operating system faults and thus require the use of fault tolerant programmin...
As the core count in high-performance computing systems keeps increasing, faults are becoming common place. Checkpointing addresses such faults but captures full process images ev...
Chao Wang, Frank Mueller, Christian Engelmann, Ste...
Large scale compute clusters continue to grow to ever-increasing proportions. However, as clusters and applications continue to grow, the Mean Time Between Failures (MTBF) has redu...
ct Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different methods to handle process failures beyond simple check-point restart schemes. The init...
Graham E. Fagg, Thara Angskun, George Bosilca, Jel...