: We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance techniq...
George Bosilca, Remi Delmas, Jack Dongarra, Julien...
—P2P computing platforms are subject to a wide range of attacks. In this paper, we propose a generalisation of the previous disk-less checkpointing approach for fault-tolerance i...
We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented ...
The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For l...
As the number of processors in today’s high performance computers continues to grow, the mean-time-to-failure of these computers are becoming significantly shorter than the exe...
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julie...