As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault...
As the number of processors in today’s high performance computers continues to grow, the mean-time-to-failure of these computers are becoming significantly shorter than the exe...
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julie...
Most application level fault tolerance schemes in literature are non-adaptive in the sense that the fault tolerance schemes incorporated in applications are usually designed witho...
Zizhong Chen, Ming Yang, Guillermo A. Francia III,...
: We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance techniq...
George Bosilca, Remi Delmas, Jack Dongarra, Julien...
System-level virtualization has been a research topic since the 70’s but regained popularity during the past few years because of the availability of efficient solution such as...