Sciweavers

TC
2008

Adaptive Fault Management of Parallel Applications for High-Performance Computing

As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault management approach that combines proactive migration with reactive checkpointing. It aims to enable parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure prediction. Extensive experiments, using stochastic modeling and case studies with real applications, indicate that FT-Pro outperforms periodic checkpointing, reducing application completion times and improving resource utilization by up to 43 percent.
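The abstract does not spell out how the adaptation manager decides, so the following is only an illustrative sketch of one way such a runtime decision could work, not the authors' algorithm. It assumes a simple expected-time cost model: at each decision point the manager compares skipping, checkpointing, and migrating, and picks the cheapest option given the predictor's estimated failure probability. The names `Costs`, `choose_action`, and all cost fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical cost parameters, all in seconds (not taken from the paper)."""
    interval: float      # useful work between two decision points
    checkpoint: float    # time to write one checkpoint
    migration: float     # time to migrate processes off a suspect node
    recovery: float      # time to restart from the last checkpoint
    work_at_risk: float  # work done since the last checkpoint


def choose_action(p_fail: float, costs: Costs, spare_nodes: int) -> str:
    """Pick 'skip', 'checkpoint', or 'migrate' for the next interval.

    p_fail is the assumed probability, derived from the failure predictor,
    that a failure hits during the next interval. The action with the lowest
    expected time wins; migration is only considered if a spare node exists.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")

    expected = {
        # No protection: a failure loses this interval plus all earlier
        # work that has not been checkpointed yet.
        "skip": costs.interval
        + p_fail * (costs.recovery + costs.work_at_risk + costs.interval),
        # Checkpoint first: a failure then loses only this interval.
        "checkpoint": costs.checkpoint
        + costs.interval
        + p_fail * (costs.recovery + costs.interval),
    }
    if spare_nodes > 0:
        # Simplifying assumption: a completed migration avoids the failure.
        expected["migrate"] = costs.migration + costs.interval

    return min(expected, key=expected.get)


if __name__ == "__main__":
    costs = Costs(interval=600, checkpoint=60, migration=120,
                  recovery=300, work_at_risk=1200)
    print(choose_action(0.7, costs, spare_nodes=1))
    print(choose_action(0.7, costs, spare_nodes=0))
    print(choose_action(0.01, costs, spare_nodes=1))
```

Under this toy model, a confident failure warning favors migration when a spare node is available and falls back to checkpointing otherwise, while a low failure probability makes skipping the cheapest choice. That mirrors the combination of preventive migration and selective checkpointing described in the abstract, but the cost formulas themselves are assumptions.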
Zhiling Lan, Yawei Li
Added 15 Dec 2010
Updated 15 Dec 2010
Type Journal
Year 2008
Where TC
Authors Zhiling Lan, Yawei Li