CCGRID
2006
IEEE

Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing

As cluster computing scales up, long-running applications find it increasingly hard to complete on large-scale clusters without encountering failures. Checkpointing/restart is widely used to provide basic fault tolerance, yet it incurs high overhead and is purely reactive. In this work, we propose FT-Pro, an adaptive fault management mechanism that uses failure prediction to optimally choose among migration, checkpointing, or no action, so as to reduce application execution time in the presence of failures. A cost-based evaluation model drives this decision dynamically at run-time. Using an actual failure log from a production cluster at NCSA, we demonstrate that even with modest failure prediction accuracy, FT-Pro outperforms the traditional checkpointing/restart strategy by 13%-30% in application execution time despite failures, a significant improvement for long-running applications.
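The abstract does not spell out the cost model, so the following is only an illustrative sketch of what a cost-based choice among the three actions could look like. It is not FT-Pro's actual model: the function name `choose_action`, its parameters, and the expected-time formulas are all assumptions made for illustration, including the simplifications that a failure costs the whole interval plus any unsaved work, and that migration fully avoids the predicted failure.

```python
def choose_action(p_fail, interval, unsaved, c_ckp, c_mig, restart):
    """Pick the action with the lowest expected time for the next interval.

    Hypothetical, simplified model (NOT the paper's):
      p_fail   - predicted probability of a failure in the next interval
      interval - length of the next interval of useful work
      unsaved  - work done since the last checkpoint (lost on failure)
      c_ckp    - overhead of taking a checkpoint now
      c_mig    - overhead of migrating to a healthy node now
      restart  - cost of restarting after a failure
    All quantities share one time unit.
    """
    expected = {
        # Do nothing: on failure, restart and redo unsaved work plus the interval.
        "none": interval + p_fail * (restart + unsaved + interval),
        # Checkpoint first: on failure, only the interval itself is redone.
        "checkpoint": c_ckp + interval + p_fail * (restart + interval),
        # Migrate first: assumed here to avoid the predicted failure entirely.
        "migrate": c_mig + interval,
    }
    action = min(expected, key=expected.get)
    return action, expected


if __name__ == "__main__":
    action, costs = choose_action(
        p_fail=0.05, interval=100, unsaved=500, c_ckp=10, c_mig=30, restart=20
    )
    print(action, costs)
```

Under these assumptions, a low predicted failure probability favors no action, a large amount of unsaved work favors checkpointing, and a high predicted probability favors migration when its overhead is moderate.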
Yawei Li, Zhiling Lan
Added 10 Jun 2010
Updated 10 Jun 2010
Type Conference
Year 2006
Where CCGRID