Sciweavers

BERTINORO
2005
Springer
13 years 10 months ago
Prediction-Based Software Availability Enhancement
We propose a new paradigm for software availability enhancement. We offer a two-step strategy: Failure prediction followed by maintenance actions with the objective of avoiding imp...
Felix Salfner, Günther A. Hoffmann, Miroslaw ...
IPPS
2005
IEEE
13 years 10 months ago
Proactive Fault Handling for System Availability Enhancement
Proactive fault handling combines prevention and repair actions with failure prediction techniques. We extend the standard availability formula by five key measures: (1) precisio...
Felix Salfner, Miroslaw Malek
DSN
2006
IEEE
13 years 11 months ago
BlueGene/L Failure Analysis and Prediction Models
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM’s BlueGene/L which can acc...
Yinglung Liang, Yanyong Zhang, Anand Sivasubramani...
CCGRID
2006
IEEE
13 years 11 months ago
Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing
As the scale of cluster computing grows, it is becoming hard for long-running applications to complete without facing failures on large-scale clusters. To address this issue, chec...
Yawei Li, Zhiling Lan
SRDS
2007
IEEE
13 years 11 months ago
Using Hidden Semi-Markov Models for Effective Online Failure Prediction
Proactive handling of faults requires that the risk of upcoming failures be continuously assessed. One of the promising approaches is online failure prediction, which means that...
Felix Salfner, Miroslaw Malek
ICPP
2007
IEEE
13 years 11 months ago
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
The productivity of HPC systems is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
Yawei Li, Prashasta Gujrati, Zhiling Lan, Xian-He ...
ANSS
2007
IEEE
13 years 11 months ago
Failure Prediction in Computational Grids
Accurate failure prediction in Grids is critical for reasoning about QoS guarantees such as job completion time and availability. Statistical methods can be used but they suffer f...
Woochul Kang, Andrew S. Grimshaw
ICPP
2008
IEEE
13 years 11 months ago
Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study
Despite great efforts on the design of ultra-reliable components, the increase of system size and complexity has outpaced the improvement of component reliability. As a result, fa...
Jiexing Gu, Ziming Zheng, Zhiling Lan, John White,...
DSN
2009
IEEE
13 years 11 months ago
System log pre-processing to improve failure prediction
Log preprocessing, a process applied to the raw log before applying a predictive method, is of paramount importance to failure prediction and diagnosis. While existing filtering ...
Ziming Zheng, Zhiling Lan, Byung-Hoon Park, Al Gei...