IPPS
2007
IEEE

Tiresias: Black-Box Failure Prediction in Distributed Systems

Faults in distributed systems can result in errors that manifest in several ways, potentially even in parts of the system that are not collocated with the root cause. These manifestations often appear as deviations (or “errors”) in performance metrics. By transparently gathering various node-level and system-level performance metrics, and then identifying escalating anomalous behavior in them, the Tiresias system makes black-box failure prediction possible. Through trend analysis of these metrics, Tiresias provides a window of opportunity (look-ahead time) for system recovery prior to impending crash failures. We empirically validate the heuristic rules of the Tiresias system by analyzing fault-free and faulty performance data from a replicated middleware-based system.
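The abstract does not spell out the heuristic rules Tiresias uses, so the following is only an illustrative sketch of the general idea of flagging escalating anomalies in a single performance metric. It assumes a fault-free baseline is available, marks samples that stray more than k standard deviations from it, and reports escalation when the anomaly count in the most recent window exceeds the count in the preceding window. The function names, the threshold k, the window size, and the sample values are all hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def anomaly_flags(baseline, samples, k=3.0):
    """Mark each sample that deviates more than k standard deviations
    from the mean of the fault-free baseline (hypothetical rule)."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [abs(x - mu) > k * sigma for x in samples]


def is_escalating(flags, window=5, min_rise=2):
    """Return True when the latest window holds at least min_rise more
    anomalies than the window immediately before it (hypothetical rule)."""
    if len(flags) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(flags[-window:])
    previous = sum(flags[-2 * window:-window])
    return recent - previous >= min_rise


# Made-up latency readings (ms) for illustration only.
fault_free = [10, 11, 9, 10, 12, 10, 11, 9]
degrading = [10, 11, 10, 9, 11, 10, 14, 12, 15, 18]
steady = [10, 11, 10, 9, 11, 10, 9, 11, 10, 12]

degrading_alarm = is_escalating(anomaly_flags(fault_free, degrading))
steady_alarm = is_escalating(anomaly_flags(fault_free, steady))
```

In this sketch, the time between the first escalation alarm and the eventual crash would correspond to the look-ahead time the abstract mentions. A real predictor would combine many node-level and system-level metrics rather than a single series.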
Added 03 Jun 2010
Updated 03 Jun 2010
Type Conference
Year 2007
Where IPPS
Authors Andrew W. Williams, Soila M. Pertet, Priya Narasimhan