Anomaly localization in large-scale clusters

A critical problem in managing large-scale clusters is identifying the location of problems in the system when unusual events occur. As high performance computing (HPC) scales up, systems keep growing in size. When a system fails to function properly, health-related data are collected for troubleshooting. However, due to the massive quantities of information obtained from a large number of components, the root causes of anomalies are often buried like needles in a haystack. In this paper, we present a localization method that automatically identifies the potential root causes (i.e., a subset of nodes) of a problem from the overwhelming amount of data collected system-wide. System managers can focus on examining these potential locations, significantly reducing the human effort required for anomaly localization. Our method consists of three interrelated steps: (1) feature collection to assemble a feature space for the system; (2) feature extraction to obtain the most...
Ziming Zheng, Yawei Li, Zhiling Lan
Added 02 Jun 2010
Updated 02 Jun 2010
Type Conference
Year 2007
Authors Ziming Zheng, Yawei Li, Zhiling Lan