BlueGene/L Failure Analysis and Prediction Models

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM’s BlueGene/L, which can accommodate as many as 128K processors. One of the challenges in designing and deploying these systems in a production setting is the need to account for failures, whether in the hardware or in the software. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective for emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as w...
Added 11 Jun 2010
Updated 11 Jun 2010
Type Conference
Year 2006
Where DSN
Authors Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Morris Jette, Ramendra K. Sahoo