Identifying Failures in Grids through Monitoring and Ranking

In this paper we present FailRank, a novel framework for integrating and ranking information sources that characterize failures in a grid system. Once the failing sites have been ranked, they can be removed from the job scheduling resource pool, yielding a more predictable, dependable and adaptive infrastructure. We also present the tools we developed to evaluate the FailRank framework. In particular, we present the FailBase Repository, a 38 GB corpus of state information that characterizes the EGEE Grid for one month in 2007. Such a corpus paves the way for the community to systematically uncover previously unknown patterns and rules among the multitude of parameters that can contribute to failures in a Grid environment. Additionally, we present an experimental evaluation of the FailRank system over 30 days, which shows that our framework identifies failures in 93% of the cases. We believe that our work constitutes another important step to...
Added 01 Jun 2010
Updated 01 Jun 2010
Type Conference
Year 2008
Where NCA
Authors Demetrios Zeinalipour-Yazti, Kyriakos Neocleous, Chryssis Georgiou, Marios D. Dikaiakos