Sciweavers

ESCIENCE
2006
IEEE

Job Failure Analysis and Its Implications in a Large-Scale Production Grid

In this paper we present an initial analysis of job failures in a large-scale data-intensive Grid. Based on three representative periods in production, we characterize the interarrival times and life spans of failed jobs. Different failure types are distinguished, and the analysis is carried further at the Virtual Organization (VO) level. The spatial behavior, namely where job failures occur in the Grid, is also examined. Cross-correlation structures, including how arrivals correlate with the life spans of failed jobs, are analyzed and illustrated. We further investigate statistical models to fit the failure data and propose several failure-aware scheduling strategies at the Grid level. Our results show that the overall failure rates in the Grid are quite significant, ranging from 25% to 33% of all submitted jobs. However, only 5% to 8% of the jobs failed after running on a certain Computing Element (CE). The rest of the failed jobs are aborted or cancelled without running. A majority of...
Added: 11 Jun 2010
Updated: 11 Jun 2010
Type: Conference
Year: 2006
Where: ESCIENCE
Authors: Hui Li, David L. Groep, Lex Wolters, Jeffrey Templon