Sciweavers

IPPS
2005
IEEE
Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...
Adam J. Oliner, Ramendra K. Sahoo, José E. ...
ESCIENCE
2006
IEEE
Job Failure Analysis and Its Implications in a Large-Scale Production Grid
In this paper we present an initial analysis of job failures in a large-scale data-intensive Grid. Based on three representative periods in production, we characterize the interar...
Hui Li, David L. Groep, Lex Wolters, Jeffrey Templ...