Sciweavers

JSSPP
2004
Springer

Performance Implications of Failures in Large-Scale Cluster Scheduling

As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those systems to anticipate and accommodate the occurrence of failures. Failures become a commonplace feature of such large-scale systems, and one cannot continue to treat them as an exception. Despite the current and increasing importance of failures in these systems, our understanding of their performance impact on parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters, and we then exploit this framework to conduct a detailed analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that such failures can have a significant impact on the mean job response time and mean job slowdown under existing scheduling ...
Added 02 Jul 2010
Updated 02 Jul 2010
Type Conference
Year 2004
Where JSSPP
Authors Yanyong Zhang, Mark S. Squillante, Anand Sivasubramaniam, Ramendra K. Sahoo