Performance Implications of Failures in Large-Scale Cluster Scheduling


As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those systems anticipating and accommodating the occurrence of failures. Failures become a commonplace feature of such large-scale systems, and one cannot continue to treat them as an exception. Despite the current and increasing importance of failures in these systems, our understanding of the performance impact of these critical issues on parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters and then we exploit this framework to conduct a detailed performance analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that such failures can have a significant impact on the mean job response time and mean job slowdown under existing scheduling ...

Yanyong Zhang, Mark S. Squillante, Anand Sivasubramaniam, Ramendra K. Sahoo


JSSPP 2004 | Large-scale Parallel Systems | Mean Job | Performance


» A Failure-Aware Scheduling Strategy in Large-Scale Cluster System

» Job Failure Analysis and Its Implications in a Large-Scale Production Grid

» A Hybrid Real-Time Scheduling Approach for Large-Scale Multicore Platforms

» Monitoring and Debugging Parallel Software with BCS-MPI on Large-Scale Clusters

» Network coding for large scale content distribution

» Performance Analysis of Grid DAG Scheduling Algorithms using MONARC Simulation Tool

» Bicriteria Scheduling Algorithm with Deployment in Cluster

» Chameleon: A Resource Scheduler in A Data Grid Environment

Post Info

Added	02 Jul 2010
Updated	02 Jul 2010
Type	Conference
Year	2004
Where	JSSPP
Authors	Yanyong Zhang, Mark S. Squillante, Anand Sivasubramaniam, Ramendra K. Sahoo

