
MIDDLEWARE
2007
Springer

Using checkpointing to recover from poor multi-site parallel job scheduling decisions

Recent research in multi-site parallel job scheduling leverages user-provided estimates of job communication characteristics to effectively partition the job across multiple clusters. Previous research addressed the impact of inaccuracies in these estimates on overall system performance and found that multi-site scheduling techniques benefit from these estimates, even in the presence of considerable inaccuracy. While these results are encouraging, there are many instances where these errors result in poor scheduling decisions that cause network over-subscription. This situation can lead to significantly degraded application runtime performance and turnaround time. In this paper, we explore the use of job checkpointing to selectively stop offending jobs in order to alleviate network congestion and subsequently restart them when (and where) sufficient network resources are available. We then characterize the conditions and the extent to which checkpointing improves overall performan...
William M. Jones
Added: 08 Jun 2010
Updated: 08 Jun 2010
Type: Conference
Year: 2007
Where: MIDDLEWARE
Authors: William M. Jones