FALCON: a system for reliable checkpoint recovery in shared grid environments


Download cobweb.ecn.purdue.edu

In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation is tolerable. For guest users, free resources come at the cost of unpredictable “failures”, where failures are defined as disruption in the guest job’s execution due to contention from the processes of the machine owner or the conventionally understood hardware and software failures. These unpredictable failures lead to unpredictable completion times. Checkpoint-recovery has long been used for providing reliability in failure-prone computing environments. Today’s production FGCS systems, such as Condor, use expensive, high-performance dedicated checkpoint servers, even though they could take advantage of free disk resources offered by the clusters’ commodity machines. Also, in large, geographically distributed clusters, dedicated checkpoint servers may incur high checkpoint transfer latencies. In this paper we consider...

Tanzima Zerin Islam, Saurabh Bagchi, Rudolf Eigenmann


Applied Computing | Guest Jobs | SC 2009 | Shared Storage Hosts | Storage Hosts |


Added	19 May 2010
Updated	19 May 2010
Type	Conference
Year	2009
Where	SC
Authors	Tanzima Zerin Islam, Saurabh Bagchi, Rudolf Eigenmann
