Sciweavers

IPPS
2007
IEEE

DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems

In this paper, we present DejaVu, a new fault tolerance system for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications. DejaVu provides a transparent parallel checkpointing and recovery mechanism that recovers from any combination of system failures without any modification to parallel applications or the OS. It uses a new runtime mechanism for transparent incremental checkpointing that captures the least amount of state needed to maintain global consistency, and it provides a novel communication architecture that enables transparent migration of existing MPI codes without source-code modifications. Performance results from the production-ready implementation show less than 5% overhead in real-world parallel applications with large memory footprints.
Added 03 Jun 2010
Updated 03 Jun 2010
Type Conference
Year 2007
Where IPPS
Authors Joseph F. Ruscio, Michael A. Heffner, Srinidhi Varadarajan