Sciweavers

» On Coordinated Checkpointing in Distributed Systems
ICDCS
2008
IEEE
15 years 8 months ago
stdchk: A Checkpoint Storage System for Desktop Grid Computing
Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that...
Samer Al-Kiswany, Matei Ripeanu, Sudharshan S. Vaz...
EUROSYS
2009
ACM
15 years 11 months ago
Transparent checkpoints of closed distributed systems in Emulab
Emulab is a testbed for networked and distributed systems experimentation. Two guiding principles of its design are realism and control of experimentation. There is an inherent te...
Anton Burtsev, Prashanth Radhakrishnan, Mike Hible...
PPOPP
2003
ACM
15 years 7 months ago
Automated application-level checkpointing of MPI programs
Because of increasing hardware and software complexity, the running time of many computational science applications is now more than the mean-time-to-failure of high-performance com...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
IPPS
2007
IEEE
15 years 8 months ago
DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications....
Joseph F. Ruscio, Michael A. Heffner, Srinidhi Var...