Sciweavers

1256 search results - page 6 / 252
» On Coordinated Checkpointing in Distributed Systems
ICDCS
2008
IEEE
15 years 6 months ago
stdchk: A Checkpoint Storage System for Desktop Grid Computing
Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that...
Samer Al-Kiswany, Matei Ripeanu, Sudharshan S. Vaz...
EUROSYS
2009
ACM
15 years 8 months ago
Transparent checkpoints of closed distributed systems in Emulab
Emulab is a testbed for networked and distributed systems experimentation. Two guiding principles of its design are realism and control of experimentation. There is an inherent te...
Anton Burtsev, Prashanth Radhakrishnan, Mike Hible...
PPOPP
2003
ACM
15 years 5 months ago
Automated application-level checkpointing of MPI programs
Because of increasing hardware and software complexity, the running time of many computational science applications is now more than the mean-time-to-failure of high-performance com...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
ICPP
2007
IEEE
15 years 6 months ago
Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand
Qi Gao, Wei Huang, Matthew J. Koop, Dhabaleswar K....
IPPS
2007
IEEE
15 years 6 months ago
DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications....
Joseph F. Ruscio, Michael A. Heffner, Srinidhi Var...