
ICDCS
2000
IEEE

Coherence-based Coordinated Checkpointing for Software Distributed Shared Memory Systems

Fault-tolerance techniques that can cope with system failures in software distributed shared memory (SDSM) are essential for creating productive and highly available parallel computing environments on clusters of workstations. In this paper, we propose a new, efficient coordinated checkpointing technique for SDSM, called coherence-based coordinated checkpointing (CCC). CCC minimizes both the checkpointing overhead during failure-free execution and the cost of recovery from failures by leveraging the coherence information that SDSM already maintains. When a system failure occurs, it allows SDSM to recover from the most recent checkpoint, which reduces re-computation time. We ran experiments on a cluster of eight Sun Ultra-5 workstations, comparing CCC against both simple coordinated checkpointing (SCC) and incremental coordinated checkpointing (ICC) by implementing all three techniques in TreadMarks, a state-of-the-art SDSM system. The exper...
Added 31 Jul 2010
Updated 31 Jul 2010
Type Conference
Year 2000
Where ICDCS
Authors Angkul Kongmunvattana, Santipong Tanchatchawal, Nian-Feng Tzeng