SC 2000, ACM

Scalable Fault-Tolerant Distributed Shared Memory

This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a home-based lazy release consistency (HLRC) DSM system with independent checkpointing and logging to volatile memory, targeting shared-memory computing on very large LAN-based clusters. In these environments, where global coordination may be expensive, independent checkpointing becomes critical to scalability. However, independent checkpointing is only practical if we can control the size of the log and checkpoints in the absence of global coordination. In this paper we describe the design of our fault-tolerant DSM system and present our solutions to the problems of checkpoint and log management. We also present experimental results showing that our fault tolerance support is light-weight, adding only low messaging, logging and checkpointing overheads, and that our management algorithms can be expected to eff...
Florin Sultan, Thu D. Nguyen, Liviu Iftode
Added: 01 Aug 2010
Updated: 01 Aug 2010
Type: Conference
Year: 2000
Where: SC
Authors: Florin Sultan, Thu D. Nguyen, Liviu Iftode