SRDS
1994
IEEE

Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers

Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determined only by the reliability requirements of the application. Efficient adaptation of this approach to DSM multicomputers is complicated by the absence of explicit messages in DSM systems, the presence of a shared and partially replicated address space, and the presence of a distributed coherency directory. We present solutions to these issues, and propose an error recovery scheme based on coordinated checkpointing and rollback for DSM multicomputers. Our performance evaluation based on trace-driven simulations indicates that this scheme incurs less checkpoint traffic than recovery schemes previou...
G. Janakiraman, Yuval Tamir
Added 09 Aug 2010
Updated 09 Aug 2010
Type Conference
Year 1994
Where SRDS
Authors G. Janakiraman, Yuval Tamir