Sciweavers

ISCA
2011
IEEE

Rebound: scalable checkpointing for coherent shared memory

12 years 8 months ago
Rebound: scalable checkpointing for coherent shared memory
As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distr...
Rishi Agarwal, Pranav Garg, Josep Torrellas
Added 21 Aug 2011
Updated 21 Aug 2011
Type Journal
Year 2011
Where ISCA
Authors Rishi Agarwal, Pranav Garg, Josep Torrellas
Comments (0)