As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include ...
We present design and implementation details as well as performance results for two new parallel checkpointing libraries developed by us for parallel MPI applications. The first o...
This paper explores hardware-implemented error-detection and security mechanisms embedded as modules in a hardware-level framework called the Reliability and Security Engine (RSE)...
Nithin Nakka, Zbigniew Kalbarczyk, Ravishankar K. ...
Deterministic replay systems record and reproduce the execution of a hardware or software system. In contrast to replaying execution on uniprocessors, deterministic replay on mult...
Kaushik Veeraraghavan, Dongyoon Lee, Benjamin West...