— Medard, Finn, Barry and Gallager proposed an elegant recovery scheme (known as the MFBG scheme) using redundant trees. Xue, Chen and Thulasiraman extended the MFBG scheme and i...
— Fault tolerance in MPI becomes a main issue in the HPC community. Several approaches are envisioned from user or programmer controlled fault tolerance to fully automatic fault ...
Aurelien Bouteiller, Boris Collin, Thomas Hé...
With the latest high-end computing nodes combining shared-memory multiprocessing with hardware multithreading, new scheduling policies are necessary for workloads consisting of mu...
Robert L. McGregor, Christos D. Antonopoulos, Dimi...
In this paper, our contributions are two-fold: First, we enhance the Min-Min and Sufferage heuristics under three risk modes driven by security concerns. Second, we propose a new ...
As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have prim...
George A. Reis, Jonathan Chang, Neil Vachharajani,...