Commodity computer clusters are often composed of hundreds of computing nodes. These generally off-the-shelf systems are not designed for high reliability. Node failures therefore...
Supercomputing systems must be able to reliably and efficiently complete their assigned workloads, even in the presence of failures. This paper proposes a system that allows the ...
Adam J. Oliner, Larry Rudolph, Ramendra K. Sahoo, ...
We consider the problem of recovering from failures of distributable threads (“threads”) in distributed realtime systems that operate under run-time uncertainties including th...
Edward Curley, Binoy Ravindran, Jonathan Stephen A...
Round-Robin scheduling is the most popular time triggered scheduling policy, and has been widely used in communication networks for the last decades. It is an efficient schedulin...
Razvan Racu, Li Li, Rafik Henia, Arne Hamann, Rolf...
Commodity file systems trust disks to either work or fail completely, yet modern disks exhibit more complex failure modes. We suggest a new fail-partial failure model for disks, ...
Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, N...