With the increasing number of processors in modern HPC(High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
—Clusters and applications continue to grow in size while their mean time between failure (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for lar...
We present a formal approach to implement and certify fault-tolerance in real-time embedded systems. The faultintolerant initial system consists of a set of independent periodic t...
Several optimizations to the Time Warp synchronization protocol for parallel discrete event simulation have been proposed and studied. Many of these optimizations have included so...
Radharamanan Radhakrishnan, Lantz Moore, Philip A....
Recent gains in energy-efficiency of new-generation NAND flash storage have strengthened the case for in-network storage by data-centric sensor network applications. This paper ...
Gaurav Mathur, Peter Desnoyers, Deepak Ganesan, Pr...