In this paper, we argue that the reliability of large-scale storage systems can be significantly improved by using better reliability metrics and more efficient policies for rec...
Execution of MPI applications on Clusters and Grid deployments suffers from node and network failure that motivates the use of fault tolerant MPI implementations. Two category tec...
—Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, ...
Critical components of a distributed system must be replicated to achieve high availability and fault tolerance. Current faulttolerant CORBA infrastructures have concentrated on m...
Wenbing Zhao, Louise E. Moser, P. M. Melliar-Smith
Modern shared-memory multiprocessors use complex memory system implementations that include a variety of non-trivial and interacting optimizations. More time is spent in verifying...
Manoj Plakal, Daniel J. Sorin, Anne Condon, Mark D...