Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all computers are up and running, we would like the load to be evenly distrib...
This paper presents an approach for generating test data for unit-level, and possibly integration-level, testing based on sampling over intervals of the input probability distribu...
In this study, we investigate different cache fault tolerance techniques to determine which will be most effective when on-chip memory cell defect probabilities exceed those of cu...
Diagnosing faults in an operational computer network is a frustrating, time-consuming exercise. Despite advances, automatic diagnostic tools are far from perfect: they occasionall...
A challenging issue in today's server systems is to transparently deal with failures and application-imposed requirements for continuous operation. In this paper we address t...