The threat of soft-error-induced system failure in high-performance computing systems has become more prominent as we adopt ultra-deep submicron process technologies. In this pap...
As we approach nation-wide integration of computer systems, it is clear that file replication will play a key role, both to improve data availability in the face of failures, and to...
Richard G. Guy, John S. Heidemann, Wai-Kei Mak, Th...
We present a tool for the analysis of fault-tolerance in packet-switched communication networks. Network elements like links or routers can fail or unexpected traffic surges may o...
David Hock, Michael Menth, Matthias Hartmann, Chri...
This paper describes a fault detection mechanism that uses the error codes returned by the stream sockets to locate process failures. Since these errors are generated automaticall...
In the asynchronous distributed system model, consensus is obtained in one communication step if all processes propose the same value. Assuming f < n/3, this is regardless of t...