In this paper the problem of fault-tolerant message routing in two-dimensional meshes, with each inner node having 4 neighbors, is investigated. It is assumed that some nodes/links...
In this paper, we propose a fault-tolerant mechanism for microprocessors, which detects transient faults and recovers from them. There are two driving force to investigate fault-t...
The UpRight library seeks to make Byzantine fault tolerance (BFT) a simple and viable alternative to crash fault tolerance for a range of cluster services. We demonstrate UpRight ...
Allen Clement, Manos Kapritsos, Sangmin Lee, Yang ...
The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For l...
Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection a...
Pierre Lemarinier, Aurelien Bouteiller, Thomas H&e...