This paper investigates fault tolerance issues in Bistro, a wide area upload architecture. In Bistro, clients first upload their data to intermediaries, known as bistros. A destin...
: In a decentralised system the problems of fault tolerance, and in particular error recovery, vary greatly depending on the design assumptions. For example, in a distributed datab...
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in ScaLAPACK matrix ...
—Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. S...
The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For l...