Fault-tolerant mobile computing systems have requirements and restrictions that conventional distributed systems do not take into account. This paper presents a coordinate...
Several schemes for checkpointing and rollback recovery have been reported in the literature. In this paper, we analyze some of these schemes under a stochastic model. We have der...
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At...
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, ...
The scalability of future massively parallel processing (MPP) systems is being severely challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in ov...
Xiangyu Dong, Naveen Muralimanohar, Norman P. Joup...