Abstract. A grid checkpointing service providing migration and transparent fault tolerance is important for distributed and parallel applications executed in heterogeneous grids. I...
Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, rollback all ...
: As Century 21 just opened up, it is a fitting time to reflect on the evolution of the fault-tolerant distributed computing technology that occurred in the last century. The autho...
Long running applications often need to adapt due to changing requirements or changing environment. Typically, such adaptation is performed by dynamically adding or removing compo...
A fault-scalable service can be configured to tolerate increasing numbers of faults without significant decreases in performance. The Query/Update (Q/U) protocol is a new tool t...
Michael Abd-El-Malek, Gregory R. Ganger, Garth R. ...