Abstract. With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas H&eacu...
Object-based checkpoints are consistent in the object-based system but may be inconsistent according to the traditional message-based definition. We present a protocol for taking ...
In many network applications the computation takes place on the minimum-cost spanning tree (MST) of the network; unfortunately, a single link or node failure disconnects the tree. ...
Paola Flocchini, Toni Mesa Enriquez, Linda Pagli, ...
The perfectly synchronized round model provides the abstraction of crash-stop failures with atomic message delivery. This abstraction makes distributed programming very easy. We p...
The advent of high-performance networks in conjunction with low-cost, powerful computational engines has made possible the development of a new set of technologies termed computat...