Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current t...
Arun Babu Nagarajan, Frank Mueller, Christian Enge...
This paper explores the challenges associated with distributed application management in large-scale computing environments. In particular, we investigate several techniques for e...
Nikolay Topilski, Jeannie R. Albrecht, Amin Vahdat
In this paper, we address the new problem of protecting volunteer computing systems from malicious volunteers who submit erroneous results by presenting sabotage-tolerance mechanis...
This paper develops some control structures suitable for composing fault-tolerant distributed applications using atomic actions (atomic transactions) as building blocks, and then...
The idle computers on a local area, campus area, or even wide area network represent a significant computational resource--one that is, however, also unreliable, heterogeneous, an...