: The distributed recovery block (DRB) scheme is a widely applicable approach for realizing both hardware and software fault tolerance in real-time distributed and parallel compute...
This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in ...
The idle computers on a local area, campus area, or even wide area network represent a significant computational resource--one that is, however, also unreliable, heterogeneous, an...
A challenging issue in today's server systems is to transparently deal with failures and application-imposed requirements for continuous operation. In this paper we address t...
Real-time statistics on smart meter consumption data must preserve consumer privacy and tolerate smart meter failures. Existing protocols for this private distributed aggregation m...