Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not c...
In this paper, we present a structure for monitoring a large set of computational clusters. We illustrate methods for scaling a monitor network composed of many clusters while ke...
Federico D. Sacerdoti, Mason J. Katz, Matthew L. M...
Wide-area parallel processing systems will soon be available to researchers to solve a range of problems. In these systems, it is certain that host failures and other faults will ...
The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and co...
Paul Stelling, Ian T. Foster, Carl Kesselman, Crai...
abstract over a complex set of resources and provide a high-level way to share and manage them over the network. To be effective, such a system must address the challenges posed by...
Andrew S. Grimshaw, Adam Ferrari, Frederick Knabe,...