Sciweavers

ICPP
2000
IEEE

A Problem-Specific Fault-Tolerance Mechanism for Asynchronous, Distributed Systems

The idle computers on a local area, campus area, or even wide area network represent a significant computational resource--one that is, however, also unreliable, heterogeneous, and opportunistic. We describe an algorithm that allows branch-and-bound problems to be solved in such environments. In designing this algorithm, we faced two challenges: (1) scalability, to effectively exploit the variably sized pools of resources available, and (2) fault tolerance, to ensure the reliability of services. We achieve scalability through a fully decentralized algorithm, in which the dynamically available resources are managed through a membership protocol. We guarantee fault tolerance in the sense that the loss of up to all but one resource will not affect the quality of the solution. For propagating information reliably, we use epidemic communication for both the membership protocol and the fault-tolerance mechanism. We have developed a simulation framework that allows us to evaluate design alter...
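To make the idea of epidemic communication concrete, here is a minimal Python sketch of push-style gossip in which nodes spread the smallest bound any of them has found, as a branch-and-bound solver might share its best-known solution. It is an illustration under stated assumptions, not the authors' algorithm: the function name `gossip_min_bound`, the `fanout` parameter, the synchronous rounds, and the failure model (crashed nodes are silent from the start) are all hypothetical simplifications of the asynchronous setting the abstract describes.

```python
import random


def gossip_min_bound(bounds, crashed=frozenset(), fanout=1,
                     max_rounds=10_000, seed=0):
    """Push-style epidemic spread of the smallest known bound.

    bounds  : initial best-known bound held by each node (index = node id)
    crashed : ids of nodes that never send or receive
              (hypothetical failure model, assumed for this sketch)
    fanout  : number of random peers each live node contacts per round
    Returns (final_bounds, rounds_used).
    """
    rng = random.Random(seed)
    n = len(bounds)
    state = list(bounds)
    live = [i for i in range(n) if i not in crashed]
    if not live:
        return state, 0

    # The best bound that survives among live nodes; used only to detect
    # convergence in this simulation, no real node would know it up front.
    target = min(state[i] for i in live)

    rounds = 0
    while rounds < max_rounds and any(state[i] != target for i in live):
        rounds += 1
        snapshot = list(state)  # messages carry values from the round start
        for sender in live:
            for _ in range(fanout):
                peer = rng.randrange(n)
                if peer == sender or peer in crashed:
                    continue  # message to self or to a crashed node is lost
                state[peer] = min(state[peer], snapshot[sender])
    return state, rounds


if __name__ == "__main__":
    final, rounds = gossip_min_bound([9, 4, 7, 3, 8, 6], crashed={1})
    print(final, rounds)
```

In this sketch no coordinator is involved: every live node ends up holding the best bound known to any live node, and a run with a single surviving node terminates immediately with that node's own bound. The number of rounds depends on the random peer choices, so it is not stated here.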
Adriana Iamnitchi, Ian T. Foster
Added 25 Aug 2010
Updated 25 Aug 2010
Type Conference
Year 2000
Where ICPP
Authors Adriana Iamnitchi, Ian T. Foster