Sciweavers

EUROPAR
2005
Springer

Faults in Large Distributed Systems and What We Can Do About Them

Scientists are increasingly using large distributed systems built from commodity off-the-shelf components to perform scientific computation. Grid computing has expanded the scale of such systems by spanning them across organizations. While such systems are cost-effective, the use of a large number of commodity components leads to high fault and failure rates. Some of these faults result in silent data corruption, leaving users with possibly incorrect results. In this work, we analyzed the faults and failures that occurred in Condor pools at UW-Madison with a few thousand CPUs and in two large distributed applications, US-CMS and BMRB BLAST, each of which used hundreds of thousands of CPU hours. We propose the ‘silent-fail-stutter’ fault model to correctly model silent failures and detail how to handle them. Based on the model, we have designed mechanisms that automatically detect and handle silent failures and ensure that users get correct results. Our mechanisms perform automated ...
George Kola, Tevfik Kosar, Miron Livny
Added 27 Jun 2010
Updated 27 Jun 2010
Type Conference
Year 2005
Where EUROPAR
Authors George Kola, Tevfik Kosar, Miron Livny