Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which fai...
In this paper we present FailRank, a novel framework for integrating and ranking information sources that characterize failures in a grid system. After the failing sites have been...
Border Gateway Protocol (BGP) is the standard routing protocol used in the Internet for routing packets between the Autonomous Systems (ASes). It is known that BGP can take hundre...
This paper shows that, in an environment where we do not bound the number of faulty processes, the class P of Perfect failure detectors is the weakest (among realistic failure det...
We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the syst...
Mike Y. Chen, Anthony Accardi, Emre Kiciman, David...