In this paper we present recent developments in P2P-MPI, a grid middleware, concerning fault management, which covers fault tolerance for applications and fault detect...
The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and co...
Paul Stelling, Ian T. Foster, Carl Kesselman, Crai...
Self-healing, i.e. the capability of a system to autonomously detect failures and recover from them, is a very attractive property that may enable large-scale software sys...
Replication is a technique commonly used to increase the availability of services in distributed systems, including grid and web services. While replication is relatively easy for...
Xianan Zhang, Flavio Junqueira, Matti A. Hiltunen,...
In this paper, we explore techniques to detect Byzantine server failures in asynchronous replicated data services. Our goal is to detect arbitrary failures of data servers in a s...
Dahlia Malkhi, Michael K. Reiter, Avishai Wool, Re...