Abstract. Current solutions for fault-tolerance in HPC systems focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration change...
We investigate the problem of detecting termination of a distributed computation in systems where processes can fail by crashing. Specifically, when the communication topology is ...
Neeraj Mittal, Felix C. Freiling, Subbarayan Venka...
Fault tolerance is now a primary design constraint for all major microprocessors. One step in determining a processor’s compliance to its failure rate target is measuring the Ar...
Abstract. Designing and programming dependable distributed applications is very difficult. Databases provide features like transactions and replication that can help in the impleme...
Peer-to-peer systems enable efficient resource aggregation and are inherently scalable since they do not depend on any centralized authority. However, lack of a centralized autho...