Efficient intra-node shared memory communication is important for High Performance Computing (HPC), especially with the emergence of multi-core architectures. As clusters continue ...
Agent-oriented software promises improvements, especially for the design of distributed systems. But currently, there is a substantial gap between the massive number of publica...
Recent and future parallel clusters and supercomputers use SMPs and multi-core processors as basic nodes, providing a huge amount of parallel resources. These systems often have h...
Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which fai...
Current solutions for fault-tolerance in HPC systems focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration change...