A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant progr...
Typical MPI applications work in phases of computation and communication, and messages are exchanged in relatively small chunks. This behavior is not optimal for TCP because TCP i...
— Fault tolerance in MPI becomes a main issue in the HPC community. Several approaches are envisioned from user or programmer controlled fault tolerance to fully automatic fault ...
Aurelien Bouteiller, Boris Collin, Thomas Hé...
Abstract. VolpexMPI is an MPI library designed for volunteer computing environments. In order to cope with the fundamental unreliability of these environments, VolpexMPI deploys tw...
Abstract. The ability to offload functionality to a programmable network interface is appealing, both for increasing message passing performance and for reducing the overhead on t...