This paper proposes a novel way to use virtual memorymapped communication (VMMC) to reduce the failover time on clusters. With the VMMC model, applications’ virtual address spac...
Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection a...
Pierre Lemarinier, Aurelien Bouteiller, Thomas H&e...
Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which fai...
PULC is a user-level communication library for workstation clusters. PULC provides a multi-user, multi-programming communication library for user level communication on top of high...
Joachim M. Blum, Thomas M. Warschko, Walter F. Tic...
—Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. S...