Abstract. This paper proposes a kernel-to-kernel communication system for use in cluster computers. It is implemented directly on the Ethernet data link layer. This allows use of E...
Abstract. With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas Hé...
A problem with many distributed applications is their behavior in the face of unpredictable variations in user request volumes or in available resources. This paper explores a performa...
Abstract—Performance is a key feature of large-scale computing systems. However, the achieved performance when a certain program is executed is significantly lower than the maxi...
Abstract. Clusters of loosely connected machines are becoming an important model for commercial computing. The cost/performance ratio makes these scale-out solutions an attractive ...
Robert W. Wisniewski, Reza Azimi, Mathieu Desnoyer...