Mega grids span several continents and may consist of millions of nodes and billions of tasks executing at any point in time. This setup calls for scalable and highly available re...
The widespread use of tracking and localization systems may be hindered by centralized server platforms whose performance can hardly scale up to the needs of very large numbers ...
Marco Picone, Michele Amoretti, Francesco Zanichel...
Abstract. With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas Hé...
To provide high dependability in a multithreaded system despite hardware faults, the system must detect and correct errors in its shared memory system. Recent research has explore...
This paper describes an algorithm that allows Linux to perform multilevel load balancing in NUMA computers. The Linux scheduler implements a load balancing algorithm that uses str...