Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current t...
Arun Babu Nagarajan, Frank Mueller, Christian Enge...
VT-ASOS is a framework for holistic and continuous customization of system software on HPC systems. The framework leverages paravirtualization technology. VT-ASOS extends the Xen ...
Dimitrios S. Nikolopoulos, Godmar Back, Jyotirmaya...
There has recently been increasing interest in using system virtualization to improve the dependability of HPC cluster systems. However, it is not cost-free and may come with som...
Haibo Chen, Rong Chen, Fengzhe Zhang, Binyu Zang, ...
Fault tolerance is a major concern for guaranteeing the availability of critical services as well as application execution. Traditional approaches for fault tolerance include checkpoint/r...
System-level virtualization has been a research topic since the 1970s but has regained popularity during the past few years because of the availability of efficient solutions such as...