Sciweavers

ICS
2007
Tsinghua U.

Proactive fault tolerance for HPC with Xen virtualization

Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today’s systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from “unhealthy” nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen’s live migration mechanism for a guest operating system (OS) to migrate an MP...
Added 08 Jun 2010
Updated 08 Jun 2010
Type Conference
Year 2007
Where ICS
Authors Arun Babu Nagarajan, Frank Mueller, Christian Engelmann, Stephen L. Scott