HPDC
2009
IEEE

Interconnect agnostic checkpoint/restart in Open MPI

Long-running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Transparent checkpoint/restart fault tolerance at the Message Passing Interface (MPI) level is an appealing option for HPC application developers who do not wish to restructure their code. Historically, MPI implementations that provided this option have struggled to support a full range of interconnects, especially shared memory. This paper presents a new approach for implementing checkpoint/restart coordination algorithms that allows the MPI implementation of checkpoint/restart to be interconnect agnostic. This approach allows an application to be checkpointed on one set of interconnects (e.g., InfiniBand and shared memory) and be restarted with a different set of interconnects (e.g., Myrinet and shared memory or Ethernet). By separating the network interconnect details from the checkpoint/restart coordination a...
Joshua Hursey, Timothy Mattox, Andrew Lumsdaine
Added 21 May 2010
Updated 21 May 2010
Type Conference
Year 2009
Where HPDC
Authors Joshua Hursey, Timothy Mattox, Andrew Lumsdaine