Sciweavers

IPPS
2009
IEEE

DMTCP: Transparent checkpointing for cluster computations and the desktop

13 years 11 months ago
DMTCP: Transparent checkpointing for cluster computations and the desktop
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads; as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores...
Jason Ansel, Kapil Arya, Gene Cooperman
Added 24 May 2010
Updated 24 May 2010
Type Conference
Year 2009
Where IPPS
Authors Jason Ansel, Kapil Arya, Gene Cooperman
Comments (0)