Sciweavers

CCGRID
2010
IEEE

Selective Recovery from Failures in a Task Parallel Programming Model

Abstract: We present a fault tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tracking mechanism. Compared with conventional checkpoint/restart techniques, this system offers a recovery penalty that is proportional to the degree of failure rather than the system size. We evaluate this system using the Self Consistent Field (SCF) kernel, which forms an important component in ab initio methods for computational chemistry. Experimental results indicate that fault tolerant task pools are robust in the presence of an arbitrary number of failures and that they offer low overhead in the absence of faults.

Keywords: Parallel processing, fault tolerance, task parallelism, Global Arrays, PGAS, selective recovery
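The idea of selective restart described in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the `Worker` class, the round-robin assignment, and all function names are hypothetical, and a real system would track completion across distributed memory (for example with Global Arrays) rather than in Python dictionaries. The point it shows is that after a failure only the tasks whose completion records were lost are re-executed, so recovery work scales with the amount of failure rather than with the total number of tasks.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker that stores results for the tasks it completed."""
    rank: int
    results: dict = field(default_factory=dict)  # task_id -> result
    alive: bool = True


def run_tasks(task_ids, workers, task_fn):
    """Run each task on a live worker (round-robin) and record completion there."""
    live = [w for w in workers if w.alive]
    for i, tid in enumerate(task_ids):
        live[i % len(live)].results[tid] = task_fn(tid)


def fail(worker):
    """Simulate a failure: the worker's results and completion records are lost."""
    worker.alive = False
    worker.results.clear()


def incomplete_tasks(all_task_ids, workers):
    """Return the tasks with no surviving completion record."""
    done = set()
    for w in workers:
        if w.alive:
            done.update(w.results)
    return [t for t in all_task_ids if t not in done]


def demo():
    workers = [Worker(rank) for rank in range(4)]
    tasks = list(range(12))

    def square(t):
        return t * t

    run_tasks(tasks, workers, square)        # initial execution of all 12 tasks
    fail(workers[1])                         # one of four workers fails
    lost = incomplete_tasks(tasks, workers)  # only that worker's tasks are missing
    run_tasks(lost, workers, square)         # selective restart of the lost tasks
    total = sum(sum(w.results.values()) for w in workers if w.alive)
    return lost, total


if __name__ == "__main__":
    lost, total = demo()
    print("re-executed tasks:", lost)
    print("sum of results:", total)
```

In this toy run, the failure of one of four workers causes three of the twelve tasks to be re-executed, whereas a full restart would repeat all twelve.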
Added 08 Nov 2010
Updated 08 Nov 2010
Type Conference
Year 2010
Where CCGRID
Authors James Dinan, Arjun Singri, P. Sadayappan, Sriram Krishnamoorthy