To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementati...
Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox,...
To achieve an efficient utilization of cluster systems, a proper programming and operating environment is required. In this context, mobile agents are of growing interes...
We present the concept of alternative functionality for improving dependability in distributed embedded systems. Alternative functionality is a mechanism that complements traditio...
For the emerging ambient environments, in which interconnected intelligent devices will surround us to increase the comfort of our lives, fault tolerance and security are of paramo...
We present a fault tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tr...
James Dinan, Arjun Singri, P. Sadayappan, Sriram K...