Large clusters, high-availability clusters and Grid deployments often suffer from network, node or operating system faults and thus require the use of fault tolerant programmin...
Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current t...
Arun Babu Nagarajan, Frank Mueller, Christian Enge...
In recent years, exciting technological advances have been made in the development of flexible electronics. These technologies offer the opportunity to weave computation, communicat...
Roozbeh Jafari, Foad Dabiri, Philip Brisk, Majid S...
Brokers are used in many multi-agent systems for locating agents, for routing and sharing information, for managing the system, and for legal purposes, as independent third partie...
Sanjeev Kumar, Philip R. Cohen, Hector J. Levesque
We present a fault-tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tr...
James Dinan, Arjun Singri, P. Sadayappan, Sriram K...