Key issues to address in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure sufficiently to know if and how to restart the job; an...
Charles Earl, Emilio Remolina, Jim Ong, John Brown
Fault-tolerance techniques based on checkpointing and message logging have been increasingly used in real-world applications to reduce service down-time. Most industrial applicati...
General-purpose middleware, by definition, cannot readily support domain-specific semantics without significant manual effort in specializing the middleware. This paper prese...
Sumant Tambe, Akshay Dabholkar, Aniruddha S. Gokha...
In many distributed computing systems that are prone to either induced or spontaneous node failures, the number of available computing resources is dynamically changing in a rando...
Sagar Dhakal, Majeed M. Hayat, Jorge E. Pezoa, Cha...
No cache-based techniques for roll-forward fault recovery exist at present. A split-cache approach is proposed that provides efficient support for checkpointing and roll-forward fau...