ICAC
2005
IEEE

Distributed Troubleshooting Agents

Autonomic job recovery for cluster computing must address three key issues: recognizing job failure; understanding the failure well enough to know whether and how to restart the job; and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The Agent Based High Availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster and grid computing environments. Its agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy's STAR project.
Charles Earl, Emilio Remolina, Jim Ong, John Brown
Added 24 Jun 2010
Updated 24 Jun 2010
Type Conference
Year 2005
Where ICAC