Key issues to address in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure sufficiently to know if and how to restart the job; an...
Charles Earl, Emilio Remolina, Jim Ong, John Brown
Many systems, such as large manufacturing systems, telecommunication networks, or home automation systems, require distributed monitoring and diagnosis. In this article, we introdu...
As grid computation systems become larger and more complex, manually diagnosing failures in jobs becomes impractical. Recently, machine-learning techniques have been proposed to d...
Traditional approaches for wireless sensor network diagnosis are mainly sink-based. They actively collect global evidence from sensor nodes to the sink so as to conduct central...
There is widespread interest today in developing tools that can diagnose the cause of a system failure accurately and efficiently based on monitoring data collected from the syst...