Present and future semiconductor technologies are characterized by increasing parameters variations as well as an increasing susceptibility to external disturbances. Transient err...
Abstract. Current solutions for fault-tolerance in HPC systems focus on dealing with the result of a failure. However, most are unable to handle runtime system configuration change...
In this paper we present a robust software infrastructure for metacomputing. The system is intended to be used by others as a building block for large and powerful computational g...
The advent of large scale multi-hop wireless networks highlights problems of fault tolerance and scale in distributed system, motivating designs that autonomously recover from tra...
The demand for an efficient fault tolerance system has led to the development of complex monitoring infrastructure, which in turn has created an overwhelming task of data and even...