Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover softw...
Stelios Sidiroglou, Oren Laadan, Carlos Perez, Nic...
Abstract. This paper introduces a new method for safety analysis called HiPHOPS (Hierarchically Performed Hazard Origin and Propagation Studies). HiP-HOPS originates from a number ...
In the eld of safety-critical real-time systems the development of distributed applications for fault tolerance reasons is a common practice. Hereby the whole application is divid...
The threat of soft error induced system failure in high performance computing systems has become more prominent, as we adopt ultra-deep submicron process technologies. In this pap...
The demand for an efficient fault tolerance system has led to the development of complex monitoring infrastructure, which in turn has created an overwhelming task of data and even...