Sciweavers

133 search results - page 18 / 27
» A language-driven tool for fault injection in distributed sy...
ASPLOS
2009
ACM
15 years 10 months ago
ASSURE: automatic software self-healing using rescue points
Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover softw...
Stelios Sidiroglou, Oren Laadan, Carlos Perez, Nic...
SAFECOMP
1999
Springer
15 years 1 months ago
Hierarchically Performed Hazard Origin and Propagation Studies
Abstract. This paper introduces a new method for safety analysis called HiP-HOPS (Hierarchically Performed Hazard Origin and Propagation Studies). HiP-HOPS originates from a number ...
Yiannis Papadopoulos, John A. McDermid
ECRTS
1999
IEEE
15 years 1 months ago
Cluster simulation-support for distributed development of hard real-time systems using TDMA-based communication
In the field of safety-critical real-time systems the development of distributed applications for fault tolerance reasons is a common practice. Hereby the whole application is divid...
Thomas M. Galla, Roman Pallierer
DSN
2009
IEEE
15 years 4 months ago
Low overhead Soft Error Mitigation techniques for high-performance and aggressive systems
The threat of soft error induced system failure in high performance computing systems has become more prominent, as we adopt ultra-deep submicron process technologies. In this pap...
Naga Durga Prasad Avirneni, Viswanathan Subramania...
CCGRID
2006
IEEE
15 years 1 months ago
IPMI-based Efficient Notification Framework for Large Scale Cluster Computing
The demand for an efficient fault tolerance system has led to the development of complex monitoring infrastructure, which in turn has created an overwhelming task of data and even...
Chokchai Leangsuksun, Tirumala Rao, Anand Tikoteka...