Proactive fault handling combines prevention and repair actions with failure prediction techniques. We extend the standard availability formula by five key measures: (1) precisio...
In state-of-the-art systems, workload scheduling and server fan speed operate independently, leading to cooling inefficiencies. In this work we propose GentleCool, a proactive m...
To improve the overall dependability of large-scale cluster systems, this paper proposes an online fault detection mechanism. This mechanism can detect the fault in time befor...
In this paper, we describe a system for sharing local safety knowledge within a community. Our system makes it possible to share detailed safety knowledge, which includes location an...
The uncontrolled propagation of faults due to malicious intrusion can severely decrease system performance and survivability. Our goal is to employ available information a...