What do our computer systems do all day? How do we make sure they continue doing it when failures occur? Traditional approaches to answering these questions often involve inband m...
Dan Pelleg, Muli Ben-Yehuda, Richard Harper, Lisa ...
This paper describes the design and implementation of SecondSite, a cloud-based service for disaster tolerance. SecondSite extends the Remus virtualization-based high availability...
Shriram Rajagopalan, Brendan Cully, Ryan O'Connor,...
: With the growing complexity of parallel architectures, the probability of system failures grows, too. One approach to cope with this problem is the self-healing, one of the organ...
We demonstrate an improved consensus-driven utility accrual scheduling algorithm (DUA-CLA) for distributable threads which execute under run-time uncertainties in execution time, ...
Jonathan Stephen Anderson, Binoy Ravindran, E. Dou...