Abstract. With the number of computing elements spiraling to hundred of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas H&eacu...
FPGA-based designs are more susceptible to single-event upsets (SEUs) compared to ASIC designs. Soft error rate (SER) estimation is a crucial step in the design of soft error tole...
MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire outp...
Tyson Condie, Neil Conway, Peter Alvaro, Joseph M....
Over the past years, mobile agent technology has attracted considerable attention, and a significant body of literature has been published. To further develop mobile agent technol...
To assure dependable onboard evolution, we have developed a methodology called guarded software upgrading (GSU). In this paper, we focus on a low-cost approach to error containmen...
Ann T. Tai, Kam S. Tso, Leon Alkalai, Savio N. Cha...