Replication is a key technique for improving fault tolerance. Replication can also improve application performance under some circumstances, but can have the opposite effect under...
The popularity of distributed file systems continues to grow. Reasons they are preferred over traditional centralized file systems include fault tolerance, availability, scalabili...
Ragib Hasan, Zahid Anwar, William Yurcik, Larry Br...
Abstract. With the number of computing elements spiraling to hundred of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas H&eacu...
This paper presents a new challenge—verifying that a remote server is storing a file in a fault-tolerant manner, i.e., such that it can survive hard-drive failures. We describe...
Kevin D. Bowers, Marten van Dijk, Ari Juels, Alina...
The importance of fault tolerance at the processor architecture level has been made increasingly important due to rapid advancements in the design and usage of high performance de...
T. S. Ganesh, Viswanathan Subramanian, Arun K. Som...