Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstraction (i.e., a coherent global a...
Numerous mathematical approaches have been proposed to determine the optimal checkpoint interval for minimizing total execution time of an application in the presence of failures....
Abstract. This paper presents an adaptation of the ARIES recovery algorithm that solves the problem of recovery in Shared Disk (SD) database systems, whilst preserving all the desi...
Abstract. With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas Hé...
The emerging mobile wireless environment poses exciting challenges for distributed fault tolerant (FT) computing. This paper proposes a message logging and recovery protocol on the...