As computer clusters continue to grow in size and popularity, fault tolerance is becoming crucial to ensuring high performance and reliability for applications. To provide...
Antonio S. Martins, Ronaldo Augusto Lara Gonç...
Checkpoint and Recovery (CPR) systems have many uses in high-performance computing. Because of this, many developers have implemented them by hand in their applications. One of ...
Greg Bronevetsky, Rohit Fernandes, Daniel Marques,...
This paper presents a new functionality of the Automatic Differentiation (AD) tool tapenade. tapenade generates adjoint codes, which are widely used for optimization or inverse prob...
Numerous mathematical approaches have been proposed to determine the checkpoint interval that minimizes the total execution time of an application in the presence of failures....