We propose a generalized forward recovery checkpointing scheme, with lookahead execution and rollback validation. This method takes advantage of voting and comparison on multiple v...
In today’s high-performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied...
Large instruction window processors achieve high performance by exposing large amounts of instruction-level parallelism. However, accessing large hardware structures typically req...
Haitham Akkary, Ravi Rajwar, Srikanth T. Srinivasa...
The probability that a failure will occur before the end of the computation increases as the number of processors used in a high-performance computing application increases. For l...
It is important that long-running server programs retain availability amidst software failures. However, server programs do fail and one of the important causes of failures in ser...