Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime ...
Cooperative checkpointing uses global knowledge of the state and health of the machine to improve performance and reliability by dynamically deciding when to skip checkpoint reque...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...
— High-end computing (HEC) systems have passed the petaflop barrier and continue to move toward the next frontier of exascale computing. As companies and research institutes con...
Narayan Desai, Darius Buntinas, Daniel Buettner, P...
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM’s BlueGene/L which can acc...