As the scale of high-performance computing (HPC) continues to grow, failure resilience of parallel applications becomes crucial. In this paper, we present FT-Pro, an adaptive fault...
In many distributed computing systems that are prone to either induced or spontaneous node failures, the number of available computing resources is dynamically changing in a rando...
Sagar Dhakal, Majeed M. Hayat, Jorge E. Pezoa, Cha...
Condition-based maintenance (CBM) seeks to generate a design for a new shipwide CBM system that performs diagnoses and failure prediction on Navy shipboard machinery. Eventually, ...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...