Large scale compute clusters continue to grow to ever-increasing proportions. However, as clusters and applications continue to grow, the Mean Time Between Failures (MTBF) has redu...
Anomaly detection systems largely depend on user profile data to be able to detect deviation from normal activity. Most of this profile data is based on commands executed by use...
Current petascale systems have tens of thousands of hardware components and complex system software stacks, which increase the probability of faults occurring during the lifetime ...
In today’s high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied...
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety ...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...