Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, rollback all ...
As the scale of high performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adapt...
In the realm of data mining, several key issues exists in the traditional classification algorithms, such as low readability, large rule number, and low accuracy with information ...
Transient faults are single-shot hardware errors caused by high energy particles from space, manufacturing defects, overheating, and other sources. Such faults can be devastating f...
A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. These faults do not cause permanent damage, but may result in incorr...
David Walker, Lester W. Mackey, Jay Ligatti, Georg...