We present a technique that masks failures in a cluster to provide high availability and fault-tolerance for long-running, parallelized dataflows. We can use these dataflows to im...
Mehul A. Shah, Joseph M. Hellerstein, Eric A. Brew...
Diagnosing production run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e. development site diagnosis with the programmers prese...
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanth...
—While many-core accelerator architectures, such as today’s Graphics Processing Units (GPUs), offer orders of magnitude more raw computing power than contemporary CPUs, their m...
Aaron Ariel, Wilson W. L. Fung, Andrew E. Turner, ...
Aggressive technology scaling provides designers with an ever increasing budget of cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in individ...
Shuguang Feng, Shantanu Gupta, Amin Ansari, Scott ...
The UNIC code is being developed as part of the DOE’s Nuclear Energy Advanced Modeling and Simulation (NEAMS) program. UNIC is an unstructured, deterministic neutron transport c...
Dinesh K. Kaushik, Micheal Smith, Allan Wollaber, ...