As the scale of high performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adapt...
One of the mostsoughtaftersoftware innovation of thisdecade is the construction of systems using off-the-shelf workstations that actually deliver, and even surpass, the power and ...
We describe and test a software approach to overcoming radiation-induced errors in spaceborne applications running on commercial off-the-shelf components. The approach uses checks...
—In this paper we describe QsNetIII , an adaptively routed network for High Performance Computing (HPC) applications. We detail the structure of the network, the evolution of our...
Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not c...