Sciweavers

» Adaptive Fault Management of Parallel Applications for High-...
IPPS
2008
IEEE
13 years 11 months ago
Enhancing application robustness through adaptive fault tolerance
As the scale of high performance computing (HPC) continues to grow, application fault resilience becomes crucial. To address this problem, we are working on the design of an adapt...
Zhiling Lan, Yawei Li, Ziming Zheng, Prashasta Guj...
ICDCS
1995
IEEE
13 years 8 months ago
Parallel Processing on Networks of Workstations: A Fault-Tolerant, High Performance Approach
One of the most sought-after software innovations of this decade is the construction of systems using off-the-shelf workstations that actually deliver, and even surpass, the power and ...
Partha Dasgupta, Zvi M. Kedem, Michael O. Rabin
DSN
2000
IEEE
13 years 9 months ago
Software-Implemented Fault Detection for High-Performance Space Applications
We describe and test a software approach to overcoming radiation-induced errors in spaceborne applications running on commercial off-the-shelf components. The approach uses checks...
Michael J. Turmon, Robert Granat, Daniel S. Katz
HOTI
2008
IEEE
13 years 11 months ago
QsNetIII an Adaptively Routed Network for High Performance Computing
In this paper we describe QsNetIII, an adaptively routed network for High Performance Computing (HPC) applications. We detail the structure of the network, the evolution of our...
Duncan Roweth, Trevor Jones
IPPS
2000
IEEE
13 years 8 months ago
Fault Tolerant Wide-Area Parallel Computing
Executing parallel applications across distributed networks introduces the problem of fault tolerance. A viable solution for fault tolerance must keep overhead manageable and not c...
Jon B. Weissman