Sciweavers

1113 search results - page 2 / 223
» Performance under Failures of DAG-based Parallel Computing
IPPS
2009
IEEE
13 years 11 months ago
Robust sequential resource allocation in heterogeneous distributed systems with random compute node failures
The problem of finding efficient workload distribution techniques is becoming increasingly important today for heterogeneous distributed systems where the availability of comp...
Vladimir Shestak, Edwin K. P. Chong, Anthony A. Ma...
JSSPP
2004
Springer
13 years 10 months ago
Performance Implications of Failures in Large-Scale Cluster Scheduling
As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those s...
Yanyong Zhang, Mark S. Squillante, Anand Sivasubra...
IPPS
2006
IEEE
13 years 11 months ago
Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources
As the desire of scientists to perform ever larger computations drives the size of today’s high performance computers from hundreds, to thousands, and even tens of thousands of ...
Zizhong Chen, Jack Dongarra
PPOPP
2005
ACM
13 years 10 months ago
Fault tolerant high performance computing by a coding approach
As the number of processors in today’s high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the exe...
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julie...
MOBIHOC
2010
ACM
13 years 2 months ago
Data preservation under spatial failures in sensor networks
In this paper, we address the problem of preserving generated data in a sensor network in case of node failures. We focus on the type of node failures that have explicit spatial s...
Navid Hamed Azimi, Himanshu Gupta, Xiaoxiao Hou, J...