Sciweavers

» A large-scale study of failures in high-performance computin...
ICPPW
2008
IEEE
14 years 21 days ago
Simulating Failures on Large-Scale Systems
Developing fault management mechanisms is a difficult task because of the unpredictable nature of failures. In this paper, we present a fault simulation framework for Blue Gene...
Narayan Desai, Ewing L. Lusk, Daniel Buettner, And...
CCGRID
2006
IEEE
14 years 9 days ago
A Failure-Aware Scheduling Strategy in Large-Scale Cluster System
As the scale is expanding, node failure becomes a commonplace feature of large-scale cluster systems. As an important part of cluster operating system software, job scheduling tak...
Linping Wu, Dan Meng, Jianfeng Zhan, Wang Lei, Bib...
JSSPP
2004
Springer
13 years 11 months ago
Performance Implications of Failures in Large-Scale Cluster Scheduling
As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those s...
Yanyong Zhang, Mark S. Squillante, Anand Sivasubra...
DSN
2004
IEEE
13 years 10 months ago
Cluster-Based Failure Detection Service for Large-Scale Ad Hoc Wireless Network Applications
The growing interest in ad hoc wireless network applications that are made of large and dense populations of lightweight system resources calls for scalable approaches to fault to...
Ann T. Tai, Kam S. Tso, William H. Sanders
SC
2000
ACM
13 years 10 months ago
The Failure of TCP in High-Performance Computational Grids
Distributed computational grids depend on TCP to ensure reliable end-to-end communication between nodes across the wide-area network (WAN). Unfortunately, TCP performance can be a...
Wu-chun Feng, Peerapol Tinnakornsrisuphap