Sciweavers

31 search results - page 3 / 7
» Failure Recovery for Distributed Processes in Single System ...
Sort
View
ICPP
1999
IEEE
13 years 9 months ago
Coherence-Centric Logging and Recovery for Home-Based Software Distributed Shared Memory
The probability of failures in software distributed shared memory (SDSM) increases as the system size grows. This paper introduces a new, efficient message logging technique, call...
Angkul Kongmunvattana, Nian-Feng Tzeng
PVM
2005
Springer
13 years 10 months ago
Scalable Fault Tolerant MPI: Extending the Recovery Algorithm
ct Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different methods to handle process failures beyond simple check-point restart schemes. The init...
Graham E. Fagg, Thara Angskun, George Bosilca, Jel...
CCGRID
2006
IEEE
13 years 11 months ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC(High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra
NOMS
2002
IEEE
133views Communications» more  NOMS 2002»
13 years 10 months ago
Highly available and efficient load cluster management system using SNMP and Web
To cope with the explosive increase in the number of requests to Internet server systems, one popular solution is a load-balancing technique that uses a dispatcher in the front-en...
Myung-Sup Kim, Mi-Jeong Choi, James W. Hong
FAST
2007
13 years 6 months ago
Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You?
Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we presen...
Bianca Schroeder, Garth A. Gibson