Sciweavers

31 search results - page 2 / 7
» Failure Recovery for Distributed Processes in Single System ...
SC
2009
ACM
13 years 11 months ago
FALCON: a system for reliable checkpoint recovery in shared grid environments
In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation is tolerable. For gu...
Tanzima Zerin Islam, Saurabh Bagchi, Rudolf Eigenm...
SRDS
1999
IEEE
13 years 9 months ago
Logging and Recovery in Adaptive Software Distributed Shared Memory Systems
Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstraction (i.e., a coherent global a...
Angkul Kongmunvattana, Nian-Feng Tzeng
SRDS
1999
IEEE
13 years 9 months ago
An Adaptive Checkpointing Protocol to Bound Recovery Time with Message Logging
Numerous mathematical approaches have been proposed to determine the optimal checkpoint interval for minimizing total execution time of an application in the presence of failures....
Kuo-Feng Ssu, Bin Yao, W. Kent Fuchs
IWCC
1999
IEEE
13 years 9 months ago
Single I/O Space for Scalable Cluster Computing
In this paper, we propose a novel Single I/O Space architecture for achieving a Single System Image (SSI) at the I/O subsystem level. This is very much desired in a scalable clust...
Roy S. C. Ho, Hai Jin, Kai Hwang
CLUSTER
2008
IEEE
13 years 11 months ago
Reliable adaptable Network RAM
We present reliability solutions for adaptable Network RAM systems running on general-purpose clusters. Network RAM allows nodes with over-committed memory to swap pages...
Tia Newhall, Daniel Amato, Alexandr Pshenichkin