Sciweavers

1024 search results - page 136 / 205
» Fault Tolerance in Decentralized Systems
Sort
View
MIDDLEWARE
2009
Springer
15 years 4 months ago
Why Do Upgrades Fail and What Can We Do about It?
Abstract. Enterprise-system upgrades are unreliable and often produce downtime or data-loss. Errors in the upgrade procedure, such as broken dependencies, constitute the leading ca...
Tudor Dumitras, Priya Narasimhan
ISCA
2002
IEEE
115views Hardware» more  ISCA 2002»
15 years 2 months ago
SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At...
Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, ...
HPDC
2009
IEEE
15 years 4 months ago
Interconnect agnostic checkpoint/restart in open MPI
Long running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Joshua Hursey, Timothy Mattox, Andrew Lumsdaine
IPPS
2006
IEEE
15 years 3 months ago
Coordinated checkpoint from message payload in pessimistic sender-based message logging
Execution of MPI applications on Clusters and Grid deployments suffers from node and network failure that motivates the use of fault tolerant MPI implementations. Two category tec...
M. Aminian, Mohammad K. Akbari, Bahman Javadi
ISORC
2003
IEEE
15 years 3 months ago
A Dynamic Shadow Approach for Mobile Agents to Survive Crash Failures
Fault tolerance schemes for mobile agents to survive agent server crash failures are complex since developers normally have no control over remote agent servers. Some solutions mo...
Simon Pears, Jie Xu, Cornelia Boldyreff