Sciweavers

» Distributed Fault Tolerant Controllers
HPDC
2009
IEEE
15 years 4 months ago
Interconnect agnostic checkpoint/restart in Open MPI
Long running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Joshua Hursey, Timothy Mattox, Andrew Lumsdaine
IPPS
2006
IEEE
15 years 4 months ago
Coordinated checkpoint from message payload in pessimistic sender-based message logging
Execution of MPI applications on Clusters and Grid deployments suffers from node and network failures, which motivates the use of fault-tolerant MPI implementations. Two category tec...
M. Aminian, Mohammad K. Akbari, Bahman Javadi
DSN
2003
IEEE
15 years 3 months ago
Integrating Recovery Strategies into a Primary Substation Automation System
The DepAuDE architecture provides middleware to integrate fault tolerance support into distributed embedded automation applications. It allows error recovery to be expressed in te...
Geert Deconinck, Vincenzo De Florio, Ronnie Belman...
HPDC
2000
IEEE
15 years 2 months ago
Robust Resource Management for Metacomputers
In this paper we present a robust software infrastructure for metacomputing. The system is intended to be used by others as a building block for large and powerful computational g...
Jörn Gehring, Achim Streit
HIPC
2000
Springer
15 years 1 months ago
Experiments with the CHIME Parallel Processing System
This paper presents the results from running five experiments with the Chime Parallel Processing System. The Chime System is an implementation of the CC++ programming language (p...
Anjaneya R. Chagam, Partha Dasgupta, Rajkumar Khan...