Sciweavers

» Tolerating Faults in Synchronization Networks
ICDCS
2012
IEEE
13 years 2 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...
DSN
2004
IEEE
15 years 3 months ago
Implementing Simple Replication Protocols using CORBA Portable Interceptors and Java Serialization
The goal of this paper is to assess the value of simple features that are widely available in off-the-shelf CORBA and Java platforms for the implementation of fault-tolerance mecha...
Taha Bennani, Laurent Blain, Ludovic Courtè...
NSDI
2010
15 years 1 months ago
MapReduce Online
MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire outp...
Tyson Condie, Neil Conway, Peter Alvaro, Joseph M....
PODC
2009
ACM
15 years 6 months ago
Fast scalable deterministic consensus for crash failures
We study communication complexity of consensus in synchronous message-passing systems with processes prone to crashes. The goal in the consensus problem is to have all the nonfaul...
Bogdan S. Chlebus, Dariusz R. Kowalski, Michal Str...
EDCC
2008
Springer
15 years 1 months ago
A Distributed Approach to Autonomous Fault Treatment in Spread
This paper presents the design and implementation of the Distributed Autonomous Replication Management (DARM) framework built on top of the Spread group communication system. The ...
Hein Meling, Joakim L. Gilje