Sciweavers

535 search results - page 2 / 107
» Fault tolerant high performance computing by a coding approa...
Sort
View
SC
2005
ACM
13 years 10 months ago
Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers
We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented ...
Roberto Gioiosa, José Carlos Sancho, Song J...
DSN
2000
IEEE
13 years 9 months ago
Software-Implemented Fault Detection for High-Performance Space Applications
We describe and test a software approach to overcoming radiation-induced errors in spaceborne applications running on commercial off-the-shelf components. The approach uses checks...
Michael J. Turmon, Robert Granat, Daniel S. Katz
IPPS
2007
IEEE
13 years 11 months ago
Self Adaptive Application Level Fault Tolerance for Parallel and Distributed Computing
Most application level fault tolerance schemes in literature are non-adaptive in the sense that the fault tolerance schemes incorporated in applications are usually designed witho...
Zizhong Chen, Ming Yang, Guillermo A. Francia III,...
IJHPCA
2006
117views more  IJHPCA 2006»
13 years 4 months ago
MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI
Abstract-- High performance computing platforms like Clusters, Grid and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message...
Aurelien Bouteiller, Thomas Hérault, G&eacu...
INFOCOM
2012
IEEE
11 years 7 months ago
Experimental performance comparison of Byzantine Fault-Tolerant protocols for data centers
Abstract—In this paper, we implement and evaluate three different Byzantine Fault-Tolerant (BFT) state machine replication protocols for data centers: (1) BASIC: The classic solu...
Guanfeng Liang, Benjamin Sommer, Nitin H. Vaidya