Sciweavers

» Checkpoint and Recovery Methods in the ParaSol Simulation Sy...
IEEEHPCS
2010
13 years 3 months ago
Resilient workflows for high-performance simulation platforms
Workflow systems are considered here to support large-scale multiphysics simulations. Because the use of large distributed and parallel multi-core infrastructures is prone to soft...
Toan Nguyen, Laurentiu Trifan, Jean-Antoine Deside...
ICDCS
2000
IEEE
13 years 10 months ago
On Low-Cost Error Containment and Recovery Methods for Guarded Software Upgrading
To assure dependable onboard evolution, we have developed a methodology called guarded software upgrading (GSU). In this paper, we focus on a low-cost approach to error containmen...
Ann T. Tai, Kam S. Tso, Leon Alkalai, Savio N. Cha...
TMC
2010
13 years 4 months ago
Decentralized QoS-Aware Checkpointing Arrangement in Mobile Grid Computing
This paper deals with decentralized, QoS-aware middleware for checkpointing arrangement in Mobile Grid (MoG) computing systems. Checkpointing is more crucial in MoG systems than...
Paul J. Darby III, Nian-Feng Tzeng
SRDS
1994
IEEE
13 years 9 months ago
Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers
Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sens...
G. Janakiraman, Yuval Tamir
IPPS
2007
IEEE
14 years 1 days ago
A Fault Tolerance Protocol with Fast Fault Recovery
Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, roll back all ...
Sayantan Chakravorty, Laxmikant V. Kalé