Sciweavers

5 search results - page 1 / 1
CCGRID
2006
IEEE
13 years 10 months ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC (High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra
ICPP
2009
IEEE
13 years 11 months ago
Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems
Clusters and applications continue to grow in size while their mean time between failure (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for lar...
Xiangyong Ouyang, Karthik Gopalakrishnan, Dhabales...
EMSOFT
2006
Springer
13 years 8 months ago
Implementing fault-tolerance in real-time systems by automatic program transformations
We present a formal approach to implement and certify fault-tolerance in real-time embedded systems. The fault-intolerant initial system consists of a set of independent periodic t...
Tolga Ayav, Pascal Fradet, Alain Girault
IPPS
1997
IEEE
13 years 8 months ago
External Adjustment of Runtime Parameters in Time Warp Synchronized Parallel Simulators
Several optimizations to the Time Warp synchronization protocol for parallel discrete event simulation have been proposed and studied. Many of these optimizations have included so...
Radharamanan Radhakrishnan, Lantz Moore, Philip A....
SENSYS
2006
ACM
13 years 10 months ago
Capsule: an energy-optimized object storage system for memory-constrained sensor devices
Recent gains in energy-efficiency of new-generation NAND flash storage have strengthened the case for in-network storage by data-centric sensor network applications. This paper ...
Gaurav Mathur, Peter Desnoyers, Deepak Ganesan, Pr...