Sciweavers

48 search results - page 9 / 10
» Self-stabilizing algorithm for checkpointing in a distribute...
ICPP
1987
IEEE
13 years 9 months ago
A Software-Based Hardware Fault Tolerance Scheme for Multicomputers
A hardware fault tolerance scheme for large multicomputers executing time-consuming non-interactive applications is described. Error detection and recovery are done mostly by so...
Yuval Tamir, Eli Gafni
IPPS
2007
IEEE
14 years 3 days ago
The Adaptive Code Kitchen: Flexible Tools for Dynamic Application Composition
Driven by the increasing componentization of scientific codes, the deployment of high-end system infrastructures such as the Grid, and the desire to support high level problem so...
Pilsung Kang 0002, Mike Heffner, Joy Mukherjee, Na...
ICPP
2007
IEEE
14 years 4 days ago
Fault-Driven Re-Scheduling For Improving System-level Fault Resilience
The productivity of HPC systems is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
Yawei Li, Prashasta Gujrati, Zhiling Lan, Xian-He ...
HPDC
2009
IEEE
14 years 17 days ago
Interconnect agnostic checkpoint/restart in Open MPI
Long-running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...
Joshua Hursey, Timothy Mattox, Andrew Lumsdaine
CCGRID
2007
IEEE
14 years 5 days ago
Reparallelization and Migration of OpenMP Programs
Typical computational grid users target only a single cluster and have to estimate the runtime of their jobs. Job schedulers prefer short-running jobs to maintain a high system ut...
Michael Klemm, Matthias Bezold, Stefan Gabriel, Ro...