Sciweavers

» Checkpointing Aided Parallel Execution Model and Analysis
HPCC
2010
Springer
13 years 6 months ago
A Generic Execution Management Framework for Scientific Applications
Managing the execution of scientific applications in a heterogeneous grid computing environment can be a daunting task, particularly for long running jobs. Increasing fault tolera...
Tanvire Elahi, Cameron Kiddle, Rob Simmonds
SIGSOFT
2007
ACM
14 years 7 months ago
Efficient checkpointing of Java software using context-sensitive capture and replay
Checkpointing and replaying is an attractive technique that has been used widely at the operating/runtime system level to provide fault tolerance. Applying such a technique at the...
Guoqing Xu, Atanas Rountev, Yan Tang, Feng Qin
CLUSTER
2005
IEEE
13 years 12 months ago
Minimizing the Network Overhead of Checkpointing in Cycle-harvesting Cluster Environments
Cycle-harvesting systems such as Condor have been developed to make desktop machines in a local area (which are often similar to clusters in hardware configuration) available as ...
Daniel Nurmi, John Brevik, Richard Wolski
PODC
1994
ACM
13 years 10 months ago
A Checkpoint Protocol for an Entry Consistent Shared Memory System
Workstation clusters are becoming an interesting alternative to dedicated multiprocessors. In this environment, the probability of a failure, during an application's executio...
Nuno Neves, Miguel Castro, Paulo Guedes
ICDCS
2012
IEEE
11 years 8 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...