Search Sciweavers | Sciweavers

31 search results - page 1 / 7

IPPS
2007
IEEE

137 views | Distributed And Parallel Com...

The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI

13 years 10 months ago

Download www.open-mpi.org

To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementati...

Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox,...

CLUSTER
2003
IEEE

165 views | Distributed And Parallel Com...

Coordinated Checkpoint versus Message Log for Fault Tolerant MPI

13 years 9 months ago

Download www.cs.utk.edu

Large clusters, high availability clusters and Grid deployments often suffer from network, node or operating system faults and thus require the use of fault tolerant programmin...

Aurelien Bouteiller, Pierre Lemarinier, Gér...

ICPADS
2010
IEEE

169 views | Distributed And Parallel Com...

Hybrid Checkpointing for MPI Jobs in HPC Environments

13 years 2 months ago

Download moss.csc.ncsu.edu

As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images ev...

Chao Wang, Frank Mueller, Christian Engelmann, Ste...

HIPC
2009
Springer

146 views | Distributed And Parallel Com...

Fast checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on multicore architecture

13 years 2 months ago

Download nowlab.cse.ohio-state.edu

Large-scale compute clusters continue to grow to ever-increasing proportions. However, as clusters and applications continue to grow, the Mean Time Between Failures (MTBF) has redu...

Xiangyong Ouyang, Karthik Gopalakrishnan, Tejus Ga...

PVM
2005
Springer

78 views | Distributed And Parallel Com...

Scalable Fault Tolerant MPI: Extending the Recovery Algorithm

13 years 10 months ago

Download icl.cs.utk.edu

Fault Tolerant MPI (FT-MPI) [6] was designed as a solution to allow applications different methods to handle process failures beyond simple check-point restart schemes. The init...

Graham E. Fagg, Thara Angskun, George Bosilca, Jel...
