Search Sciweavers | Sciweavers

37 search results - page 2 / 8



CLUSTER
2004
IEEE


Improved message logging versus improved coordinated checkpointing for fault tolerant MPI

13 years 8 months ago

Download www.cs.utk.edu

Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection a...

Pierre Lemarinier, Aurelien Bouteiller, Thomas H&e...


HPDC
2011
IEEE


Algorithm-based recovery for iterative methods without checkpointing

12 years 8 months ago

Download inside.mines.edu

In today’s high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied...

Zizhong Chen


ICDCS
2012
IEEE


Combining Partial Redundancy and Checkpointing for HPC

11 years 6 months ago

Download moss.csc.ncsu.edu

Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...

James Elliott, Kishor Kharbas, David Fiala, Frank ...


IPPS
2007
IEEE


The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI

13 years 10 months ago

Download www.open-mpi.org

To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementati...

Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox,...


CLUSTER
2004
IEEE


FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

13 years 8 months ago

Download charm.cs.uiuc.edu

As high performance clusters continue to grow in size, the mean time between failure shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challengi...

Gengbin Zheng, Lixia Shi, Laxmikant V. Kalé

