Search Sciweavers | Sciweavers

31 search results - page 2 / 7

» The Design and Implementation of Checkpoint Restart Process ...

IPPS
2007
IEEE

Implementing and Evaluating Automatic Checkpointing

14 years 3 months ago

Download www.cecs.uci.edu

As the size and popularity of computer clusters go on growing, fault tolerance is becoming a crucial factor to ensure high performance and reliability for applications. To provide...

Antonio S. Martins, Ronaldo Augusto Lara Gonç...

CLUSTER
2004
IEEE

FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

14 years 1 month ago

Download charm.cs.uiuc.edu

As high performance clusters continue to grow in size, the mean time between failure shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challengi...

Gengbin Zheng, Lixia Shi, Laxmikant V. Kalé

HPDC
2009
IEEE

Interconnect agnostic checkpoint/restart in Open MPI

14 years 4 months ago

Download www.osl.iu.edu

Long running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passi...

Joshua Hursey, Timothy Mattox, Andrew Lumsdaine

IPPS
2005
IEEE

Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance

14 years 3 months ago

Download hpc.pnl.gov

Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hiberna...

José Carlos Sancho, Fabrizio Petrini, Kei D...

IPPS
2007
IEEE

A Fault Tolerance Protocol with Fast Fault Recovery

14 years 3 months ago

Download www.cecs.uci.edu

Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, rollback all ...

Sayantan Chakravorty, Laxmikant V. Kalé
