Search Sciweavers

9 search results - page 1 / 2

IPPS 2005, IEEE

Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems

Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...

Adam J. Oliner, Ramendra K. Sahoo, José E. ...

JSSPP 2004, Springer

Performance Implications of Failures in Large-Scale Cluster Scheduling

As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those s...

Yanyong Zhang, Mark S. Squillante, Anand Sivasubra...

IPPS 2006, IEEE

Lossless compression for large scale cluster logs

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM’s Blue Gene/L which can acc...

R. Balakrishnan, Ramendra K. Sahoo

ICPP 2009, IEEE

Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems

Clusters and applications continue to grow in size while their mean time between failure (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for lar...

Xiangyong Ouyang, Karthik Gopalakrishnan, Dhabales...

CLUSTER 2003, IEEE

Coordinated Checkpoint versus Message Log for Fault Tolerant MPI

Large clusters, high-availability clusters, and Grid deployments often suffer from network, node, or operating system faults and thus require the use of fault tolerant programmin...

Aurelien Bouteiller, Pierre Lemarinier, Gér...
