Sciweavers

12 search results - page 2 / 3
» Fault tolerant MapReduce-MPI for HPC clusters
ICDCS
2012
IEEE
11 years 7 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...
ICPP
2007
IEEE
13 years 11 months ago
Mercury: Combining Performance with Dependability Using Self-virtualization
There has recently been increasing interest in using system virtualization to improve the dependability of HPC cluster systems. However, it is not cost-free and may come with som...
Haibo Chen, Rong Chen, Fengzhe Zhang, Binyu Zang, ...
CCGRID
2006
IEEE
13 years 11 months ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC (High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra
FGCS
2002
13 years 4 months ago
HARNESS fault tolerant MPI design, usage and performance issues
Initial versions of MPI were designed to work efficiently on multi-processors which had very little job control and thus static process models. Subsequently forcing them to suppor...
Graham E. Fagg, Jack Dongarra
HPDC
2000
IEEE
13 years 9 months ago
Distributed Processor Allocation in Large PC Clusters
Current processor allocation techniques for highly parallel systems are based on centralized front-end based algorithms. As a result, the applied strategies are restricted to stat...
Hans-Ulrich Heiss, César A. F. De Rose, Phi...