Sciweavers

149 search results - page 4 / 30
» The Performance of Coordinated and Independent Checkpointing
PPOPP
2003
ACM
13 years 11 months ago
Automated application-level checkpointing of MPI programs
Because of increasing hardware and software complexity, the running time of many computational science applications is now more than the mean-time-to-failure of high-performance com...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
PVLDB
2008
13 years 5 months ago
Fault-tolerant stream processing using a distributed, replicated file system
We present SGuard, a new fault-tolerance technique for distributed stream processing engines (SPEs) running in clusters of commodity servers. SGuard is less disruptive to normal s...
YongChul Kwon, Magdalena Balazinska, Albert G. Gre...
SRDS
1999
IEEE
13 years 10 months ago
Logging and Recovery in Adaptive Software Distributed Shared Memory Systems
Software distributed shared memory (DSM) improves the programmability of message-passing machines and workclusters by providing a shared memory abstract (i.e., a coherent global a...
Angkul Kongmunvattana, Nian-Feng Tzeng
HASE
1996
IEEE
13 years 10 months ago
Adaptive recovery for mobile environments
Mobile computing allows ubiquitous and continuous access to computing resources while the users travel or work at a client's site. The flexibility introduced by mobile computi...
Nuno Neves, W. Kent Fuchs
ICDCS
2012
IEEE
11 years 8 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...