Sciweavers

PPOPP
2003
ACM
15 years 11 months ago
Automated application-level checkpointing of MPI programs
Because of increasing hardware and software complexity, the running time of many computational science applications is now more than the mean-time-to-failure of high-performance com...
Greg Bronevetsky, Daniel Marques, Keshav Pingali, ...
LATA
2009
Springer
15 years 10 months ago
On Parallel Communicating Grammar Systems and Correctness Preserving Restarting Automata
This paper contributes to the study of Freely Rewriting Restarting Automata (FRR-automata) and Parallel Communicating Grammar Systems (PCGS) as formalizations of the ling...
Dana Pardubská, Martin Plátek, Fried...
HPCA
2007
IEEE
16 years 6 months ago
Evaluating MapReduce for Multi-core and Multiprocessor Systems
This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems. MapReduce was created by Google for application development on data-centers...
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, G...
ICDCS
2008
IEEE
16 years 8 days ago
stdchk: A Checkpoint Storage System for Desktop Grid Computing
Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This paper argues that...
Samer Al-Kiswany, Matei Ripeanu, Sudharshan S. Vaz...
ICPP
2007
IEEE
16 years 4 days ago
A Meta-Learning Failure Predictor for Blue Gene/L Systems
The demand for more computational power in science and engineering has spurred the design and deployment of ever-growing cluster systems. Even though the individual components use...
Prashasta Gujrati, Yawei Li, Zhiling Lan, Rajeev T...