Sciweavers

21 search results - page 2 / 5
» Job-Site Level Fault Tolerance for Cluster and Grid environm...
Sort
View
CLUSTER
2003
IEEE
13 years 11 months ago
Coordinated Checkpoint versus Message Log for Fault Tolerant MPI
— Large Clusters, high availability clusters and Grid deployments often suffer from network, node or operating system faults and thus require the use of fault tolerant programmin...
Aurelien Bouteiller, Pierre Lemarinier, Gér...
ESCIENCE
2006
IEEE
13 years 11 months ago
Practical Fault-Tolerant Framework for eScience Infrastructure
Many areas of science currently use computing resources as a important part of their research, and many research groups adopt cluster architecture to use them efficiently and mana...
Hyuck Han, Jai Wug Kim, Jongpil Lee, Youngjin Yu, ...
CCGRID
2008
IEEE
14 years 4 days ago
An Autonomic Workflow Management System for Global Grids
Workflow Management System is generally utilized to define, manage and execute workflow applications on Grid resources. However, the increasing scale complexity, heterogeneity and...
Mustafizur Rahman 0003, Rajkumar Buyya
GRID
2003
Springer
13 years 11 months ago
Faults in Grids: Why are they so bad and What can be done about it?
Computational Grids have the potential to become the main execution platform for high performance and distributed applications. However, such systems are extremely complex and pro...
Raissa Medeiros, Walfredo Cirne, Francisco Vilar B...
CCGRID
2006
IEEE
13 years 11 months ago
Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation
With the increasing number of processors in modern HPC(High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Yuan Tang, Graham E. Fagg, Jack Dongarra