CCGRID
2008
IEEE

Fault Tolerance in Cluster Federations with O2P-CF

Fault tolerance is one of the key issues for large-scale applications executed on high-performance computing systems. In a cluster federation, clusters are gathered to provide huge computing power. To work efficiently on such systems, network characteristics have to be taken into account: the latency between two nodes of different clusters is much higher than the latency between two nodes of the same cluster. In this paper, we present O2P-CF, a message logging protocol well suited to providing fault tolerance for message-passing applications executed on cluster federations. O2P-CF is based on the combination of O2P, an extremely optimistic message logging protocol, with a pessimistic message logging protocol.
Thomas Ropars, Christine Morin
Added 12 Oct 2010
Updated 12 Oct 2010
Type Conference
Year 2008
Where CCGRID