: In a decentralised system the problems of fault tolerance, and in particular error recovery, vary greatly depending on the design assumptions. For example, in a distributed datab...
: A general purpose group communication protocol suite called Newtop is described. It is assumed that processes can simultaneously belong to many groups, group size could be large,...
In this paper, we describe the design and implementation of two mechanisms for fault-tolerance and recovery for complex scientific workflows on computational grids. We present our ...
CCS is a resource management system for parallel high-performance computers. At the user level, CCS provides vendor-independent access to parallel systems. At the system administr...
The current multiprocessors such asCray T3D support interprocessor communication using partitioned dimension-order routers (PDRs). In a PDR implementation, the routing logic and sw...