ICPP
1987
IEEE

A Software-Based Hardware Fault Tolerance Scheme for Multicomputers

A hardware fault tolerance scheme for large multicomputers executing time-consuming non-interactive applications is described. Error detection and recovery are done mostly by software with little hardware support. The scheme is based on simultaneous execution of identical copies of the application on two subnetworks of the system. Normal system operation is periodically suspended and the logical states of the two subnetworks are synchronized. Errors are detected by comparing the "frozen" synchronized states of the two subnetworks while they are being saved as "checkpoints" for possible subsequent use for error recovery. Algorithms for error detection and recovery using this scheme are discussed.
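The duplicate-compare-checkpoint cycle summarized in the abstract can be illustrated with a toy, single-process sketch. This is not the authors' algorithm, whose details are not given here: the two "subnetworks" are modeled as two copies of an integer state, and the names `run_duplicated`, `digest`, `step`, `interval`, and `fault_at` are hypothetical, introduced only for illustration. The sketch assumes a deterministic `step` function and a transient fault that does not recur after rollback.

```python
import hashlib


def digest(state):
    """Hash a replica's logical state so the two copies can be compared cheaply."""
    return hashlib.sha256(repr(state).encode()).hexdigest()


def run_duplicated(step, initial_state, total_steps, interval, fault_at=None):
    """Run two replicas of `step`; compare and checkpoint every `interval` steps.

    fault_at is a hypothetical (replica_index, step_number) pair that injects
    one transient error (a flipped low bit) into an integer state.
    Returns (final_state, number_of_rollbacks).
    """
    replicas = [initial_state, initial_state]
    checkpoint, checkpoint_step = initial_state, 0
    done, rollbacks = 0, 0

    while done < total_steps:
        # Both replicas execute the same step of the application.
        for i in range(2):
            replicas[i] = step(replicas[i])
            if fault_at == (i, done):
                replicas[i] ^= 1   # simulated transient hardware error
                fault_at = None    # transient: it does not recur on re-execution
        done += 1

        # Periodically "freeze" both replicas and compare their states.
        if done % interval == 0 or done == total_steps:
            if digest(replicas[0]) == digest(replicas[1]):
                # States agree: save them as the new checkpoint.
                checkpoint, checkpoint_step = replicas[0], done
            else:
                # States differ: an error occurred since the last checkpoint,
                # so restore both replicas from it and re-execute.
                replicas = [checkpoint, checkpoint]
                done = checkpoint_step
                rollbacks += 1

    return replicas[0], rollbacks


if __name__ == "__main__":
    final, rollbacks = run_duplicated(lambda s: s + 3, 0, 10, 4, fault_at=(1, 5))
    print(final, rollbacks)
```

In this sketch a mismatch only reveals that one of the two copies is wrong, not which one, so both are rolled back to the last checkpoint on which they agreed. The checkpoint interval trades comparison overhead against the amount of work repeated after an error.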
Yuval Tamir, Eli Gafni
Added 28 Aug 2010
Updated 28 Aug 2010
Type Conference
Year 1987
Where ICPP
Authors Yuval Tamir, Eli Gafni