Experimental Assessment of Parallel Systems

In the research reported in this paper, transient faults were injected, using software fault injection, into the nodes and the communication subsystem of a commercial parallel machine running several real applications. The results showed that a significant percentage of faults caused the system to produce wrong results while the application appeared to terminate normally. This demonstrates that fault tolerance techniques are required in parallel systems, not only to ensure that long-running applications can terminate but also, more importantly, to ensure that the results they produce are correct. Of the techniques tested to reduce the percentage of undetected wrong results, only algorithm-based fault tolerance (ABFT) proved effective. For other simple error detection methods to be effective, they have to be designed in, not added as an afterthought. Faults injected in the communication subsystem demonstrated the effectiveness of end-to-end CRCs on data movements between processors.
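As an illustration of the ABFT idea mentioned in the abstract, the sketch below uses the classic checksum-matrix scheme for matrix multiplication. This is an assumption made for illustration only: the abstract does not say which ABFT scheme or which applications the authors evaluated, and all function names here are hypothetical.

```python
# Illustrative ABFT sketch (checksum-encoded matrix multiplication).
# Assumption: not the paper's code; names are hypothetical.

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def abft_matmul(a, b):
    """Multiply the checksum-encoded operands.

    The result is the ordinary product extended with one checksum
    column and one checksum row.
    """
    return matmul(with_column_checksum(a), with_row_checksum(b))

def check(full):
    """Return (bad_rows, bad_cols): indices whose checksums do not match."""
    n_rows = len(full) - 1
    n_cols = len(full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(full[i][:n_cols]) != full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(full[i][j] for i in range(n_rows)) != full[n_rows][j]]
    return bad_rows, bad_cols
```

If a single element of the data part of the result is corrupted after the computation, exactly one row check and one column check fail, which both detects the error and locates the faulty element. Integer matrices are used here so the comparisons are exact; floating-point data would need a tolerance.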
Added 02 Nov 2010
Updated 02 Nov 2010
Type Conference
Year 1996
Where FTCS
Authors João Gabriel Silva, Joao Carreira, Henrique Madeira, Diamantino Costa, Francisco Moreira