Sciweavers

ASPLOS
2015
ACM

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently...
∗ A portion of this work was performed at...
Added 16 Apr 2016
Updated 16 Apr 2016
Type Journal
Year 2015
Where ASPLOS
Authors Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt B. Ferreira, Jon Stearley, John Shalf, Sudhanva Gurumurthi