
SIGMETRICS
2009
ACM

DRAM errors in the wild: a large-scale field study

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held...
Added 28 May 2010
Updated 28 May 2010
Type Conference
Year 2009
Where SIGMETRICS
Authors Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber