FAST
2007

Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?

Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replac...
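The step from a datasheet MTTF to the "at most 0.88%" nominal annual failure rate quoted above can be reproduced with simple arithmetic. The sketch below assumes a drive powered on for all 8,760 hours of a year and a constant failure rate, so the annual rate is just hours per year divided by MTTF. The function name `nominal_afr_percent` is a hypothetical helper for illustration, not something taken from the paper.

```python
# Minimal sketch: convert a datasheet MTTF into a nominal annual failure rate.
# Assumptions (not from the listing): drives run 24 hours a day, 365 days a
# year, and the failure rate is constant over the drive's life.

HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours per year


def nominal_afr_percent(mttf_hours: float) -> float:
    """Return the nominal annual failure rate, in percent, for a given MTTF."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be a positive number of hours")
    return 100.0 * HOURS_PER_YEAR / mttf_hours


if __name__ == "__main__":
    # The two ends of the datasheet MTTF range quoted in the abstract.
    for mttf in (1_000_000, 1_500_000):
        afr = nominal_afr_percent(mttf)
        print(f"MTTF {mttf:>9,} h -> nominal AFR {afr:.2f}%")
```

Under these assumptions the 1,000,000-hour figure gives roughly 0.88% per year and the 1,500,000-hour figure roughly 0.58%, which is why the abstract treats 0.88% as the upper bound of the datasheet range. The field replacement rates of 2-4% reported above are several times higher than either value.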
Bianca Schroeder, Garth A. Gibson
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2007
Where FAST
Authors Bianca Schroeder, Garth A. Gibson