Drive failure is something most computer users experience sooner or later. All hard drives fail eventually. In some cases, however, another component fails first, leaving you with a dead computer but a living drive. If you’re really lucky, you may find yourself replacing a computer not because it’s broken, but because it’s hopelessly obsolete.
I’ve been very lucky in the drive department, myself. Not that it’s any fun having, say, memory meltdowns (my 1993 Mac PowerBook 145B) or power-supply failures (my 2002 Enpower laptop), but those don’t necessarily destroy your data when they go. The damage to your bank account for repairs—or even replacement—is unlikely to approach the cost of data recovery.
Hard drive manufacturers like to claim that their drives have a mean time to failure (MTTF) of 1 million hours or more. That’s just over 114 years, which is completely preposterous.
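If you want to check that arithmetic yourself, here’s the conversion in a few lines of Python (nothing fancy, just the claimed hours divided by the hours in a year):

```python
# Convert a claimed MTTF in hours into years.
HOURS_PER_YEAR = 24 * 365.25              # about 8,766 hours in an average year

claimed_mttf_hours = 1_000_000            # the manufacturers' claimed figure
mttf_years = claimed_mttf_hours / HOURS_PER_YEAR

print(f"{claimed_mttf_hours:,} hours is about {mttf_years:.1f} years")
# -> 1,000,000 hours is about 114.1 years
```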
Where do drive manufacturers get numbers like this? A drive that still works after ten years is possible, if not common. A drive that still works after a hundred years…would be useless anyway, at least if file sizes keep on increasing the way they have over the past few decades.
These MTTF claims lead you to wonder whether drive manufacturers have the same definition of “failure” that their customers do. The answer is no. Manufacturers define “failure” narrowly and precisely. Consumers and businesses define it more broadly: “We can’t use it for the purpose we bought it for.” Even if the fault isn’t in the drive itself, if the drive doesn’t play nicely with the rest of your system, it’s no good to you and you have to replace it with one that does.
One reason that manufacturer datasheets and user experience are so different is that the MTTF on the datasheet is based on artificial “accelerated aging” tests. Two recent studies, one from Google and one from Carnegie Mellon, demonstrate pretty conclusively that these tests do a lousy job of simulating the real effects of age on hard drives.
They also provide a few other dramatic surprises, at least to those familiar with the received wisdom about why drives fail. Everyone knows drives fail when they get too hot, right? Well…wrong. Google found “a clear trend showing that lower temperatures are associated with higher failure rates.” And they’re not talking about freezing temperatures, either.
Heavy usage just seems intuitively right as a reason for drive failure. After all, the more miles you drive your car, the more likely it is to break down. But according to Google’s survey of more than 100,000 disks, “After the first year, the AFR [annual failure rate] of high utilization drives is at most moderately higher than that of low utilization drives.”
Failure in the first year is known as “infant mortality.” Hard drives apparently have a pretty high infant mortality rate. Using a drive heavily seems to be the best way to find out whether it will die young—though that’s learning the hard way.
Drive failure rates were thought to follow a “bathtub curve”: fairly high in the first few months to a year (infant mortality), followed by a low but steady rate (the flat bottom of the bathtub), followed by a steep increase after a few years. Google’s results show an AFR of roughly 3% at three months, dropping to 2% at one year, then jumping to 8% at two years. Their graph doesn’t look much like a bathtub. And the Carnegie Mellon researchers “observed a continuous increase in replacement rates, starting as early as in the second year of operation.”
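To put those percentages in cumulative terms, here’s a rough back-of-envelope sketch using only the year-one and year-two figures above; it assumes each year’s rate applies to the drives that survived the previous year, which is a simplification rather than the study’s own method:

```python
# Rough cumulative survival over two years, based on the AFRs Google reported.
# Assumption: each year's AFR applies to the drives that survived the year
# before -- a back-of-envelope simplification, not the study's own analysis.
afr_by_year = {1: 0.02, 2: 0.08}          # ~2% in year one, ~8% in year two

surviving = 1.0
for year, afr in afr_by_year.items():
    surviving *= 1 - afr
    print(f"After year {year}: {surviving:.1%} of drives still running")

# After year 1: 98.0%
# After year 2: 90.2% -- roughly one drive in ten dead within two years
```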
The Carnegie Mellon study also covers about 100,000 disks, most of them higher-end than the ones used by Google. (Google, it seems, buys cheap consumer disks, perhaps because the cost of replacement is lower.) These are the disks with the claimed MTTF of 1 million hours and the corresponding claimed annual failure rate of less than 1%. The study found that the annual replacement rate of these various kinds of disks ranged from a low of about 0.5% to a high of 13.5%, with a weighted average of 3%.
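That “less than 1%” figure follows directly from the claimed MTTF. Here’s the conversion, again as a rough Python sketch using the standard approximation that the annual failure rate is about the hours in a year divided by the MTTF:

```python
# Translate a claimed MTTF into the implied annual failure rate (AFR).
# For rates this small, AFR is approximately hours-per-year divided by MTTF.
HOURS_PER_YEAR = 24 * 365.25

claimed_mttf_hours = 1_000_000
implied_afr = HOURS_PER_YEAR / claimed_mttf_hours

print(f"Implied AFR from the datasheet: {implied_afr:.2%}")        # about 0.88%
print(f"Weighted average the study actually measured: {0.03:.0%}")  # 3%
```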
And on top of that, drive failure in RAID systems seems to be contagious. I’ve forgotten most of what math I ever knew, but despite its somewhat dense language, the Carnegie Mellon study is clear enough: “the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data, compared to the exponential distribution. The probability of seeing two drives in the cluster fail within the same 10 hours is two times larger under the real data, compared to the exponential distribution.” In this case, all you really need to know about “exponential distribution” is that it means “the way we expected it to be.”
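If you want a feel for what the exponential model predicts, here’s a small Python sketch. The cluster size and per-drive failure rate are hypothetical numbers picked only for illustration, not figures from the study:

```python
import math

# Under an exponential (memoryless, independent-failure) model, the chance
# that at least one of n working drives fails within the next t hours is
#     1 - exp(-n * t / MTTF)
# The cluster size and MTTF below are hypothetical, chosen only to show
# the scale of the numbers -- they are not taken from either study.
cluster_size = 100            # drives still running in the cluster (assumed)
mttf_hours = 1_000_000        # per-drive MTTF (the claimed datasheet figure)

for window_hours in (1, 10):
    p = 1 - math.exp(-cluster_size * window_hours / mttf_hours)
    print(f"P(another failure within {window_hours:>2} hours) = {p:.4%}")

# The study's point: the probabilities measured from real data were roughly
# 4x (one-hour window) and 2x (ten-hour window) what this model predicts.
```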
The problem with statistics, as with all generalizations, is that they’re no good at predicting individual behavior. The Google study shows that even a single scan error makes your drive 39 times more likely to fail within 60 days, but that’s not a guarantee. It could last longer, just as your car could keep running just fine past 100,000 miles. Frankly, I’d put more faith in the car.
On the other hand, even the 8% failure rate Google found at two years still leaves you with pretty good odds. Certainly not as good as manufacturers’ data sheets would lead you to expect, but the hard drive on my PowerBook 145B still worked when I recycled it some ten years after buying it. The drive on my 1997 laptop functioned just fine in 2002, even though it was starting to feel a little cramped. None of the seven computers I’ve owned has suffered from drive failure.
But as far as I can tell from my own anecdotal experience, I’ve been unusually lucky. And it’s important to remember that hardware failure causes only a small fraction of the world’s data loss. There are system errors, software problems, natural disasters, theft, and the big one: human error. So even if failure rates several times higher than expected don’t scare you, there’s still no excuse for not backing up your data.