Author Topic: Should we pre-emptively retire old hard drives? (Read 61136 times)

mouser · « **on:** August 02, 2012, 01:52 AM »

We all know that hard drives can and will fail eventually, and often unpredictably and without warning. That's why we make sure we back up regularly.

But here's a question I've been thinking about lately, and I don't know the answer to:

Should we pre-emptively retire old but perfectly-working hard drives, and migrate data to a new drive? If so, after how many hours?

Or should we just run them into the ground until they fail?

Here's a screenshot of one of my favorite tools (CrystalDiskInfo), showing SMART data of my oldest drive, with 39,000 hours powered up:
Screenshot - 8_2_2012 , 1_44_06 AM.png

Is powered-up hours even the right metric to use -- or should we be using the actual years since manufacture?
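To put the 39,000-hour figure in perspective, here is a minimal sketch that converts power-on hours into years of continuous running and into a duty cycle. The six-year calendar age used in the example is a made-up value for illustration, not something stated in this thread.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours in an average calendar year


def power_on_years(power_on_hours: float) -> float:
    """Convert SMART power-on hours to years of continuous running."""
    return power_on_hours / HOURS_PER_YEAR


def duty_cycle(power_on_hours: float, years_since_manufacture: float) -> float:
    """Fraction of the drive's calendar age that it has spent powered up."""
    return power_on_hours / (years_since_manufacture * HOURS_PER_YEAR)


# 39,000 hours is the figure from the screenshot above.
print(round(power_on_years(39_000), 2))  # 4.45

# Hypothetical calendar age of 6 years, purely for illustration.
print(round(duty_cycle(39_000, 6), 2))
```

The two metrics diverge most for drives that sit in a drawer or in a machine that is switched off overnight, which is why both numbers are worth looking at.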

JoTo · « **Reply #1 on:** August 02, 2012, 02:13 AM »

Hi mousie,

well, my experience shows me that an electronic device has the following lifetime failure curve: either it's DOA (dead on arrival) or it will break after the first few days/weeks of use. Once that period has passed, the electronic device works reliably as long as no external disasters (shock, overvoltage) take place. This period lasts very long, and mostly I exchange the whole PC before that electronic device breaks.

Normal HDDs (not SSDs) are a bit of another story, as they contain, besides the electronics, mechanical parts (spinning platters, hopping heads). That means that the electronic part of an HDD follows the lifecycle I mentioned above, while the mechanical parts suffer more and more and wear out from steady use.

So at first glance it looks like a good idea to exchange HDDs regularly before they fail. But this carries another danger: running into an electronic failure of a new HDD, and you still have no guarantee that the mechanical parts of the new HDD will work as expected. And this counters the idea of pre-emptive exchange of HDDs.

IMO, from my 25 years of experience in the IT business (handling also technical problems - incl. hardware failures - of customers on the support line), it makes no sense to exchange in advance. Keep good backups and use a RAID system (which is what we do with our customers), which enables you to nearly hot-plug-exchange a failed HDD with only minutes of system outage and without any data loss. Then run the HDDs until they break again. That saves money and resources too.

Of course, if any symptoms arise (bad blocks etc.), even if the HDD is still working fine (except for those infrequent little glitches), that is a sure sign to exchange the HDD IMMEDIATELY!

Just my 2ct.

Greetings
JoTo

vlastimil · « **Reply #2 on:** August 02, 2012, 04:57 AM »

I would say no. If the drive is in a good state, I would let it run not until it dies, but until more space or capacity is needed.

Stoic Joker · « **Reply #3 on:** August 02, 2012, 06:45 AM »

Personally, I only replace things when they either die or get flaky, but I'm also a big fan of hot-swappable hardware RAID. However, I have run across folks that used the MTBF as a replacement point for drives in their company's (high-end commercial) SAN. I'm sort of torn as to the logic in that, as it seems to me that you'd just be swapping one evil (disk failure) for another (RAID rebuild) ... either of which has the potential to end catastrophically.

Perhaps in mission-critical high-availability systems it would be best to schedule one's performance-hiccup outages by replacing drives on a preemptive basis (at considerable co$t...). But outside of that it strikes me as a bit wasteful and rather silly. Did I mention I tend to be just a tad on the cheap side..?

Renegade · « **Reply #4 on:** August 02, 2012, 06:59 AM »

Did I mention I tend to be just a tad on the cheap side..?
-Stoic Joker (August 02, 2012, 06:45 AM)

Hahahah~!

For myself, I think that the decision to replace hardware first needs to be balanced against the likelihood of failure and the cost to recover.

Typically, if you have backups, and can recover with them, then the question is about how much the cost of unplanned downtime is versus the cost of planned downtime for the replacement.
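The trade-off described here can be written as a rough expected-cost comparison. This is only a sketch: every number below (failure probabilities, drive price, downtime costs) is a hypothetical placeholder, and the function names are made up for the example.

```python
def cost_run_to_failure(p_fail: float, unplanned_cost: float) -> float:
    """Expected cost over the period if the old drive stays in service."""
    return p_fail * unplanned_cost


def cost_preemptive(drive_price: float, planned_cost: float,
                    p_new_fail: float, unplanned_cost: float) -> float:
    """Cost of swapping now, plus the chance that the new drive fails anyway."""
    return drive_price + planned_cost + p_new_fail * unplanned_cost


# Hypothetical inputs: 10% chance the old drive dies this year, 3% for a new
# one, an 800 (any currency) unplanned recovery, a 150 planned migration,
# and a 100 replacement drive.
keep = cost_run_to_failure(p_fail=0.10, unplanned_cost=800)
swap = cost_preemptive(drive_price=100, planned_cost=150,
                       p_new_fail=0.03, unplanned_cost=800)
print(f"keep: {keep:.0f}  swap: {swap:.0f}")
```

With these made-up inputs, keeping the drive is cheaper; the balance tips toward swapping only when the unplanned recovery cost or the old drive's failure probability is much higher.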

Copying data is trivial. Configuring a server or setting up a new desktop isn't. They take time and are painful.

I guess it all boils down to the specific situation. I'm all in favor of the cheapest one too though!

(Generally, I find the cost of a hard drive is significantly less than the cost of disaster recovery and setting up a server again -- assuming the backups are for data, and not the system. I have a complete system backup on this desktop, but still wouldn't want to fart around with recovery.)

dantheman · « **Reply #5 on:** August 02, 2012, 08:01 AM »

Speaking of hard drives!

I've got a bad Linux (Xfce) installation and would like to wipe the drive.
Can't even access Firefox or Synaptic!

Probably need a software to run from usb / cd or dvd.

40hz · « **Reply #6 on:** August 02, 2012, 08:14 AM »

Speaking of hard drives!

I've got a bad Linux (Xfce) installation and would like to wipe the drive.
Can't even access Firefox or Synaptic!

Probably need a software to run from usb / cd or dvd.
-dantheman (August 02, 2012, 08:01 AM)

@dantheman - Download a free copy of DBAN (Darik's Boot and Nuke).

Boot it from a CD or USB, select the "quick" option, and let it run for about 5 minutes. It doesn't need to run to completion for what you want to do.

dantheman · « **Reply #7 on:** August 02, 2012, 08:16 AM »

Tanks 40hz!

Was just googlin' and came upon it.
As i was downloading/burning it, your post came through.

It's good to hear from a friend to confirm a product (and give instructions too!).

mouser · « **Reply #8 on:** August 02, 2012, 08:33 AM »

Some sound advice here.. I guess what I'm hearing is:

For the next big new computer build, make the move to RAID.
Because of the unpredictability of failure, the costs+risks vs. benefits of pre-emptively retiring old drives make it unappealing -- especially compared with the alternative approach of using RAID.

Personally I've avoided RAID for a long time -- mainly because my approach has been to use HD racks, with 3 hard drives in my desktop PC, each serving a different purpose (Operating system, My data, Large Drive for backups, etc.). And I don't think RAID will let me do that.. Though I'd love to hear about a solution that would let me use that kind of 3-drive setup, and give the computer a 4th terabyte drive and configure it to store RAID-like instant redundancy for all the other drives. Anything like that exist?

40hz · « **Reply #9 on:** August 02, 2012, 08:34 AM »

@Mouser - I agree with the majority. There's no real advantage to doing preemptive drive replacements since the increased 'safety' of installing a new drive gets offset by the risk of "infant mortality" that comes with any new electronic component.

Disk drives are like airliners. The greatest likelihood of crashing comes during takeoff or landing. In the case of hard drives most fail very early - or very late in their service life. In-between they're usually just fine.
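The "early or late" pattern is often drawn as a bathtub curve. A minimal sketch of one way to model it is to add a falling Weibull hazard (infant mortality), a constant background rate, and a rising Weibull hazard (wear-out). All shape and scale parameters below are invented for illustration and are not fitted to any real drive data.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a Weibull distribution at age t (t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    """Toy bathtub curve: early failures + random failures + wear-out."""
    infant = weibull_hazard(t_years, shape=0.5, scale=20.0)  # falls with age
    background = 0.02                                        # constant rate
    wearout = weibull_hazard(t_years, shape=5.0, scale=8.0)  # rises with age
    return infant + background + wearout


for age in (0.1, 1, 3, 5, 8):
    print(f"age {age:>4} y: hazard {bathtub_hazard(age):.3f} per year")
```

With these toy parameters the rate is high in the first weeks, lowest in the middle years, and climbs again toward the end, which is the shape described above.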

A simple disk utility will let you monitor how your drive is doing. Too many errors, or a steadily increasing number of them, is usually a good indication your drive is aging. Same goes for seeing repairs reported regularly whenever you use chkdsk. If Windows or SMART throws you a warning, however, I'd replace the drive as soon as possible, since things are starting to get pretty serious by the time SMART squawks about it.

For my clients, we usually stockpile a few quality name-brand hard drives. We try to buy them on sale, or whenever we spot a good deal. As long as they're on hand we're pretty much covered.

And while it may sound 'unscientific,' I've discovered you usually "just know" when a drive needs to be replaced since most drives don't abruptly fail without giving you some indication that "something is wrong."

Having a replacement drive ready to go, and replacing your old one when you feel something isn't quite right seems to be the best and most reliable way to protect yourself. That and regular backups.

Luck!

TaoPhoenix · « **Reply #10 on:** August 02, 2012, 08:40 AM »

Okay, this topic confuses me on so many levels, so the following are only opinions and should not be taken as advice.

1. What is a "powered up hour" and why is that different from simple elapsed time? My current desktop is doing okay per se, I haven't run any diagnostics but I haven't noticed any Flaking either. So the whole machine was custom built in 2006 by a friend, and it's basically been running steady ever since without any down time.

2. I am not initially inclined to just "randomly" decide "ho hum, it's been seven years, let's just retire my drive". Instead, at the time of designing the system I put in a dedicated backup data drive, opposite the OS, precisely to have resources against a hard drive failure. While my backups are far from as often as might be smart, let's just say that a random week's worth of a full copy-over plan would bring things up to snuff.

3. Let's suppose that my only data copy was on the D aka "Data Drive" opposite the OS on C, wouldn't the C drive with the OS actions and upkeep be the drive that fails first? Wouldn't the data be pretty safe since the D data drive does nearly nothing but sit there?

IainB · « **Reply #11 on:** August 02, 2012, 09:42 AM »

Not sure if this helps or is useful:
Consider the old operational approach to not replace anything unless:

(a) it is becoming a throughput bottleneck for newer/faster technology processors.
(b) it shows signs of impending/potential failure - e.g., per HDsentinel and/or CrystalDiskInfo reports.

Other worthwhile considerations might be:

The RAID approach. (As already mentioned.)
Real-time backups to online backup/mirror hard drives - i.e., each production primary has a trailing secondary drive. I think this would be similar to the old Tandem NonStop approach. The secondary drive would automatically swop in, according to some set of rules, when the primary started showing problems. Built-in redundancy.

40hz · « **Reply #12 on:** August 02, 2012, 09:47 AM »

^From my experience, drives fail when they fail. And each drive has its own probability of failing. Multiple drives actually increase the likelihood of having at least one drive fail in a given system. And a backup drive is no less likely to fail than a main drive.
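The point about multiple drives follows from basic probability: if each drive fails independently with probability p over some period, the chance that at least one of n drives fails is 1 - (1 - p)^n. A short sketch, assuming independence and a made-up 5% per-drive failure probability:

```python
def p_at_least_one_failure(p_single: float, n_drives: int) -> float:
    """Chance that one or more of n independent drives fail in the period."""
    return 1.0 - (1.0 - p_single) ** n_drives


# 0.05 is a hypothetical per-drive failure probability for the period.
for n in (1, 2, 4, 8):
    print(n, "drives:", round(p_at_least_one_failure(0.05, n), 3))
```

Independence is itself an assumption; drives sharing a case, a power supply, or a production batch may fail in a correlated way.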

Perhaps wear & tear from regular use increases the likelihood of a "busy" drive failing. But in my experience it hasn't worked out that way. I strongly suspect variations in manufacturing and quality control have more to do with a drive going south than wear and tear does.

One thing I've observed that does have a direct effect on service life is heat. Cases packed with multiple hard drives, inadequate airflow, and "hot room" environments do experience more drive failures than single-drive PCs in normal office or home environments.

Just my 2¢. YMMV.

mouser · « **Reply #13 on:** August 02, 2012, 10:21 AM »

One thing I've observed that does have a direct effect on service life is heat. Cases packed with multiple hard drives, inadequate airflow, and "hot room" environments do experience more drive failures than single-drive PCs in normal office or home environments.

Yes, that seems to be very well established, which is why I now take HD overheating very seriously, when it's something I never paid attention to before.

In fact, that's why I love Crystal Disk Info, and why I linked to it in my first post -- it lets me put a separate icon for each hard drive temperature in the system tray (and it's free). [If you have only one HD in your computer, there are lots of tray-based hd temperature monitors you could use].

TaoPhoenix · « **Reply #14 on:** August 02, 2012, 11:02 AM »

I just downloaded it now.

I'm getting a "caution" in "reallocated sector count" - what does that mean?

Stoic Joker · « **Reply #15 on:** August 02, 2012, 11:36 AM »

Perhaps wear & tear from regular use increases the likelihood of a "busy" drive failing. But in my experience it hasn't worked out that way. I strongly suspect variations in manufacturing and quality control have more to do with a drive going south than wear and tear does.
-40hz (August 02, 2012, 09:47 AM)

That tracks with my experience also. The head/platter relationship is zero contact, so nothing there can really "wear out". But if the thing is spinning, the bearings are making physical contact, and are indeed slowly wearing out.

Only exception I can think of is if something is being badly overused so the read/write lifetime is exceeded in a localized fashion. Like an XP SP2 machine with 256MB RAM from the factory ... They seemed to have an odd habit of blowing a hole in the drive where the pagefile wasn't anymore.

Stoic Joker · « **Reply #16 on:** August 02, 2012, 11:39 AM »

I'm getting a "caution" in "reallocated sector count" - what does that mean?
-TaoPhoenix (August 02, 2012, 11:02 AM)

You are running out of spare sectors, which is one of the signs of the end for the drive.

TaoPhoenix · « **Reply #17 on:** August 02, 2012, 01:07 PM »

I'm getting a "caution" in "reallocated sector count" - what does that mean?
-TaoPhoenix (August 02, 2012, 11:02 AM)

You are running out of spare sectors, which is one of the signs of the end for the drive.
-Stoic Joker (August 02, 2012, 11:39 AM)

Maybe you can give me a little more detail. Like xkcd's jokes about graphs with no axis labels, the caution drive is listed at "97" (of what?) reallocated sectors, with the "threshold of 36" (of what?). But the data drive is listed as "good" with reallocated sectors at "100" (of what?) with a "threshold of 5". Why So Different? So how is one Caution and the other good?

Stoic Joker · « **Reply #18 on:** August 02, 2012, 01:27 PM »

I'm getting a "caution" in "reallocated sector count" - what does that mean?
-TaoPhoenix (August 02, 2012, 11:02 AM)

You are running out of spare sectors, which is one of the signs of the end for the drive.
-Stoic Joker (August 02, 2012, 11:39 AM)

Maybe you can give me a little more detail. Like xkcd's jokes about graphs with no axis labels, the caution drive is listed at "97" (of what?) reallocated sectors, with the "threshold of 36" (of what?). But the data drive is listed as "good" with reallocated sectors at "100" (of what?) with a "threshold of 5". Why So Different? So how is one Caution and the other good?
-TaoPhoenix (August 02, 2012, 01:07 PM)

I was responding to only the "caution" and "reallocated sector" parts. I know nothing about the software other than this morning I learned that Mouser thinks it's really cool. HDDs keep a pool of spare sectors that can be reallocated when one of the normal use sectors (fails/dies/expires) goes poof. Once that starts happening it's generally best to go shopping for a replacement. The intricate details involving how many are where and what's the magic number for too many I've not a clue on ... I'm a network guy...not a hardware guy. But I have seen a lot of disk failures, and they're generally not real slow...

40hz · « **Reply #19 on:** August 02, 2012, 01:32 PM »

^All modern hard drives ship with more usable data sectors than advertised. These extra sectors are used to hold data automatically moved from sectors that exceed read/write validation error thresholds. Think of it as a built-in spare drive area. Once a sector gets reallocated, its old location is marked as bad and never reused.

When you start running out of spare sectors it means more and more live sectors are experiencing serious read/write errors and being removed from allocable storage space. If you run out of spare sectors you are at risk of data loss since there won't be any place to move your data before the sector it is on gets a "hard" or non-recoverable read/write error.

Small numbers of such errors are normal and usually due to problems with the media. Large numbers, or increasing numbers, are more usually caused by the mechanical part of the read/write mechanism wearing out or going out of calibration. That's much more serious because that affects the entire drive.

Hope that explains things.

Stoic Joker · « **Reply #20 on:** August 02, 2012, 01:53 PM »

^So what's your take on the programs output, should he draw three, hold, or fold?

40hz · « **Reply #21 on:** August 02, 2012, 02:11 PM »

When in doubt about that sort of error and data is involved? Replace & migrate. Period. No question.

Or if not, at least check the stats regularly. When these things go they tend to warn you a little but then deteriorate fairly rapidly.

(A drive loaded with files "for the discriminating and advanced collector" takes a lot of time to redownload. And the torrents are starting to get a little dicey. )

MilesAhead · « **Reply #22 on:** August 02, 2012, 02:49 PM »

Hmmm, I think preemptive replacement is an excellent idea. In order to facilitate this new trend I propose any currently working Sata II or Sata III drive over 200 GB be sent to me for green disposal procedures. I "dispose" of them by sticking in a drawer until I need to use 'em in a docking station. The "green" is what I'll save not buying drives at the current gouge prices.

mouser · « **Reply #23 on:** August 02, 2012, 03:05 PM »

f0dder · « **Reply #24 on:** August 02, 2012, 03:12 PM »

One thing I've observed that does have a direct effect on service life is heat. Cases packed with multiple hard drives, inadequate airflow, and "hot room" environments do experience more drive failures than single-drive PCs in normal office or home environments.
-40hz (August 02, 2012, 09:47 AM)

IIRC Google's big harddrive failure report claimed heat wasn't that important wrt. drive death, but it's certainly been my experience as well - but perhaps Google's report measured ambient case temperature and didn't have a lot of drives crammed close together, which would mean pretty hefty temperature hotspots?

Maybe you can give me a little more detail. Like xkcd's jokes about graphs with no axis labels, the caution drive is listed at "97" (of what?) reallocated sectors, with the "threshold of 36" (of what?). But the data drive is listed as "good" with reallocated sectors at "100" (of what?) with a "threshold of 5". Why So Different? So how is one Caution and the other good?
-TaoPhoenix (August 02, 2012, 01:07 PM)

The CrystalDiskInfo program sucks, IMHO, since it shows raw S.M.A.R.T numbers - and those are damn close to meaningless. You need an application that has knowledge of specific brands and can translate the values to something meaningful. (Sigh, "standards" - when can they ever get anything right?)

As soon as the re-allocated sector count (from a program that can correctly display SMART data) goes non-zero, replace the drive. Sure, I've had drives that lasted years after a few reallocated sectors, but it's an indication that you're getting disk errors - at best you risk minor data corruption, at worst the drive goes CLUNK from one day to the next.
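For the "97 vs. threshold 36" question quoted above, it may help to separate the three numbers SMART reports per attribute. The normalized "current" value is a vendor-scaled health score that starts high and counts down, the threshold is the vendor's failure line, and the raw value is (for reallocated sectors on most drives) the actual count. The sketch below combines the standard "current at or below threshold means failed" test with the non-zero rule from this post. Treating a non-zero raw count as "caution" is an assumption about how a monitoring tool might grade it, and the raw values in the example are hypothetical, since the thread does not give them.

```python
def reallocated_status(current: int, threshold: int, raw: int) -> str:
    """Grade a Reallocated Sectors Count attribute.

    current   -- normalized value (higher is healthier, vendor-specific scale)
    threshold -- vendor-defined failure threshold for the normalized value
    raw       -- raw counter, assumed here to be the number of remapped sectors
    """
    if current <= threshold:
        return "bad"      # the drive itself considers the attribute failed
    if raw > 0:
        return "caution"  # some sectors were remapped: plan a replacement
    return "good"


# Normalized values and thresholds from the thread; raw counts are made up.
print(reallocated_status(current=97, threshold=36, raw=12))   # caution
print(reallocated_status(current=100, threshold=5, raw=0))    # good
```

Seen this way, the two drives differ because each vendor picks its own scale and threshold, so the normalized numbers are not comparable across brands.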

It's also worth keeping in mind that drives only reallocate sectors on disk writes - so just attempting to read a bad sector will not cause it to get re-allocated. Thus, it's not the only stat you need to look at - another interesting one is the number of DMA errors. Those can be an indication of bad SATA cables (which is also worrisome), but can definitely also be a sign of drives that are about to die.

Personally, I don't switch out drives before they show signs of being about to die - whether the two aforementioned stats, or "stuff feeling wonky" (machine being slower or even stalling on disk I/O, or drives making noises they don't usually do). Be sure to raid-mirror your important stuff, and also do backups.

Don't even consider other RAID forms than mirroring. Yes, you "waste" a lot of space with mirroring compared to *-5 or *-6 modes, but rebuilding a mirror is a simple linear copy, whereas rebuilding the more complex forms of RAID has more points of failure, and is more intensive on the involved disks. I've heard more than one story of people losing their entire raid-5 arrays... and not having backups (they built the arrays for ZOMG HUGE SIZE, and thought that *-5 really couldn't fail, thus treated it *as* backup...)
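The space-versus-rebuild trade-off can be put in numbers. This is a simplified sketch with hypothetical function names: it assumes equal-size drives, treats "raid1" as an n-way mirror, and counts only how many surviving drives must be read in full to rebuild one failed member.

```python
def usable_tb(level: str, drives: int, size_tb: float) -> float:
    """Usable capacity of an array of equal-size drives."""
    if level == "raid1" and drives >= 2:
        return size_tb                 # every drive holds the same data
    if level == "raid5" and drives >= 3:
        return (drives - 1) * size_tb  # one drive's worth of parity
    if level == "raid6" and drives >= 4:
        return (drives - 2) * size_tb  # two drives' worth of parity
    raise ValueError(f"unsupported layout: {level} with {drives} drives")


def drives_read_for_rebuild(level: str, drives: int) -> int:
    """Surviving drives read in full to rebuild a single failed member."""
    if level == "raid1" and drives >= 2:
        return 1                       # straight copy from one mirror
    if level == "raid5" and drives >= 3:
        return drives - 1              # every survivor is needed
    raise ValueError(f"unsupported layout: {level} with {drives} drives")


print(usable_tb("raid1", 2, 2.0), drives_read_for_rebuild("raid1", 2))  # 2.0 1
print(usable_tb("raid5", 4, 2.0), drives_read_for_rebuild("raid5", 4))  # 6.0 3
```

In the RAID-5 case a read error on any of the surviving drives during the rebuild can sink the array, which is the extra exposure described above.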

Also, a quick mention on SSDs... back up those things even more vigilantly than mechanical drives. Yes, in theory those flash cells should wear out gracefully, and even the MLC variants should last quite a bit longer under normal use than a mechanical disk. Funny thing is, though, that they don't. Or rather, the flash cells don't wear out, but either the firmware goes into retardo-mode (known to happen frequently with SandForce based drives), or other parts of the electronics just go frizzle. And then you're SOL. Really, bigtime SOL. At least with mechanical drives, you can send them off to data recovery services if the data was important enough... much less likely to be able to do that with SSDs, especially with the ones that have hardware encryption.

Me and a classmate had our Vertex2 SSDs die a few weeks apart, after... what, a month or so use? And my Intel X25-E (their ENTERPRISE SLC-based drive) died last month, after a few years of non-intensive use... I'm sure the SLC cells would have several years lifetime left, so it's probably some electronics that went fizzle. Scary that an enterprise drive dies like that :-(