Only if your RAID software/hardware is exceptionally tolerant of the drive.
Don't forget the reason "NAS" drives exist in the first place is that several years ago drive manufacturers added a feature where if a read failed they would go into an extremely thorough but long (60+ second) recovery effort to get the sector back. RAID controllers would just see the drive stop responding to commands and mark it dead. So the NAS drives come with firmware that doesn't do the extreme recovery and instead just returns "read error" and lets the RAID controller rebuild it with the parity information.
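For what it's worth, this behavior (often marketed as TLER or ERC) can usually be inspected via SCT Error Recovery Control with smartmontools; a sketch, assuming a drive at /dev/sda that supports the feature:

```shell
# Show the current SCT Error Recovery Control (ERC) read/write timers.
# NAS drives typically report a bounded value like 7.0 seconds; desktop
# drives often report "Disabled", i.e. unbounded recovery time.
smartctl -l scterc /dev/sda

# On drives that allow it, cap read and write recovery at 7 seconds
# (values are tenths of a second). Not all drives accept this, and on
# many the setting does not persist across a power cycle.
smartctl -l scterc,70,70 /dev/sda
```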
If the drives go out to lunch due to a SMR writeback bottleneck then they will have lost their main selling point. Presumably in the normal case the drive will write the data just fine, but at a slower rate so you can rebuild your array but it will take all week. However, if one of the sectors fails the CRC check after the write and it has to try several times to get it I can definitely see the RAID controller getting frustrated and kicking it out.
I would be interested to see if any RAID software comes with a "SMR" mode where if a drive stops responding to commands during a rebuild the controller lets the drive take a 20 minute break before resuming the rebuild.
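I'm not aware of an explicit "SMR mode", but with Linux md the closest existing knob is the kernel's per-device command timeout, which you can raise so a drive that stalls on writeback gets more slack before the kernel gives up on it; a sketch, assuming the slow drive is /dev/sda:

```shell
# Check the current SCSI-layer command timeout (default is 30 seconds).
cat /sys/block/sda/device/timeout

# Raise it so a drive stalled on SMR writeback isn't failed out of the
# array as quickly. Needs root, and must be reapplied after reboot.
echo 180 > /sys/block/sda/device/timeout
```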
> So the NAS drives come with firmware that doesn't do the extreme recovery and instead just returns "read error" and lets the RAID controller rebuild it with the parity information.
Hang on a sec. Is this documented somewhere?
I bought a WD Red to plug into my Raspberry Pi which I use as a file server. There's no RAID, just the one disk. I thought I was buying a more energy efficient or bulk-storage-oriented drive.
But if what you say is true, then the "NAS" or "Red" drives should _never_ be used outside of a RAID because robust error correction was removed from them by design. Do I have that right?
Basically, NAS drives have a hard limit on how long they'll try to recover from errors before just reporting the failure back to the RAID controller so that it can handle them.
Yes, that's the right idea. NAS/RAID drives have a different error recovery strategy, because the assumption is that they'll be part of a multi-drive arrangement where failing fast (and allowing the containing system to handle recovery) is preferable to avoiding failure if at all possible (but potentially taking a long time and thus causing the containing system to think the drive has stopped functioning properly and fail the whole thing out). I can't point you to any specific documentation off the top of my head, but this is a well-known position that I've seen described explicitly several times.
I'm afraid that does mean your choice of a Red for a single-disk system was not ideal. Presumably you keep backups of any valuable data anyway, but if downtime for recovery would be a significant problem for you then you might want to consider replacing that drive with something more suitable for your situation.
I should hand in my geek card, this feels like something I should have known about. In my defense, though, the HD manufacturers offer little to no information about the _technical_ differences between their drive lines. All of their documentation just says, "designed for X use case".
I do have backups, that's not my concern. My concern is that _when_ there is a read/write error (which are completely normal events with today's hard drive technology), the drive just gives up right away instead of making a few attempts. This could easily translate into (silently!) lost data in a single-disk scenario.
If one uses ZFS, one can instruct ZFS to keep multiple copies of the data. It will try to spread those copies among multiple disks, but in single-disk systems it will just spread the duplicate blocks over that disk.
Since ZFS does checksum verification on every read, it has a much better chance of recovering from a few bad sectors.
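The extra-copies behavior mentioned above is the `copies` dataset property; a minimal sketch, assuming a pool named `tank`:

```shell
# Keep two copies of every block in this dataset; on a single-disk pool
# ZFS spreads the duplicate blocks across the disk. Only data written
# after the property is set gets the extra copy.
zfs create -o copies=2 tank/important

# Or on an existing dataset:
zfs set copies=2 tank/important
```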
Downside though is that the default RPi installs are 32bit and ZFS was written with 64bit-only in mind, and AFAIK there are still some issues and limitations when running on a 32bit system.
If you are ever in a situation where this happens, the drive is at end of life and should be tossed, with the new one rebuilt from backups. You do have backups, right? Drives often fail without notice.
Wait, if it's a NAS drive, the drive firmware will ensure that it doesn't timeout due to media failure. Which the RAID can trust, because it's a NAS drive.
So.. why do the RAID rebuilds have timeouts on NAS drives at all? If you paid all that extra money for a special firmware that doesn't time out on media error, and the drive is still accepting and processing commands in less than X hours per command, then wiring in your own timeout seems like a really bad idea.
When the cache is full and something sends a write to the drive, does the drive still accept "are you still there?" commands while the write is queued?
So the raid software thinks it knows better than the drive firmware, ignores the fact that it's operating a drive with no I/O timeouts, and helpfully times out the drive from the array because obviously it's not behaving 'correctly' in line with the unverified assumptions of the RAID software?
It reads to me like the fault here isn't just on the hard drive manufacturers, like everyone's made it appear in top-level comments of both issues about it this week. I'm glad I asked more questions so that I'm better informed to help my friends when they encounter this. I appreciate everyone in the thread offering help with the details.