Only if your RAID software/hardware is exceptionally tolerant of the drive.
Don't forget the reason "NAS" drives exist in the first place is that several years ago drive manufacturers added a feature where if a read failed they would go into an extremely thorough but long (60+ second) recovery effort to get the sector back. RAID controllers would just see the drive stop responding to commands and mark it dead. So the NAS drives come with firmware that doesn't do the extreme recovery and instead just returns "read error" and lets the RAID controller rebuild it with the parity information.
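For what it's worth, this behavior (often marketed as TLER or ERC) can usually be inspected via SCT Error Recovery Control with smartmontools; a sketch, assuming a drive at /dev/sda that supports the feature:

```shell
# Show the current SCT Error Recovery Control (ERC) read/write timers.
# NAS drives typically report a bounded value like 7.0 seconds; desktop
# drives often report "Disabled", i.e. unbounded recovery time.
smartctl -l scterc /dev/sda

# On drives that allow it, cap read and write recovery at 7 seconds
# (values are tenths of a second). Not all drives accept this, and on
# many the setting does not persist across a power cycle.
smartctl -l scterc,70,70 /dev/sda
```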
If the drives go out to lunch due to a SMR writeback bottleneck then they will have lost their main selling point. Presumably in the normal case the drive will write the data just fine, but at a slower rate so you can rebuild your array but it will take all week. However, if one of the sectors fails the CRC check after the write and it has to try several times to get it I can definitely see the RAID controller getting frustrated and kicking it out.
I would be interested to see if any RAID software comes with a "SMR" mode where if a drive stops responding to commands during a rebuild the controller lets the drive take a 20 minute break before resuming the rebuild.
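I'm not aware of an explicit "SMR mode", but with Linux md the closest existing knob is the kernel's per-device command timeout, which you can raise so a drive that stalls on writeback gets more slack before the kernel gives up on it; a sketch, assuming the slow drive is /dev/sda:

```shell
# Check the current SCSI-layer command timeout (default is 30 seconds).
cat /sys/block/sda/device/timeout

# Raise it so a drive stalled on SMR writeback isn't failed out of the
# array as quickly. Needs root, and must be reapplied after reboot.
echo 180 > /sys/block/sda/device/timeout
```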
> So the NAS drives come with firmware that doesn't do the extreme recovery and instead just returns "read error" and lets the RAID controller rebuild it with the parity information.
Hang on a sec. Is this documented somewhere?
I bought a WD Red to plug into my Raspberry Pi which I use as a file server. There's no RAID, just the one disk. I thought I was buying a more energy efficient or bulk-storage-oriented drive.
But if what you say is true, then the "NAS" or "Red" drives should _never_ be used outside of a RAID because robust error correction was removed from them by design. Do I have that right?
Basically, NAS drives have a hard limit on how long they'll try to recover from errors before just reporting the failure back to the RAID controller so that it can handle them.
Yes, that's the right idea. NAS/RAID drives have a different error recovery strategy, because the assumption is that they'll be part of a multi-drive arrangement where failing fast (and allowing the containing system to handle recovery) is preferable to avoiding failure if at all possible (but potentially taking a long time and thus causing the containing system to think the drive has stopped functioning properly and fail the whole thing out). I can't point you to any specific documentation off the top of my head, but this is a well-known position that I've seen described explicitly several times.
I'm afraid that does mean your choice of a Red for a single-disk system was not ideal. Presumably you keep backups of any valuable data anyway, but if downtime for recovery would be a significant problem for you then you might want to consider replacing that drive with something more suitable for your situation.
I should hand in my geek card, this feels like something I should have known about. In my defense, though, the HD manufacturers offer little to no information about the _technical_ differences between their drive lines. All of their documentation just says, "designed for X use case".
I do have backups, that's not my concern. My concern is that _when_ there is a read/write error (which are completely normal events with today's hard drive technology), the drive just gives up right away instead of making a few attempts. This could easily translate into (silently!) lost data in a single-disk scenario.
If one uses ZFS, one can instruct ZFS to keep multiple copies of the data. It will try to spread those copies among multiple disks, but in single-disk systems it will just spread the duplicate blocks over that disk.
Since ZFS does checksum verification on every read, it has a much better chance of recovering from a few bad sectors.
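The extra-copies behavior mentioned above is the `copies` dataset property; a minimal sketch, assuming a pool named `tank`:

```shell
# Keep two copies of every block in this dataset; on a single-disk pool
# ZFS spreads the duplicate blocks across the disk. Only data written
# after the property is set gets the extra copy.
zfs create -o copies=2 tank/important

# Or on an existing dataset:
zfs set copies=2 tank/important
```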
Downside though is that the default RPi installs are 32bit and ZFS was written with 64bit-only in mind, and AFAIK there are still some issues and limitations when running on a 32bit system.
If you are ever in a situation where this happens, the drive is at end of life and should be tossed, with the new one rebuilt from backups. You do have backups, right? Drives often fail without notice.
Wait, if it's a NAS drive, the drive firmware will ensure that it doesn't timeout due to media failure. Which the RAID can trust, because it's a NAS drive.
So.. why do the RAID rebuilds have timeouts on NAS drives at all? If you paid all that extra money for a special firmware that doesn't time out on media error, and the drive is still accepting and processing commands in less than X hours per command, then wiring in your own timeout seems like a really bad idea.
When the cache is full and something sends a write to the drive, does the drive still accept "are you still there?" commands while the write is queued?
So the raid software thinks it knows better than the drive firmware, ignores the fact that it's operating a drive with no I/O timeouts, and helpfully times out the drive from the array because obviously it's not behaving 'correctly' in line with the unverified assumptions of the RAID software?
It reads to me like the fault here isn't just on the hard drive manufacturers, like everyone's made it appear in top-level comments of both issues about it this week. I'm glad I asked more questions so that I'm better informed to help my friends when they encounter this. I appreciate everyone in the thread offering help with the details.