I also had two mirrored drives fail simultaneously in my zpool a few days ago. There was nothing on them so I wasn't worried. WD Reds in my case.
Using matched drives seems to be a very bad idea for mirrors; I'll probably replace them with two different brands.
I also have a matched pair of HGST SAS Helium drives in the same backplane, so hopefully I can catch those before they fail too if they're going to go at once; I _do_ have data on those.
Buying two identical drives has a high chance of them being from a single batch, which makes them physically almost identical. It's a pretty well-known RAID-related fact, but some people aren't aware of it or don't take it seriously.
If they're bought together, like mine were, and they have close serials, they'll be almost identical; if you then run them in a ZFS mirror like I was, they'll receive identical "load" as well.
Since mine had ~43000 hours, they didn't fail prematurely, they just aged out, and since they appear to have been built pretty well, they both aged out at the same time. Annoying for a ZFS mirror, but indicates good quality control in my opinion.
If they're of ~identical construction and mirrored so that they have the same write/read pattern history, that could trigger the same failure mode in both simultaneously.
Why bad? What's considered a good/bad lifetime for these? Mine had ~43000 power on hours, I don't know if that's good or bad for a WD Red (CMR) drive, but they weren't particularly heavily loaded, and their temps were good, so I'm fairly happy with how long they lasted (though longer would have been nice).
He didn't have a RAID array on his laptop; that would have saved him as well, with not even a second's worth of data lost.
He could also have kept using his system without replacing any drive right away, would have had a much quicker recovery, and could have used the system during the rebuild.
Now, RAID on a laptop is mostly reserved for bigger units, and the author's setup is awesome as well, but a single drive failure is one of the most common reasons (especially where you make use of snapshots) you need your backup, and that is exactly what RAID solves.
I've frequently had drives in a RAID fail in rapid succession. If you buy a bunch of identical drives at the same time and put them in a RAID, then you can end up with:
* They were manufactured in the same batch, maybe even one right after another on the same line.
* As they were transported from manufacturer to OEM to you, they were exposed to the same environmental conditions, right down to vibrations, humidity, and ambient EM environment.
* As you use them, they continue to be exposed to the same environmental conditions, including power supply fluctuations and power inductively coupled into places it doesn't belong.
* They see the same usage patterns. Depending on the RAID specifics, that might be right down to the same disk locations seeing the same read and write volume.
It's then not surprising if they fail at about the same time.
For the last machine I put together that I wanted to have high availability, I intentionally bought two different brands of drive to put in the mirror, to maximize the likelihood that they fail at very different times.
Many years ago (c. 2003) the group I was working in inherited a massive 6U storage server with an insane number of 10k SCSI (it was before SAS was a thing) drives. We named it "hurricane" for the sound it made. After a few weeks of using it, the first drive failed. It rebuilt to a hot spare and we ordered and eventually installed a replacement. A few weeks later, another drive failed, and this time, before it could finish rebuilding, two more drives in the RAID failed and its contents were lost (but we had a good backup). We never used it again. For a while I used it as a coffee table, but then someone convinced me that was too tacky, and it got e-wasted.
> It's then not surprising if they fail at about the same time.
It is, but in a different way. It is a testament to the depth and precision of manufacturing process control that two insanely complex machines will behave nearly identically for years, up to the point of failing at about the same time, if they've been made in the same batch and exposed to about the same environment and usage patterns over those years. You'd expect any number of random factors to cause one drive to fail way before the other, but no - not only is there very little variation between drives in a batch, but tiny variations in usage are damped down instead of amplified.
> If you buy a bunch of identical drives at the same time and put them in a RAID
When setting up a new machine with zfs I intentionally buy drives from as many different brands and models as possible to spread the manufacturing defect risk.
Not extremely unlikely if they were identical drives from the same manufacturing batch. It's good practice to use diverse manufacturers, or at least batches, when adding disks to a RAID array for just this reason.
It's not unreasonable to believe that if you pick two identical products off the same shelf at the same time (as one would logically do when purchasing 2 of a single item), that the two products were manufactured at similar times and in similar conditions.
Your model isn't exactly bad, but there is an assumption being made that you haven't accounted for, which, to be fair, is frequently not stated: that the drives' defects are independent of one another. This is a poor assumption when they're manufactured back to back.
https://news.ycombinator.com/item?id=32026606 Hacker News went down a while back because of the 40k-hour bug. Both the primary and backup servers were placed into service at the same time with SSDs that had an overflow after ~40k hours.
Ah, don't be scared. You're at least starting to think about your data and replicating your most important stuff elsewhere. I'd still recommend a non-cloud copy as well, but you're probably okay :)
Take it as a good impetus to catalog your data and find those extra replication options. Data protection does cost a bit to do "fully", but having worked on a backup solution in a client-facing role, trust me when I tell you that I've seen rather large businesses (a few you might even know as household names) show less consideration for their data than you've expressed in your three sentences :)
So just figure out which of your data _truly_ needs to survive at all costs, get a solid setup with personally owned storage for the backups in combination with cloud storage, and you're probably fine.
The thing that has been worrying me lately, with tens of TB of files I need to look after, is: how do I know the files haven't silently gotten corrupted somehow? I feel like I need to periodically re-checksum everything and keep the hashes in a database somewhere off to the side.
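For what it's worth, a minimal sketch of the "hashes off to the side" idea, assuming GNU coreutils' sha256sum and placeholder paths (/data for the tree to protect, /var/backups/hashes for the manifests):

    #!/bin/sh
    # Periodically re-checksum everything and compare against the last manifest.
    set -eu

    DATA=/data                        # placeholder: the tree to protect
    MANIFEST_DIR=/var/backups/hashes  # placeholder: hashes kept off to the side
    mkdir -p "$MANIFEST_DIR"

    NEW="$MANIFEST_DIR/manifest.$(date +%Y%m%d%H%M).sha256"

    # The slow pass: hash every regular file.
    find "$DATA" -type f -print0 | xargs -0 sha256sum > "$NEW"

    # Verify the previous manifest (if any) against what's on disk now.
    # Anything reported either changed legitimately or rotted silently;
    # cross-check the list against what you know you edited.
    OLD=$(ls -1 "$MANIFEST_DIR"/manifest.*.sha256 | grep -v "$NEW" | tail -n 1 || true)
    if [ -n "$OLD" ]; then
        sha256sum --check --quiet "$OLD" || echo "mismatches found; see the list above"
    fi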
rsync has a --checksum flag and also supports incremental backups, so depending on your situation you can work out a logic that matches your access patterns. That is, if you know for sure that you personally will not touch the files after 20:00 and no one else will either, it's _fairly_ reasonable to run rsync after you're done working, then schedule an incremental --checksum pass a bit before you start working again: anything that pass would copy to your backup has likely changed in an unintended way, and the list should be small enough to check each morning, if there's anything at all. Keep in mind that OS-level deduplication may mess with this (Windows' Dedup would undoubtedly break this scheme, since it may show files as modified after the Optimization job runs).
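A rough sketch of that schedule, with made-up paths (/srv/data for the working set, /mnt/backup for the target): the evening pass is a normal incremental sync, and the morning pass is a --checksum dry run whose output you eyeball rather than an actual copy.

    # Evening (~20:00, after you stop working): normal incremental sync.
    rsync -a --delete /srv/data/ /mnt/backup/data/

    # Morning (before you start working): nothing should have changed overnight,
    # so anything listed here differs in an unintended way (on either side).
    rsync -a --checksum --itemize-changes --dry-run /srv/data/ /mnt/backup/data/ \
        > /tmp/overnight-diff.txt

    # Empty output is the happy case; review anything that shows up.
    [ -s /tmp/overnight-diff.txt ] && cat /tmp/overnight-diff.txt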
Alternatively, consider just using a filesystem that can do integrity checks for you. I know ReFS has integrity streams, and ZFS checksums all data as well (XFS, as far as I know, only checksums metadata). It won't prevent corruption, but it gives you something you can monitor for when the filesystem reports an issue.
Some combination like this should work.
Similarly, you might be able to come up with a fast trick using stat; with some quick testing on a dummy file in macOS's zsh shell, you can do something like:
stat -f %m somedir/*
and compare the resulting values by passing them to sum or something. I'm not super familiar with stat in general, so I'm likely missing elements that make this unreliable, but I'd consider looking into it further unless someone tells me it's 100% the wrong direction and explains why.
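Building on that, one quick-and-dirty version, assuming BSD/macOS stat (GNU stat on Linux spells the format flag differently) and a placeholder directory somedir; note this only catches changes that touch mtime, so corruption that leaves timestamps alone would slip right past it:

    # Snapshot "mtime path" for every file under somedir, sorted by path so
    # two snapshots can be diffed directly.
    find somedir -type f -exec stat -f '%m %N' {} + | sort -k2 > mtimes.new

    # Any output here is a file whose timestamp changed since the last snapshot.
    [ -f mtimes.old ] && diff mtimes.old mtimes.new

    mv mtimes.new mtimes.old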
Note that rsync generally runs far, far more slowly with the --checksum flag. And I can't recall it ever saying something like "change(s) in $FileName were only noticed by checksum", which is what would have alerted me to quiet disk corruption.
ZFS has checksums, and the 'zpool scrub' command tells it to verify those (on all copies of your data, if you're using RAID).
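For reference, with a hypothetical pool named "tank", that's just:

    # Re-read every block and verify its checksum (repairing from the other
    # mirror/parity copy where possible).
    zpool scrub tank

    # Check progress and whether any checksum errors were found or repaired.
    zpool status -v tank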
> Google assumes you’re looking for product pages when you search for things like “4TB SanDisk SSD,” so news stories like ours and Ars Technica’s appear far down search results.
Well, I've been test-driving the paid Kagi search engine, and this was an excellent opportunity to see if a different class of web search could produce different results...
Sadly, I'm afraid that when the whole web is flooded with praising articles, a different set of prioritization rules will still struggle to show different results:
* The first 5-6 results are from shops.
* Then come some reviews, from easeus [1], techpowerup [2], anandtech [3], and consumerreviews [4]. None of them contain the word "fail".
* Lastly, and this is the only actual improvement over Google, there are some relevant search suggestions such as "sandisk 4tb ssd failure" and "sandisk 4tb ssd problems". They're difficult to see (at the bottom of the page), but at least better than Google, where the word "fail" doesn't appear at all on the first results page.
https://petapixel.com/2023/08/08/sandisk-portable-ssds-are-f...
https://www.theverge.com/22291828/sandisk-extreme-pro-portab...
https://news.ycombinator.com/item?id=37042587
https://www.theverge.com/23837513/western-digital-sandisk-ss...
https://news.ycombinator.com/item?id=37188736