I also had two mirrored drives fail simultaneously in my zpool a few days ago. There was nothing on them so I wasn't worried. WD Reds in my case.
Using matched drives seems to be a very bad idea for mirrors; I'll probably replace them with two different brands.
I also have a matched pair of HGST SAS Helium drives in the same backplane, so hopefully I can catch those before they fail too if they're going to go at once; I _do_ have data on those.
Buying two identical drives has a high chance of them being from a single batch, which makes them physically almost identical. It's a pretty well-known RAID-related fact, but some people aren't aware of it or don't take it seriously.
If they're bought together, like mine were, and they have close serials, they'll be almost identical; if you then run them in a ZFS mirror like I was, they'll receive identical "load" as well.
Since mine had ~43000 hours, they didn't fail prematurely, they just aged out, and since they appear to have been built pretty well, they both aged out at the same time. Annoying for a ZFS mirror, but indicates good quality control in my opinion.
If they're of ~identical construction and mirrored so that they have the same write/read pattern history, that could trigger the same failure mode in both simultaneously.
Why bad? What's considered a good/bad lifetime for these? Mine had ~43000 power on hours, I don't know if that's good or bad for a WD Red (CMR) drive, but they weren't particularly heavily loaded, and their temps were good, so I'm fairly happy with how long they lasted (though longer would have been nice).
He didn't have a RAID array on his laptop; that would have saved him as well, with not even a second's worth of data lost.
He could also have kept using his system without replacing any drive right away, would have had a much quicker recovery, and could have used the system during the rebuild.
Now, RAID on a laptop is mostly reserved for bigger units, and the author's setup is awesome as well, but a single drive failure is one of the most common reasons (especially where you make use of snapshots) you need your backup, and that is exactly what RAID solves.
I've frequently had drives in a RAID fail in rapid succession. If you buy a bunch of identical drives at the same time and put them in a RAID, then you can end up with:
* They were manufactured in the same batch, maybe even one right after another on the same line.
* As they were transported from manufacturer to OEM to you, they were exposed to the same environmental conditions, right down to vibrations, humidity, and ambient EM environment.
* As you use them, they continue to be exposed to the same environmental conditions, including power supply fluctuations and power inductively coupled into places it doesn't belong.
* They see the same usage patterns. Depending on the RAID specifics, that might be right down to the same disk locations seeing the same read and write volume.
It's then not surprising if they fail at about the same time.
For the last machine I put together that I wanted to have high availability, I intentionally bought two different brands of drive to put in the mirror, to maximize the likelihood that they fail at very different times.
Many years ago (c. 2003) the group I was working in inherited a massive 6U storage server with an insane number of 10k SCSI (it was before SAS was a thing) drives. We named it "hurricane" for the sound it made. After a few weeks of using it, the first drive failed. It rebuilt to a hot spare and we ordered and eventually installed a replacement. A few weeks later, another drive failed, and this time, before it could finish rebuilding, two more drives in the RAID failed and its contents were lost (but we had a good backup). We never used it again. For a while I used it as a coffee table, but then someone convinced me that was too tacky, and it got e-wasted.
> It's then not surprising if they fail at about the same time.
It is, but in a different way. It is a testament to the depth and precision of manufacturing process control that two insanely complex machines will behave nearly identically for years, up to the point of failing at about the same time, if they've been made in the same batch and exposed to about the same environment and usage patterns over those years. You'd expect any number of random factors to cause one drive to fail way before the other, but no - not only is there very little variation between drives in a batch, but tiny variations in usage are damped down instead of amplified.
> If you buy a bunch of identical drives at the same time and put them in a RAID
When setting up a new machine with zfs I intentionally buy drives from as many different brands and models as possible to spread the manufacturing defect risk.
Not extremely unlikely if they were identical drives from the same manufacturing batch. It's good practice to use diverse manufacturers, or at least batches, when adding disks to a RAID array for just this reason.
It's not unreasonable to believe that if you pick two identical products off the same shelf at the same time (as one would logically do when purchasing 2 of a single item), that the two products were manufactured at similar times and in similar conditions.
Your model isn't exactly bad, but there is an assumption being made that you haven't accounted for, which, to be fair, is frequently not stated: that the drives' defects are independent of one another. This is a poor assumption when they're manufactured back to back.
https://news.ycombinator.com/item?id=32026606 Hacker News went down a while back because of the 40k-hour bug. Both the primary and backup servers were placed into service at the same time with SSDs that had an overflow after ~40k hours.
Ah, don't be scared. You're at least starting to think about your data and replicating your most important stuff elsewhere. I'd still recommend a non-cloud copy as well, but you're probably okay :)
Take it as a good impetus to catalog your data and find those extra replication options. Data protection does cost a bit to do "fully", but having worked on a backup solution in a client-facing role, trust me when I tell you that I've seen rather large businesses (a few you might even know as household names) show less consideration for their data than you've expressed in your three sentences :)
So just figure out which of your data _truly_ needs to survive at all costs, get a solid setup with personally owned storage for the backups in combination with cloud storage, and you're probably fine.
The thing that has been worrying me lately, with tens of TB of files I need to look after, is: how do I know the files haven't silently gotten corrupted somehow? I feel like I need to periodically re-checksum everything and keep the hashes in a database somewhere off to the side.
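For what it's worth, a minimal sketch of the "hashes off to the side" idea, assuming GNU coreutils' sha256sum and placeholder paths (/data for the tree to protect, /var/backups/hashes for the manifests):

    #!/bin/sh
    # Periodically re-checksum everything and compare against the last manifest.
    set -eu

    DATA=/data                        # placeholder: the tree to protect
    MANIFEST_DIR=/var/backups/hashes  # placeholder: hashes kept off to the side
    mkdir -p "$MANIFEST_DIR"

    NEW="$MANIFEST_DIR/manifest.$(date +%Y%m%d%H%M).sha256"

    # The slow pass: hash every regular file.
    find "$DATA" -type f -print0 | xargs -0 sha256sum > "$NEW"

    # Verify the previous manifest (if any) against what's on disk now.
    # Anything reported either changed legitimately or rotted silently;
    # cross-check the list against what you know you edited.
    OLD=$(ls -1 "$MANIFEST_DIR"/manifest.*.sha256 | grep -v "$NEW" | tail -n 1 || true)
    if [ -n "$OLD" ]; then
        sha256sum --check --quiet "$OLD" || echo "mismatches found; see the list above"
    fi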
rsync has a --checksum flag and also supports incremental backups, so depending on your situation you can work out a logic that matches your access patterns. That is, if you know for sure that you personally will not touch the files after 20:00 and no one else will either, it's _fairly_ reasonable to run rsync after you're done working, then schedule an incremental --checksum pass a bit before you start working again: anything that pass would copy to your backup has likely changed in an unintended way, and the list should be small enough to check each morning, if there's anything at all. Keep in mind that OS-level deduplication may mess with this (Windows' Dedup would undoubtedly break this scheme, since it may show files as modified after the Optimization job runs).
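A rough sketch of that schedule, with made-up paths (/srv/data for the working set, /mnt/backup for the target): the evening pass is a normal incremental sync, and the morning pass is a --checksum dry run whose output you eyeball rather than an actual copy.

    # Evening (~20:00, after you stop working): normal incremental sync.
    rsync -a --delete /srv/data/ /mnt/backup/data/

    # Morning (before you start working): nothing should have changed overnight,
    # so anything listed here differs in an unintended way (on either side).
    rsync -a --checksum --itemize-changes --dry-run /srv/data/ /mnt/backup/data/ \
        > /tmp/overnight-diff.txt

    # Empty output is the happy case; review anything that shows up.
    [ -s /tmp/overnight-diff.txt ] && cat /tmp/overnight-diff.txt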
Alternatively, consider just using a filesystem that can do integrity checks for you. I know ReFS has integrity streams, and ZFS checksums all data as well (XFS, as far as I know, only checksums metadata). It won't prevent corruption, but it gives you something you can monitor for when the filesystem reports an issue.
Some combination like this should work.
Similarly, you might be able to come up with a fast trick using stat; with some quick testing on a dummy file in macOS's zsh shell, you can do something like:
stat -f %m somedir/*
and compare the resulting values by passing them to sum or something. I'm not super familiar with stat in general, so I'm likely missing elements that make this unreliable, but I'd consider looking into it further unless someone tells me it's 100% the wrong direction and explains why.
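Building on that, one quick-and-dirty version, assuming BSD/macOS stat (GNU stat on Linux spells the format flag differently) and a placeholder directory somedir; note this only catches changes that touch mtime, so corruption that leaves timestamps alone would slip right past it:

    # Snapshot "mtime path" for every file under somedir, sorted by path so
    # two snapshots can be diffed directly.
    find somedir -type f -exec stat -f '%m %N' {} + | sort -k2 > mtimes.new

    # Any output here is a file whose timestamp changed since the last snapshot.
    [ -f mtimes.old ] && diff mtimes.old mtimes.new

    mv mtimes.new mtimes.old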
Note that rsync generally runs far, far more slowly with the --checksum flag. And I can't recall it ever saying something like "change(s) in $FileName were only noticed by checksum", which is what would have alerted me to quiet disk corruption.
ZFS has checksums, and the 'zpool scrub' command tells it to verify those (on all copies of your data, if you're using RAID).
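For reference, with a hypothetical pool named "tank", that's just:

    # Re-read every block and verify its checksum (repairing from the other
    # mirror/parity copy where possible).
    zpool scrub tank

    # Check progress and whether any checksum errors were found or repaired.
    zpool status -v tank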
> Google assumes you’re looking for product pages when you search for things like “4TB SanDisk SSD,” so news stories like ours and Ars Technica’s appear far down search results.
Well, I've been test-driving the paid Kagi search engine, and this was an excellent opportunity to see if a different class of web search could produce different results...
Sadly, I'm afraid that when the whole web is flooded with praising articles, a different set of prioritization rules will still struggle to show different results:
* The first 5-6 results are from shops.
* Then come some reviews, from easeus [1], techpowerup [2], anandtech [3], and consumerreviews [4]. None of them contain the word "fail".
* Lastly, and this is the only actual improvement over Google, there are some relevant search suggestions such as "sandisk 4tb ssd failure" and "sandisk 4tb ssd problems". They're difficult to see (at the bottom of the page), but at least better than Google, where the word "fail" doesn't appear at all on the first results page.
https://petapixel.com/2023/08/08/sandisk-portable-ssds-are-f...
https://www.theverge.com/22291828/sandisk-extreme-pro-portab...
https://news.ycombinator.com/item?id=37042587
https://www.theverge.com/23837513/western-digital-sandisk-ss...
https://news.ycombinator.com/item?id=37188736