ZFS is also crazy good at surviving disks with bad sectors (as long as they still respond quickly). Check out this paper: https://research.cs.wisc.edu/wind/Publications/zfs-corruptio...

It even spreads the metadata across the disk by default. I'm running on some old WD Greens with 1500+ bad sectors and it's cruising along with RAIDZ just fine.

There is also failmode=continue, where ZFS doesn't hang when it can't read something. If you have a distributed layer above ZFS that also checksums (like HDFS), you can get pretty far even without RAID and with quite broken disks. There is also copies=n. When ZFS broke, the disk usually stopped talking or died a few days later. btrfs and ext4 just choke and remount read-only quite fast (probably the best and correct course of action), but you can tell ZFS to just carry on! Great piece of engineering!
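A minimal sketch of those knobs, assuming a pool called "tank" and a made-up dataset name (failmode is a pool property, copies is a per-dataset property):

  # return errors on failed I/O instead of hanging the whole pool
  zpool set failmode=continue tank

  # keep two copies of every data block in this dataset, even on a single disk
  zfs set copies=2 tank/some-dataset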




Pretty fascinating. But just based on this comment, I reckon these drives with 1500+ bad sectors aren't worth your time. So, why? Is it just that you wanted to play with all these options and don't really care about the data on these drives, or do you actually believe it's reasonable bang for the buck?


I forgot the disclaimer that you should not do this, ever :)

We had a cluster for Hadoop experiments at uni and no resources to replace all the faulty disks at the time (20-30% were faulty to some degree according to the SMART values, out of more than 150 disks). So this was kind of an experiment. All the data in use was available and backed up outside of that cluster. The problem was that with ext4, certain disks always switched to read-only after running a job, which was a major hassle because that node then had to be fixed by hand. HDFS is 3x replicated and checksummed, and the disks usually kept working fine for quite a while after the first bad sector.

So we switched to ZFS, ran weekly scrubs, only replaced disks that didn't survive the scrub in reasonable time or with a reasonable failure rate, and bumped up the HDFS checksum reads so that everything gets verified about once a week. The working directory for the layer above (MapReduce and the like) got a dataset with copies=2 so that intermediate data stays intact within reason. This was for learning and research purposes, where top speed or 100% integrity didn't matter and uptime and usability were more important. Basically, the metadata on disk had to be sound, and the data on any single disk didn't matter that much. It was quite a ride, and the whole thing has long since been replaced.
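Roughly what that looks like in commands, as a sketch (the pool/dataset names and the exact HDFS scanner period are assumptions, not the cluster's real config):

  # weekly scrub, e.g. from a per-node cron job
  zpool scrub tank

  # dataset for MapReduce intermediate data with extra redundancy
  zfs create -o copies=2 tank/mapred-local

  # HDFS side (hdfs-site.xml): shorten the DataNode block scanner period
  # so every replica's checksum is verified roughly once a week
  #   dfs.datanode.scan.period.hours = 168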

Just thought it's interesting how far you can push this. In the end it worked, but it turned out there is no magic: disks die sooner or later and sometimes take the whole node with them.

Don't go to eBay and buy broken disks believing that they will work with ZFS. Some survive a while, most die fast, some exhibit strange behavior.

That RAIDZ is more or less for "let's see where this goes" purposes; backups are in place, and it's not a production system.


Hah, thanks for the story.

It seems that limited resources often lead to some interesting solutions (and to learning new things). A factor that is not very common in VC-backed companies.



