Bug hunting in Btrfs

JonChesterfield · on March 20, 2024

The visualisation of the data race in this post is superb. Worth reading just for that.

Handrolled concurrency using atomic integers in C. Without the proof system or state machine model to support it. Seems likely that's not going to be their only race condition.

re · on March 20, 2024

The animations also stood out to me. I took a look at the source to see if they used a library or were written completely by hand and was surprised to see that they were entirely CSS-based, no JavaScript required (although the "metadata" overwrite animation is glitched if you disable JS, for some reason not immediately apparent to me).

bhaney · on March 21, 2024

> for some reason not immediately apparent to me

The CSS that sets the background of the code elements (so they properly cover up the ones under them) is attached to the hljs class, which is added to those elements by highlight.js.

No JS > no classname added > no background style > jank

tavianator · on March 21, 2024

Thanks for pointing this out to me! I'll see if I can pre-render the syntax highlighting like I do with KaTeX and stuff. No reason to do that on the client.

russell_sears · on March 21, 2024

This looks like the sort of bug I'd write back when I used mutexes to write I/O routines. These days, I'd use a lock-free state machine to encode something like this:

   NOT_IN_CACHE -> READING -> IN_CACHE

(the real system would need states for cache eviction, and possibly page mutation).

Readers that encounter the READING state would insert a completion handler into a queue, and readers transitioning out of the READING state would wake up all the completion handlers in the queue.

I've been working on an open source library and simple (manual) proof system that makes it easy to verify that the queue manipulation and the state machine manipulation are atomic with respect to each other:

https://docs.rs/atomic-try-update/0.0.2/atomic_try_update/

The higher level invariants are fairly obvious once you show that the interlocks are correct, and showing the interlocks are correct is just a matter of a quick skim of the function bodies that implement the interlocks for a given data type.

I've been looking for good write ups of these techniques, but haven't found any.

tavianator · on March 21, 2024

The existing btrfs code does use a lock-free state machine for this, `eb->bflags`, that sort of mirrors regular page flags (hence `UPTODATE`, `DIRTY`, etc.).

But Linux kernel APIs like test_bit(), set_bit(), clear_bit(), test_and_set_bit() etc. only work on one bit at a time. The advantage is they can avoid a CAS loop on many platforms. The disadvantage is you only get atomic transitions for one bit at a time. So the `READING -> UPTODATE` transition is more like

    READING -> (READING | UPTODATE) -> UPTODATE

And the `NOT_IN_CACHE -> READING` transition is not fully atomic at all:

    if (!(bflags & UPTODATE)) // atomic
                              // race could happen here
        bflags |= READING;    // atomic

The whole state machine could be made atomic with CAS, but that would be (slightly) more expensive.

zogomoox · on March 21, 2024

That not-invented-here locking mechanism was a big shock to me. I'd be very interested to know the rationale behind that, are locking primitives somehow not available in file system code?

tavianator · on March 21, 2024

Locks are perfectly usable in filesystem code, but test_and_set_bit()/wait_on_bit() has lower overhead, so they'll get used as an optimization. This function is called on every metadata read, so the improved performance/scalability of raw atomics over locks can probably make a difference on fast storage.

Also the code used to use locks, and it wasn't any simpler: https://lore.kernel.org/linux-btrfs/20230503152441.1141019-2...

Sakos · on March 21, 2024

Anybody know how the visualisation was done?

tavianator · on March 21, 2024

I did it by hand with CSS: https://github.com/tavianator/tavianator.com/blob/main/src/2...

Sakos · on March 22, 2024

Dang, that seems like a lot of effort. It looks great though and it's such a big help for understanding something like this. I wonder if there are tools to automate this sort of thing. Like, is there a debugger that could highlight code that it passes through and when?

n8henrie · on March 21, 2024

I've been overall very happy with: - Arch on BTRFS RAID1 root, across 3 dissimilar NVME drives (about 7 years, 3 drive replacements for hardware failure, I don't think ZFS supports this configuration) - numerous low-power systems (like Pi3s) on BTRFS root (also going on 7 years for several of these, lighter on resources than ZFS) - Asahi NixOS on BTRFS root (kernel doesn't support ZFS)

My NAS and several larger datasets are on ZFS based on reputation alone, but honestly I've had more data loss scares with ZFS than BTRFS (drives that have disappeared for no reason then reappeared hours later, soft locked and unable to unmount indefinitely, several unfortunate user-error issues with automounting datasets overlaying necessary top-level directories and preventing successful boots), and I find the BTRFS tooling more intuitive.

For my hobbyist-level homelab type needs, I would say I'm overall pretty happy with BTRFS. The only issue I've never been able to resolve is regarding lockups when issuing quotas -- another reason I stick to ZFS for my spinning rust storage drives.

Oh, and the ZFS ability to mount a zvol as a foreign filesystem (words?) lets me `btrfs send` backups to ZFS, which is nice!

frankjr · on March 23, 2024

> about 7 years, 3 drive replacements for hardware failure

You've had 3 SSD failures in just 7 years? Any more details about those disks?

n8henrie · on March 24, 2024

Cheap NVME drives.

And to be fair, 2 of 3 were only suspicious for having a hardware problem so I replaced with a larger drive (and was super nice to have such a straightforward was to expand the size as compared to ZFS).

cies · on March 20, 2024

I had my /home on a subvolume (great as the sub and super share the same space).

When I wanted to reinstall I naively thought I could format the root super volume and keep /home subvolume -- but this was impossible: I had to format /home as well according to the OpenSUSE Tumbleweed installer.

Major problem for me. I now have separate root (btrfs) and home (ext4) partitions.

vetinari · on March 21, 2024

That's how subvolumes work; not just on btrfs, but on zfs as well.

The non-footgun way is to have several subvolumes (@root, @boot, @home for btrfs, rpool/ROOT/system_instance and rpool/USERDATA for zfs), and nothing in else in the volume itself. Then, you wipe the system subvolume and create a new one, with new system. Or just keep the old system subvolume and create a new one.

stryan · on March 20, 2024

You can do it but it's not a very happy path. Easiest way is probably to map the old home subvolume into a different path and either re-label the subvolumes once you're installed or just copy everything over.

Separate BTRFS root and ext4 home partitions is either the default filesystem layout now if you're not doing FDE or the second recommended one.

cies · on March 21, 2024

I do disk encryption of root and home. Not of boot and swap (the attack needed to steal my data in that scenario is too involved: according to my assessment my data is not that valuable)

londons_explore · on March 20, 2024

One btrfs bug which is 100% reproducible:

* Start with an ext3 filesystem 70% full.

* Convert to btrfs using btrfs-convert.

* Delete the ext3_saved snapshot of the original filesystem as recommended by the convert utility.

* Enable compression (-o compress) and defrag the filesystem as recommended by the man page for how to compress all existing data.

It fails with out of disk space, leaving a filesystem which isn't repairable - deleting files will not free any space.

The fact such a bug seems to have existed for years, with such a basic following of the man pages for a common use case (migration to btrfs to make use of its compression abilities to get more free space), tells me that it isn't yet ready for primetime.

orev · on March 20, 2024

As a Linux user since kernel 0.96, I have never once considered doing an in-place migration to a new file system. That seems like a crazy thing to try to do, and I would hope it’s only done as a last resort and with all data fully backed up before trying it.

I would agree that if this is presented in the documentation as something it supports, then it should work as expected. If it doesn’t work, then a pull request to remove it from the docs might be the best course of action.

londons_explore · on March 20, 2024

The design of the convert utility is pretty good - the convert is effectively atomic - at any point during conversion, if you kill power midway, the disk is either a valid ext4 filesystem, or a valid btrfs filesystem.

jorvi · on March 21, 2024

It is atomic, but at least for me it left both types of checks that BTRFS can do with unrecoverable errors for a few blocks.

All in all not too terrible and I just accepted the bogus data as good by resetting (?) the checksums for those blocks. But acting as if it’s just a terminal command and Bob’s your uncle.. that’s just wrong.

thfuran · on March 20, 2024

>That seems like a crazy thing to try to do

It seems like a reasonable thing to want to do. Would you never update an installed application or the kernel to get new features or fixes? I don't really think there's much fundamental difference. If what you mean is "it seems likely to fail catastrophically", well that seems like an indication that either the converter or the target filesystem isn't in a good state.

orev · on March 20, 2024

Live migration of a file system is like replacing the foundation of your house while still living in it. It’s not the same thing as updating apps which would be more like redecorating a room. Sure it’s possible, but it’s very risky and a lot more work than simply doing a full backup and restore.

tredre3 · on March 21, 2024

What's a full backup and restore in your analogy? Building a new house? Moving the house to a safe place, replace the foundation, bring the house back?

geraldhh · on March 21, 2024

> Moving the house to a safe place, replace the foundation, bring the house back?

this seems sensible

bartvk · on March 20, 2024

> It seems like a reasonable thing to want to do

This was actually a routine thing to do, under iOS and macOS, with the transition from HFS to APFS.

thijsvandien · on March 20, 2024

Don't forget about FAT32 to NTFS long before that.

geraldhh · on March 21, 2024

both transitions were mandated by a supporting organization thou.

yjftsjthsd-h · on March 20, 2024

There's a world of difference between a version update and completely switching software. Enabling new features on an existing ext4 file system is something I would expect to be perfectly safe. In-place converting an ext4 file system into btrfs... in an ideal world that would work, of course, but it sounds vastly more tricky even in the best case.

KronisLV · on March 20, 2024

> Would you never update an installed application or the kernel to get new features or fixes?

Honestly, if that was an option without the certainty of getting pwned due to some RCE down the road, then yes, there are cases where I absolutely wouldn't want to update some software and just have it chugging away for years in its present functional state.

thfuran · on March 20, 2024

And there are no cases where you actually want a new feature?

KronisLV · on March 21, 2024

> And there are no cases where you actually want a new feature?

Sure there are, sometimes! But at the same time, in other situations stability and predictability would take precedence over anything else, e.g. any new features that might get released probably wouldn't matter too much for a given use case.

For example, I could take a new install of MariaDB that has come out recently and use it for the same project for 5 years with no issues, because the current feature set would be enough for the majority of use cases. Of course, that's not entirely realistic, because of the aforementioned security situation.

Applies the same to OS and/or kernel versions, like how you could take a particular version of RHEL or Ubuntu LTS and use it for the lifespan of some project, although in that case you do at least get security updates and such.

tuyiown · on March 20, 2024

I don't know if you talk only about linux or you meant your comment as a generalization, but have you heard of in place APFS migration from HFS+ ?

bayindirh · on March 20, 2024

Similarly Microsoft offered FAT32 to NTFS in-place migration, and it did the required checks before starting to ensure it completes successfully. It was more than 20 years ago IIRC.

tuyiown · on March 21, 2024

The notable thing with APFS is that the migration came automatically with an OS update (to high sierra) by default.

bayindirh · on March 21, 2024

Yes, I think also mounting ext3 systems as ext4 and migrating it to ext4 as you access it is a good way too. IIRC you had to enable a couple of flags in the ext3 system with tune2fs to enable ext4 features as well, but it didn't break, and migrated as you write & delete files.

jcalvinowens · on March 20, 2024

It's open source: if things don't get fixed, it's because no user cares enough to fix it. I'm certainty not wasting my time on that, nobody uses ext3 anymore!

You are as empowered to fix this as anybody else, if it presents a real problem for you.

romanows · on March 20, 2024

I think the parent comment's fix is to not use btrfs and warn others about how risky they estimate it is.

noncoml · on March 20, 2024

> You are as empowered to fix this as anybody else, if it presents a real problem for you.

I’m getting sick and tired of this argument. It’s very user unfriendly

Do you work in software? If yes, do you own a component? Have you tried using this argument with your colleagues that have no idea about your component?

Of course they theoretically are empowered to fix the bug, but that doesn’t make it easier. They may have no idea about how filesystem internal. Or they may be just users and have no programming background at all

jcalvinowens · on March 21, 2024

It's not unfriendly, it's just how it is. The fact you're comparing it to industry shows you don't get it.

Open source doesn't work for you. You can't demand that it do things you want: you have to push and justify them yourself, and hopefully you find other people along the way who want the same things and can help you.

I taught myself to program as a teenager trying to fix bugs in Linux. If I could do it, anybody can. That's what makes it empowering.

mardifoufs · on March 21, 2024

Well I think you're completely right, in a way. But I also don't see the problem with people telling other people to not use the software or to complain about it. They are totally in their right to complain about stuff. Now obviously if the complaining involves requiring people to work for some feature you want, that's entitled and wack. Same goes for throwing a fit because maintainers went for systemd instead of your own favorite init system.

But being open source doesn't mean people shouldn't or can't just say "this is bad" or "don't use x" or even "this X feature doesn't work when using Y" (now If what they are saying isn't true, sure we should call them out)

It's normal and in fact I think it should be encouraged, as not everyone can be aware of potential problems they could have with a piece of OSS that is already known in some obscure dev mailing list. The more people are informed and aware of potential issues, the less they will be surprised and complain about it when they start using it.

(Fwiw I love btrfs, and I think it's very reliable)

noncoml · on March 21, 2024

Read the thread again. No one asked anyone to work for free.

Please don’t put words in other peoples mouth.

bayindirh · on March 20, 2024

You can reliably change ext3 with ext4 in the GP's comment. It's not an extX problem, it's an BTRFS problem.

If nobody cares that much, maybe deprecate the tool, then?

Also, just because it's Open Source (TM) doesn't mean developers will accept any patch regardless of its quality. Like everything, FOSS is 85% people, 15% code.

> You are as empowered to fix this as anybody else, if it presents a real problem for you.

I have reported many bugs in Open Source software. If I had the time to study code and author a fix, I'd submit the patch itself, which I was able to do, a couple of times.

Zardoz84 · on March 20, 2024

for me it's a problem of the tool converting in place EXT to BTRFS. Not necessarily a problem of BTRFS.

bayindirh · on March 20, 2024

I tried to say it’s a problem of BTRFS the project. Not BTRFS the file system.

jcalvinowens · on March 20, 2024

> If nobody cares that much, maybe deprecate the tool, then?

If you think that's the right thing to do, you're as free as anybody else to send documentation patches. I doubt anybody would argue with you here, but who knows :)

> Also, just because it's Open Source (TM) doesn't mean developers will accept any patch regardless of its quality.

Of course not. If you want to make a difference, you have to put in the work. It's worth it IMHO.

bayindirh · on March 20, 2024

> If you think that's the right thing to do, you're as free as anybody else to send documentation patches :)

That won't do anything. Instead I can start a small commotion by sending a small request (to the mailing lists) to deprecate the tool, which I don't want to do. Because I'm busy. :)

Also, I don't like commotions, and prefer civilized discussions.

> Of course not. If you want to make a difference, you have to put in the work. It's worth it IMHO.

Of course. This is what I do. For example, I have a one liner in Debian Installer. I had a big patch in GDM, but after coordinating with the developers, they decided to not merge the fix + new feature, for example.

jcalvinowens · on March 20, 2024

> For example, I have a one liner in Debian Installer. I had a big patch in GDM, but after coordinating with the developers, they decided to not merge the fix + new feature, for example

Surely you were given some justification as to why they didn't want to merge it? I realize sometimes these things are intractable, but in my experience that's rare... usually things can iterate towards a mutually agreeable solution.

bayindirh · on March 20, 2024

The sad part is they didn’t.

GTK has (or had) a sliding infoline widget, which is used to show notifications. GDM used it for password related prompts. It actually relayed PAM messages to that widget.

We were doing mass installations backed by an LDAP server which had password policies, including expiration enabled.

That widget had a bug, prevented it displaying a new message when it was in the middle of an animation, which effectively ate messages related to LDAP (Your password will expire in X days, etc.).

Also we needed a keyboard selector in that window, which was absent.

I gave heads up to the GDM team, they sent a “go ahead” as a reply. I have written an elaborate patch, which they rejected and wanted a simpler one. I iterated the way they wanted, they said it passes the muster and will be merged.

Further couple of mails never answered. I’m basically ghosted.

But the merge never came, and I moved on.

baq · on March 20, 2024

This sort of thing happens all the time in big orgs which develop a single product internally with a closed source… there isn’t any fix for this part of human nature, apparently.

cesarb · on March 20, 2024

> deleting files will not free any space.

Does a rebalance fix it? I have once (and only once, back when it was new) hit a "out of disk space" situation with btrfs, and IIRC rebalancing was enough to fix it.

> for a common use case

It might have been a common use case back when btrfs was new (though I doubt it, most users of btrfs probably created the filesystem from scratch even back then), but I doubt it's a common use case nowadays.

bayindirh · on March 20, 2024

From my perspective, A filesystem is a critical infrastructure in an OS, and failing here and there and not fixing these bugs because they're not common is not acceptable.

Same for the RAID5/6 bugs in BTRFS. What's their solution? A simple warning in the docs:

> RAID5/6 has known problems and should not be used in production. [0]

Also the CLI discourages you from creating these things. Brilliant.

This is why I don't use BTRFS anywhere. An FS shall be bulletproof. Errors must only cause from hardware problems. Not random bugs in a filesystem.

[0]: https://btrfs.readthedocs.io/en/latest/mkfs.btrfs.html#multi...

chronid · on March 20, 2024

Machines die. Hardware has bugs, or is broken. Things just bork. It's a fact of life.

Would I build a file storage system around btrfs? No - without proper redundancy at least. But I'm told at least Synology does.

I'm pretty sure there's plenty of cases where it's perfectly usable - the feature set it has today is plenty useful and the worst case scenario is an host reimage.

I can live with that. applications will generally break production ten billion times before btrfs does.

bayindirh · on March 20, 2024

> Machines dies. Hardware has bugs, or is broken. Things just bork. It's a fact of life.

I know, I'm a sysadmin. I care for hardware, mend it, heal it, and sometimes donate, cann-bird or bury it. I'm used to it.

> worst case scenario is an host reimage...

While hosting PBs of data on it? No, thanks.

> Would I build a file storage system around btrfs? No - without proper redundancy at least.

Everything is easy for small n. When you store 20TB on 4x5TB drives, everything can be done. When you have a >5PB of storage on racks, you need at least a copy of that system running hot-standby. That's not cheap in any sense.

Instead, I'd use ZFS, Lustre, anything, but not BTRFS.

> I can live with that - applications will generally break production ten billion times before btrfs does.

In our case, no. Our systems doesn't stop because a daemon decided to stop because a server among many fried itself.

chronid · on March 21, 2024

I have worked on and around systems with an order of magnitude more data and a single node failing did not matter. We weren't using btrfs anyway (for data drives) and it definitely was not cheap. But storage never is.

But again, most systems are not like that. Kubernetes cluster nodes? Reimage at will. Compute nodes for vms backed by SAN? Reimage at will. Btrfs can actually make that reimage faster and it's pretty reliable on a single flash drive so why not?

bayindirh · on March 21, 2024

Well, that was my primary point. BTRFS is not ready for these kind of big installations handled by ZFS or Lustre at this point.

On the other hand, BTRFS’ single disk performance, esp, for small files is visibly lower than EXT4 and XFS, so why bother?

There are many solutions for EXT4 which allows versioning, and if I can reimage a node (or 200) in 5 minute flat, why should I bother with the overhead of BTRFS?

It’s not that I haven’t tried BTRFS. Its features are nice, but from my perspective, it’s not ready for prime time, yet. What bothers me is the mental gymnastics pretending that it’s mature at this point.

It’ll be good file system. An excellent one in fact, but it still needs to cook.

wongarsu · on March 20, 2024

My impression of btrfs is that it's very useful and stable if you stay away from the sharp edges. Until you run into some random scenario that leads you to an unrecoverable file system.

But it has been that way for now 14 years. Sure, there are far fewer sharp edges now than there were back then. For a host you can just reimage it's fine, for a well-tested fairly restricted system it's fine. I stay far away from it for personal computers and my home-built NAS, because just about any other fs seems to be more stable.

bayindirh · on March 21, 2024

The thing is, none of the systems I have the luxury to run a filesystem which can randomly explode any time because I pressed a button developers didn't account for, yet.

I have bitten by ReiserFS' superblock corruption once, and that time I had plenty of time to rebuild my system leisurely. My current life doesn't allow for that. I need to be able to depend on my systems.

Again, I believe BTRFS will be an excellent filesystem in the long run. It's not ready yet for "format, mount and forget" from my perspective. Only I'm against is, "it runs on my machine, so yours' is a skill issue" take, which is harmful on many levels.

roelschroeven · on March 21, 2024

Synology uses btrfs on top of classic mdadm RAID; AFAIK they don't use btrfs's built-in RAID, or even any of btrfs's more advanced features.

nolist_policy · on March 20, 2024

You do you.

Personally, btrfs just works and the features are worth it.

Btrfs raid always gets brought up in these discussions, but you can just not use it. The reality is that it didn't have a commercial backer until now with Western Digital.

bayindirh · on March 21, 2024

If it works for you, then it's great. However, this doesn't change the fact that it does not work for many others.

If I'm just not gonna use BTRFS' RAID, I can just use mdadm + any file system I want. In this case, any file system becomes "anything but btrfs" from my point of view.

I've burnt by ReiserFS once. I'm not taking the same gamble with another FS, thanks.

chasil · on March 20, 2024

A rebalance means that every file on the filesystem will be rewritten.

This is drastic, and I'd rather perform such an operation on an image copy.

This is one case where ZFS is absolutely superior; if a drive goes offline, and is returned to a set at a later date, the resilver only touches the changed/needed blocks. Btrfs forces the entire filesystem to be rewritten in a rebalance, which is much more drastic.

I am very willing to allow ZFS mirrors to be degraded; I would never, ever let this happen to btrfs if at all avoidable.

o11c · on March 20, 2024

The desired "compress every file" operation will also cause every file on the filesystem to be rewritten though ...

eru · on March 20, 2024

> It might have been a common use case back when btrfs was new (though I doubt it, most users of btrfs probably created the filesystem from scratch even back then), but I doubt it's a common use case nowadays.

It's perhaps not as common as it once was, but you'd expect it to be common enough to work, and not some obscure corner case.

lxgr · on March 20, 2024

There’s literally no way I could migrate my NAS other than through an in-place FS conversion since it’s >>50% full.

The same probably applies to may consumer devices.

zaggynl · on March 20, 2024

For what it's worth, I'm happy with using Btrfs on OpenSUSE Tumbleweed and the provided tooling like snapper and restoring Btrfs snapshots from grub, saved me a few times.

SSD used: Samsung SSD 970 PRO 1T, same installation since 2020-05-02.

applied_heat · on March 20, 2024

Btrfs seems to work perfectly well in synology NAS. It must be some other combination of options or functions in use from what is available in synology that garners the bad reputation

gh02t · on March 20, 2024

Synology uses BTRFS weirdly, on top of dmraid, which negates the most well-known BTRFS bugs in their RAID5/6 implementation*. Best I can tell they also have some custom modifications in their implementation as well, though it's hard to find much info on it.

* FWIW, I used native BTRFS RAID5 for years and never had an issue but that's just anecdata

jcalvinowens · on March 20, 2024

This bug is very very rare in practice: all my dev and testing machines run btrfs, and I haven't hit it once in 100+ machine-hours of running on 6.8-rc.

The actual patch is buried at the end the article: https://lore.kernel.org/linux-btrfs/1ca6e688950ee82b1526bb30...

lxgr · on March 20, 2024

100 error-free machine hours isn’t exactly evidence of anything when it comes to FS bugs, though.

tavianator · on March 21, 2024

At first I thought jcalvinowens meant they had actively tried to reproduce this bug for 100 machine hours, which would lend credence to "very very rare in practice". But in 100 machine hours of a typical workload doesn't really show anything about this bug. Also the bug was introduced in v6.5, so very very many btrfs users have avoided this bug for far longer than 100 machine hours.

I do agree it's "very very rare in practice". You need at least 3 threads racing on the same disk block in a ~15 instruction window, during which time you need the other thread to start and finish reading a block from disk. Which means you need very fast I/O and decryption and enough CPU oversubscription to get preempted in the race window. And it only happens when you actually hit disk which means once your machine is in a steady state, most metadata blocks you need will already be cached and you'll never see the bug.

That said, this is not some crazy 1-instruction race window with preemption disabled. With enough threads calling statx() in parallel and regularly dropping caches, I can reproduce it consistently within a few minutes. Way less than 100 hours :)

jcalvinowens · on March 22, 2024

> But in 100 machine hours of a typical workload doesn't really show anything about this bug

It does though: it shows it isn't a dire usability problem. A machine with this bug is still able to do meaningful work, it just might eat it's own tail occasionally.

If I ran N servers with this bug and had an automated thing to whack them when they tripped over it, how many would be out of commission at any given time? All signs point to it being a pretty small number. As long as data loss isn't a concern, a sysadmin would probably choose to wait a couple weeks for the fix through the proper channels, rather than going to a lot of trouble to do it immediately themselves (which also carries risk).

This is also why I was curious about kconfig: my test boxes have all the hardening options turned on right now, I could see how that might make it less likely to trigger.

I completely agree the "uniform random bug" model is only so useful, since an artificial reproducer obviously blows it out of the water. But as I said elsewhere, I've seen it be shockingly predictive when applied to large numbers of machines with bugs like this.

jcalvinowens · on March 20, 2024

Of course it is: it's evidence the average person will never hit this bug. Statistics, and all that.

Having worked on huge deployments of Linux servers before, I can tell you that modelling race condition bugs as having a uniform random chance of happening per unit time is shockingly predictive. But it's not proof, obviously.

lxgr · on March 20, 2024

> modelling race condition bugs as having a uniform random chance of happening per unit time is shockingly predictive

I don't generally disagree with that methodology, but 100 hours is just not a lot of time.

If you have a condition that takes, on average, 1000 hours to occur, you have a 9 in 10 chance of missing it based on 100 error-free hours observed, and yet it will still affect nearly 100% of all of your users after a bit more than a month!

For file systems, the aim should be (much more than) five nines, not nine fives.

jcalvinowens · on March 20, 2024

You edited this in after I replied, or maybe I missed it:

> If you have a condition that takes, on average, 1000 hours to occur, you have a 9 in 10 chance of missing it based on 100 error-free hours observed, and yet it will still affect nearly 100% of all of your users after a bit more than a month!

I understand the point you're trying to make here, but 1000 is just an incredibly unrealistically small number if we're modelling bugs like that. The real number might be on the order of millions. The effect you're describing in real life might take decades: weeks is unrealistic.

lxgr · on March 20, 2024

I agree that a realistic error rate for this particular bug is much lower than 1 in 1000 hours (or it would have long been caught by others).

But that makes your evidence of 100 error-free hours even less useful to make any predictions about stability!

jcalvinowens · on March 22, 2024

> But that makes your evidence of 100 error-free hours even less useful to make any predictions about stability!

You're still conflating probabilities an event occurs across a group with the probability an event happens to one specific individual in that group (in this case, me). I'm talking about the second thing, and it's very much not the same.

If I could rewind the universe and replay it many many times, some portion of those times I will either be very lucky or very unlucky, and get an initial testing result that badly mispredicts my personal future. But we know that most of the time, I won't.

I can actually prove that. Because of the simple assumptions we're making, we can directly compute the probabilities we are initially that wrong:

  Odds we test 1000-hour bug for 1000 hours without tripping: 0.999^1000 = 36.7%
  Odds we test 50-hour bug for 100 hours without tripping: 0.98^100 = 13.3%
  Odds we test 10-hour bug for 100 hours without tripping: 0.9^100 = 0.002%

Under our spherical cow assumptions, my 100 hours is a very convincing demonstration the real bug rate is less than one per 10 hours.

Of course, in the real world, you might never hit the bug because you have to pat yourself on the head while singing Louie Louie and making three concurrent statx() calls on prime numbered CPUs with buffers off-by-one from 1GB alignment while Mars is in the fifth house to trigger it... it's just a model, after all.

jcalvinowens · on March 20, 2024

> If you have a condition that takes, on average, 1000 hours to occur, you have a 9 in 10 chance of missing it based on 100 error-free hours observed

Yes. Which means 9/10 users who used their machines 100 or fewer hours on the new kernel will never hit the hypothetical bug. Thank you for proving my point!

I'm not a filesystem developer, I'm a user: as a user, I don't care about the long tail, I only care about the average case as it relates to my deployment size. As you correctly point out, my deployment is of negligible size, and the long tail is far far beyond my reach.

Aside: your hypothetical event has a 0.1% chance of happening each hour. That means it has a 99.9% chance of not happening each hour. The odds it doesn't happen after 100 hours is 0.999^100, or 90.5%. I think you know that, I just don't want a casual reader to infer it's 90% because 1-(100/1000) is 0.9.

lxgr · on March 20, 2024

> Which means 9/10 users who used their machines 100 or fewer hours on the new kernel will never hit the bug.

No, that's not how probabilities work at all for a bug that happens with uniform probability (i.e. not bugs that deterministically happen after n hours since boot). If you have millions of users, some of them will hit it within hours or even minutes after boot!

> As you correctly point out, my deployment is of negligible size, and the long tail is far far beyond my reach.

So you don't expect to accrue on the order of 1000 machine-hours in your deployment? That's only a month for a single machine, or half a week for 10. That would be way too much for me even for my home server RPi, let alone anything that holds customer data.

> I'm not a filesystem developer, I'm a user: I don't care about the long tail, I only care about the average case as it relates to my deployment size.

Yes, but unfortunately you seem to either have the math completely wrong or I'm not understanding your deployment properly.

jcalvinowens · on March 20, 2024

> So you don't expect to accrue on the order of 1000 machine-hours in your deployment?

The 1000 number came from you. I have no idea where you got it from. I suspect the "real number" is several orders of magnitude higher, but I have no idea, and it's sort of artificial in the first place.

My overarching point is that mine is such a vanishingly small portion of the universe of machines running btrfs that I am virtually guaranteed that bugs will be found and fixed before they affect me, exactly as happened here. Unless you run a rather large business, that's probably true for you too.

The filesystem with the most users has the least bugs. Nothing with the feature set of btrfs has even 1% the real world deployment footprint it does.

> If you have millions of users, some of them will hit it within hours or even minutes after boot!

This is weirdly sensationalist: I don't get it. Nobody dies when their filesystem gets corrupted. Nobody even loses money, unless they've been negligent. At worst it's a nuisance to restore a backup.

lxgr · on March 20, 2024

> The 1000 number came from you. I have no idea where you got it from,

It's an arbitrary example of an error rate you'd have a 90% chance of missing in your sample size of 100 machine-hours, yet much too high for almost any meaningful application.

I have no idea what the actual error rate of that btrfs bug is; my only point is that your original assertion of "I've experienced 100 error-free hours, so this is a non-issue for me and my users" is a non sequitur.

> This is weirdly sensationalist: I don't get it. Nobody dies when their filesystem gets corrupted. Nobody even loses money, unless they've been negligent.

I don't know what to say to that other than that I wish I had your optimism on reliable system design practices across various industries.

Maybe there's a parallel universe where people treat every file system as having an error rate of something like "data corruption/loss once every four days", but it's not the one I'm familiar with.

For better or worse, the bar for file system reliability is much, much, much, much higher than anything you could reasonably produce empirical data for unless you're operating at Google/AWS etc. scale.

jcalvinowens · on March 20, 2024

> "I've experienced 100 error-free hours, so this is a non-issue for me and my users"

It's a statement of fact: it has been a non-issue for me. If you're like me, it's statistically reasonable to assume it will be a non-issue for you too. Also, no users, just me. "Proabably okay" is more than good enough for me, and I'm sure many people have similar requirements (clearly not you).

I have no optimism, just no empathy for the negligent: I learned my lesson with backups a long time ago. Some people blame the filesystem instead of their backup practices when their data is corrupted, but I think that's naive. The filesystem did you a favor, fix your shit. Next time it will be your NAS power supply frying your storage.

It's also a double edged sword: the more reliable a filesystem is, the longer users can get away without backups before being bitten, and the greater their ultimate loss will be.

lxgr · on March 20, 2024

> It's a statement of fact: it has been a non-issue for me.

Yes...

> If you're like me, it's statistically reasonable to assume it will be a non-issue for you too.

No! This simply does not follow from the first statement, statistically or otherwise.

You and I might or might not be fine; you having been fine for 100 hours on the same configuration just offers next-to-zero predictive power for that.

jcalvinowens · on March 20, 2024

> No! This simply does not follow from the first statement, statistically or otherwise.

> You and I might or might not be fine; you having been fine for 100 hours on the same configuration just offers next-to-zero predictive power for that.

You're missing the forest for the trees here.

It is predictive ON AVERAGE. I don't care about the worst case like you do: I only care about the expected case. If I died when my filesystem got corrupted... I would hope it's obvious I wouldn't approach it this way.

Adding to this: my laptop has this btrfs bug right now. I'm not going to do anything about it, because it's not worth 20 minutes of my time to rebuild my kernel for a bug that is unlikely to bite before I get the fix in 6.9-rc1, and would only cost me 30 minutes of time in the worst case if it did.

I'll update if it bites me. I've bet on much worse poker hands :)

lxgr · on March 20, 2024

Well, from your data (100 error-free hours, sample size 1) alone, we can only conclude this: “The bug probably happens less frequently than every few hours”.

Is that reliable enough for you? Great! Is that “very rare”? Absolutely not for almost any type of user/scenario I can imagine.

If you’re making any statistical arguments beyond that data, or are implying more data than that, please provide either, otherwise this will lead nowhere.

Dylan16807 · on March 21, 2024

> I only care about the expected case.

The expected case after surviving a hundred hours is that you're likely to survive another hundred.

Which is a completely useless promise.

That piece of data doesn't let you predict anything at reasonable time scales for an OS install.

You can't squeeze more implications out of such a small sample.

jcalvinowens · on March 22, 2024

I don't care about the aggregate: I only care about me and my machine here.

> The expected case after surviving a hundred hours is that you're likely to survive another hundred.

That's exactly right. I don't expect to accrue another hundred hours before the new release, so I'll likely be fine.

> Which is a completely useless promise.

Statistics is never a promise: that's a really naive concept.

> at reasonable time scales for an OS

The timescale of the OS install is irrelevant: all that matters is the time between when the bug is introduced and when it is fixed. In this case, about nine months.

Dylan16807 · on March 22, 2024

You only use your machines for twenty hours per month?

Even so, "likely" here is something like "better than 50:50". Your claim was "very very rare" and that's not supported by the evidence.

> Statistics is never a promise: that's a really naive concept.

It's a promise of odds with error bars, don't be so nitpicky.

jcalvinowens · on March 22, 2024

> Even so, "likely" here is something like "better than 50:50". Your claim was "very very rare" and that's not supported by the evidence.

You're free to disagree, obviously, but I think it's accurate to describe a race condition that doesn't happen in 100 hours on a multiple machines with clock rates north of 3GHz as "very very rare". That particular code containing the bug has probably executed tens of millions of times on my little pile of machines alone.

> It's a promise of odds with error bars, don't be so nitpicky.

No, it's not. I'm not being nitpicky, the word "promise" is entirely inapplicable to statistics.

Dylan16807 · on March 22, 2024

If my computer has a filesystem error that happens every week of uptime (168 machine hours), I call that "common".

bmicraft · on March 21, 2024

> Nothing with the feature set of btrfs has even 1% the real world deployment footprint it does.

So you haven't heard of zfs then?

kbolino · on March 20, 2024

A single 9 in reliability over 100 hours would be colossally bad for a filesystem. For the average office user, 100 hours is not even a month's worth of daily use.

Even as an anecdote this is completely useless. A couple thousand hours and dozens of mount/unmount cycles would just be a good start.

paulddraper · on March 20, 2024

> Yes. Which means 9/10 users who used their machines 100 or fewer hours on the new kernel will never hit the hypothetical bug. Thank you for proving my point!

So....that's really bad.

BenjiWiebe · on March 20, 2024

If I'm running btrfs on my NAS, that's only ~4 days of runtime. If there's a bug that trashes the filesystem every month on average, that's really bad and yet is very unlikely to get caught in 4 days of running.

7bit · on March 20, 2024

> Of course it is: it's evidence the average person will never hit this bug. Statistics, and all that.

Anecdotal statistics maybe.

kalleboo · on March 21, 2024

I've been running a public file server[0] on a 1994 vintage PowerBook 540c running Macintosh System 7.5 using the HFS file system on an SD card (via a SCSI2SD adapter) for like 400 hours straight this month, with zero issues.

Not once would I even insinuate that after my 400 hours of experience, HFS is a file system people should rely on.

100 hours says nothing. When I read your comment I just assumed that you had typoed 100 machine-years (as in running a server farm) as that would have been far more relevant.

[0] Participating in the MARCHintosh GlobalTalk worldwide AppleTalk network full of vintage computers and emulators

valicord · on March 20, 2024

You do understand that 100 hours is not even 5 days, right?

tavianator · on March 20, 2024

Do they run btrfs on top of dm-crypt? I suspect it's impossible to reproduce on a regular block device.

jcalvinowens · on March 20, 2024

Did this ever trip any of the debugging ASSERT stuff for you? I'm really curious if some more generic debugging instrument might be able to flag this failure mode more explicitly, it's far from the first ABA problem with ->bflags.

Also, if you don't mind sharing your kconfig that would be interesting to see too.

tavianator · on March 20, 2024

No, but I may have been missing some debugging CONFIGs. I was just using the stock Arch kconfig.

I did submit a patch to add a WARN_ON() for this case: https://lore.kernel.org/linux-btrfs/d4a055317bdb8ecbd7e6d9bd...

But to catch the general class of bugs you'd need something like KCSAN. I did try that but the kernel is not KCSAN-clean so it was hard to know if any reports were relevant.

jcalvinowens · on March 20, 2024

The WARN is great.

Something as general as KCSAN isn't necessary: it's a classic ABA problem, double checking the ->bflags value on transitions is sufficient to catch it. Like a lockdep-style thing where *_bit() are optionally replaced by helpers that check the current value and WARN if the transition is unexpected.

Using the event numbering from your patch description, such a thing would have flagged seeing UPTODATE at (2). But the space of invalid transitions is is much larger than the space of valid ones, which is why I think it might help catch other future bugs sooner.

Dunno if it's actually worth it, but I definitely recall bugs of this flavor in the past. It'll take some work to unearth them all, alas...

tavianator · on March 20, 2024

I like that idea, but it isn't really compatible with my fix. `bflags` still makes the same transitions, I just skip the actual read in the `UPTODATE | READING` case.

jcalvinowens · on March 20, 2024

I think after your fix the set of valid states in that spot would just include UPTODATE. But this is all very poorly thought out on my part...

jcalvinowens · on March 20, 2024

Yes, most of them do, precisely because I'm trying to catch more bugs :)

Daunk · on March 20, 2024

I recently tried (for the first time) Btrfs on my low-end laptop (no snapshots), and I was surprised to see that the laptop ran even worse than it usually does! Turns out there was something like a "btrfs-cleaner" (or similar) running in the background, eating up almost all the CPU at all time. After about 2 days I jumped over to ext4 and everything ran just fine.

mritzmann · on March 20, 2024

Had a similar problem but can't remember the Btrfs process. Anyway, after I switched off Btrfs quotas, everything was fine.

nolist_policy · on March 20, 2024

What was your workload? Do you have quotas enabled? Compression? Are you running OpenSuse by any chance?

Daunk · on March 20, 2024

Workload was literally zero, I just logged into XFCE and could barely do anything for 2 days straight. No quotas and no compression, but it was indeed openSUSE!

nolist_policy · on March 20, 2024

That explains it, because openSUSE uses snapshots and quotas by default. It creates a snapshot before and one after every package manager interaction and cleans up old snapshots once per day.

Unfortunately, deleting snapshots with quotas is an expensive operation that needs to rescan some structures to keep the quota information consistent and that is what you're seeing.

Daunk · on March 20, 2024

I'm not sure that's correct. When you install openSUSE (this was a clean install) there's a checkbox that asks if you want snapshots, which I did not enable. But either way, a fresh openSUSE install with XFCE on Btrfs rendering the computer unusable for, at least, 2 days is not okay in my book, even if snapshots were enabled.

mustache_kimono · on March 20, 2024

> I recently tried Btrfs on my low-end laptop (no snapshots)

Do snapshots degrade the performance of btrfs?

j16sdiz · on March 21, 2024

No noticeably. you see the difference when enabling CoW though

eru · on March 20, 2024

Interesting that the 'cleaner' doesn't run as nice?

rcthompson · on March 20, 2024

I'm pretty sure it's a kernel thread, not a process, since it's part of the filesystem. So it can't be renice'd.

eru · on March 21, 2024

Oh, that's annoying. I wonder if there's a way with kernel configs etc to only run that thread when there's nothing better to do? Sounds like something someone would have thought of in general for background house keeping?

e145bc455f1 · on March 20, 2024

Just last week my btrfs filesystem got irrecoverably corrupted. This is like the fourth time it has happened to me in the last 10 years. Do not use it in consumer grade hardware. Compared to this, ext4 is rock solid. It was even able to survive me accidentally passing the currently running host's hard disk to a VM guest, which booted from it.

londons_explore · on March 20, 2024

> It was even able to survive me accidentally passing the currently running host's hard disk to a VM guest, which booted from it.

I have also done this, and was also happy that the only corruption was to a handful of unimportant log files. Part of a robust filesystem is that when the user does something stupid, the blast radius is small.

Other less-smart filesystems could easily have said "root of btree version mismatch, deleting bad btree node, deleting a bunch of now unused btree nodes, your filesystem is now empty, have a nice day".

gmokki · on March 20, 2024

I have had same btrfs filesystem in use for 15+ years, with 6 disks of various sizes. And all hardware components changed at least once during the fileystsen lifetime.

Worst corruption was when one DIMM started corrupting data. As a result computer kept crashing and eventually refused to mount because of btrfs checksum mismatches.

Fix was to buy new HW. Then run btrfs filesystem repairs, which failed at some point but at least got the filesystem running as long as I did not touch the most corrupted locations, luckily it was RAID1 so most checksums had a correct value on another disk. Unfortunately the checksum tree had on two locations corruption on both copies. I had to open the raw disks with hex editor and change the offending byte to correct value, after which the filesystem has been running again smoothly for 5 years.

And to find the location to modify on the disks I built a custom kernel that printed the expected value and absolute disk position when it detected the specific corruption. Plus had to ask a friend to double check my changes since I did not have any backups.

matja · on March 20, 2024

> running again smoothly for 5 years

So did you bite the bullet and get ECC, or are you just waiting for the next corruption caused by memory errors? :)

londons_explore · on March 20, 2024

> last week my btrfs filesystem got irrecoverably corrupted.

This is 2 bugs really. 1, the file system got corrupted. 2, tooling didn't exist to automatically scan through the disk data structures and recover as much of your drive as possible from whatever fragments of metadata and data were left.

For 2, it should happen by default. Most users don't want a 'disk is corrupt, refusing to mount' error. Most users want any errors to auto-correct if possible and get on with their day. Keep a recovery logfile with all the info needed to reverse any repairs for that small percentage of users who want to use a hex editor to dive into data corruption by hand.

TillE · on March 20, 2024

Yeah the last time I had a btrfs volume die, there were a few troubleshooting/recovery steps on the wiki which I dutifully followed. Complete failure, no data recoverable. The last step was "I dunno, go ask someone on IRC." Great.

It's understandable that corruption can happen due to bugs or hardware failure or user insanity, but my experience was that the recovery tools are useless, and that's a big problem.

rcthompson · on March 20, 2024

Writing to a corrupted filesystem by default is bad design. The corruption could be caused by a hardware problem that is exacerbated by further writes, leading to additional data loss.

mrob · on March 20, 2024

Where is that log file supposed to be stored? It can't be on the same filesystem it was created for or it negates the purpose of its creation.

londons_explore · on March 20, 2024

If I were designing it, the recovery process would:

* scan through the whole disk and, for every sector, decide if it is "definitely free space (part of the free space table, not referenced by any metadata)", "definitely metadata/file data", "unknown/unsure (ie. perhaps referenced by some dangling metadata/an old version of some tree nodes)".

* I would then make a new file containing a complete image of the whole filesystem pre-repair, but leaving out the 'definitely free space' parts.

* such a file takes nearly zero space, considering btrfs's copy-on-write and sparse-file abilities.

* I would then repair the filesystem to make everything consistent. The pre-repair file would still be available for any tooling wanting to see what the filesystem looked like before it was repaired. You could even loopmount it or try other repair options on it.

* I would probably encourage distros to auto-delete this recovery file if disk space is low/after some time, since otherwise the recovery image will end up pinning user data to using up disk space for years and users will be unhappy.

The above fails in only one case: Free space on the drive is very low. In that case, I would probably just do the repairs in-RAM and mount the filesystem readonly, and have a link to a wiki page on possible manual repair routes.

j16sdiz · on March 21, 2024

>The above fails in only one case: Free space on the drive is very low.

No. Most of the block will be marked as unsure in first step -- because most of them had been used before thanks to CoW

londons_explore · on March 21, 2024

A heuristic could be written like 'protect the latest version of each node, plus 2 prior versions, but anything older you find, treat it as free apace'.

nolist_policy · on March 20, 2024

Best send a bugreport to the btrfs mailing list at linux-btrfs@vger.kernel.org.

If possible include the last kernel log entries before it corrupted. Include kernel version, drive model and drive firmware version.

thequux · on March 21, 2024

Huh. I've been running btrfs on a number of systems for probably 12 years at this point. One array in particular was 12TiB of raw storage used for storing VM images in heavy use. Each disk had ~9 years of spindle-on time before I happened to look closely at the SMART output and realized that they were all ST3000DM001's and promptly swapped them all out. The only issue I've ever run into is running out of metadata chunks and needing to rebalance, and that was just once.

matheusmoreira · on March 20, 2024

> Compared to this, ext4 is rock solid.

Ext4 is the most reliable file system I have ever used. Just works and has never failed on me, not even once. No idea why btrfs can't match its quality despite over a decade of development.

riku_iki · on March 20, 2024

how do you know it was issue with FS and not actual hardware/disk?..

viraptor · on March 20, 2024

Yeah, that's the fun part of the ext/btrfs corruption posts. If you got repeating corruption on btrfs on the same drive but not on ext, how do you know it's not just a drive failure that ext is not able to notice? What would happen if you tried ext with dm-integrity?

champtar · on March 21, 2024

Bad DIMM is a thing, even more so on consumer HW that lack ECC. I recommend you run memtest

west0n · on March 20, 2024

The vast majority of databases currently recommend using XFS.

londons_explore · on March 20, 2024

All modern databases do large block streaming appending writes, and small random reads, usually of just a handful of files.

It ought to be easy to design a filesystem which can have pretty much zero overhead for those two operations. I'm kinda disappointed that every filesystem doesn't perform identically for the database workload.

I totally understand that different file systems would do different tradeoffs affecting directory listing, tiny file creation/deletion, traversing deep directory trees, etc. But random reads of a huge file ought to perform near identically to the underlying storage medium.

rubiquity · on March 20, 2024

What you’re describing is basically ext4. I do know there are some proprietary databases that use raw block devices. The downside being that you also need to make your own user space utilities for inspecting and managing it.

west0n · on March 20, 2024

I'm very curious if there are any databases running in production environments on Btrfs or ZFS.

yjftsjthsd-h · on March 20, 2024

It wasn't a company you've heard of, but I can absolutely tell you that there are:) We really benefited from compressing our postgres databases; not only did it save space, but in some cases it actually improved throughput because data could be read and uncompressed faster than the disks were physically capable of.

Edit: Also there was a time when it was normal for databases to be running on Solaris, in which case ZFS is expected; my recollection is that Sun even marketed how great of a combination that was.

dfox · on March 20, 2024

Sun marketed the combination of PostgreSQL and ZFS quite heavily. But IIRC most of the success stories they presented involved what was decidedly non-OLTP read heavy workload with somewhat ridiculously large databases. One of the case studies even involved using PostgreSQL on ZFS as an archival layer for Oracle, with the observation that what is a large Oracle database is medium-sized one for PostgreSQL (and probably even more so if it is read mostly and on ZFS).

Sunhad some recommendations for tuning PostgreSQL performance on ZFS, but the recommendations seemed weird or even wrong for OLTP on top of CoW filesystem (I assume that these recommendations were meant for the above mentioned DWH/archival scenarios).

eru · on March 20, 2024

Interesting. The compression options that postgres offers itself were not enough?

yjftsjthsd-h · on March 20, 2024

It went the other way; we were using ZFS to protect against data errors, then found that it did compression as well. But looking now, the only native postgres options I see are TOAST, which seems to only work for certain data types in the database, and WAL compression (that's only existed since pg 15), so unless I've missed something I would tend to say yes it's far superior to the native options.

eru · on March 21, 2024

I mostly only looked into the options offered by MariaDB a while ago, and they seemed quite neat. I had just assumed that postgres was at least on par.

Thanks for reporting!

riku_iki · on March 20, 2024

postgres allows to compress only large text and blob column values.

pQd · on March 21, 2024

We're using BTRFS to host PostgreSQL and MySQL replication slaves. We're snapshoting drives holding data for both every 15 minutes, 1h, 8h and 12h and keep few snapshots for each frequency.

Those replicas are not used for any workload, besides nightly consistency checks for MySQLs via pt-table-checksum to ensure we don't have data drift.

Snapshots are crash consistent. Once in a while they give us ability to very quickly inspect how data looked like few minutes or hours ago. This can be life-saver in case of fat-fingering a production data and saved us from lenghty grepping of backups when we needed to recover few records from a specific table.

Yes, I know soft deletes, audit logs - all of those could help and we do have them, but sometimes that's not enough or not feasible.

Due to it's COW nature BTRFS is far from perfect for data that changes all the time [ databases busy with writes, images of VMs with plenty of disk write activity ]. There's plenty of write amplification, but that can be solved with NVMe drives thrown on the problem.

dilyevsky · on March 21, 2024

How do you avoid heavy fragmentation caused by random writes? Do you disable COW (sounds like "no", given you snapshot)? Or autodefrag (how's performance)?

chasil · on March 20, 2024

Only 12 pages. Oracle database on ZFS best practices.

https://www.oracle.com/technetwork/server-storage/solaris10/...

dilyevsky · on March 20, 2024

afaik meta (who are very large btrfs user) do not use it with mysql but do use it with rocksdb backends.

i think you can tweak it to make high random write load less painful but it generally will struggle with that.

throw0101d · on March 20, 2024

Given that (Open)ZFS[1] is quite mature, and Bcachefs[2] seems to be gaining popularity, how much of a future does Btrfs have?

[1] https://en.wikipedia.org/wiki/ZFS

[2] https://en.wikipedia.org/wiki/Bcachefs

nolist_policy · on March 20, 2024

I fully expect bcachefs will initially hit similar issues like btrfs.

Until now (bcachefs has been merged) it has only been used by people running custom kernels. As more people try it, it will hit more and more drives with buggy firmware and whatnot.

viraptor · on March 20, 2024

They're in a slightly different position though. Bcache itself has existed for many years and has been used in production. Bcachefs changes it quite a bit, but I wouldn't expect any basic issues we've seen elsewhere.

bhaney · on March 20, 2024

As a heavy btrfs user, I do expect bcachefs to fully replace it eventually. But that's still many years off.

opengears · on March 20, 2024

I big future (unless ZFS licensing incompatibility is solved)

throw0101d · on March 20, 2024

Why would Btrfs have a big(ger) future than Bcachefs when the latter seems to have the same functionality and less 'historical baggage'†?

I remember when Btrfs was initially released (I was doing Solaris sysadmining, and it was billed as "ZFS for Linux"), and yet here we are all these years later and it still seems to be 'meh'.

† E.g., RAID5+ that's less likely to eat data:

* https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid5...

* https://bcachefs.org/ErasureCoding/ / https://github.com/koverstreet/bcachefs/issues/657

bhaney · on March 20, 2024

By the time future-bcachefs has feature parity with present-btrfs, who knows what more will be in future-btrfs?

For your specific example, bcachefs's erasure coding is very experimental and currently pretty much unusable, while btrfs is actively working towards fixing the raid56 write hole with the recent addition of the raid-stripe-tree. By the time bcachefs has a trustworthy parity profile, btrfs's may be just as good.

throw0101d · on March 20, 2024

> For your specific example

My specific example says that the bcachefs are "actively working towards fixing the raid56 write hole" as well—or rather, their way of doing things doesn't have one in the first place.

bhaney · on March 20, 2024

> bcachefs are "actively working towards fixing the raid56 write hole" as well

Yep, that's my point. Neither btrfs nor bcachefs have a write-hole-less parity raid profile implementation yet, and both are working towards one. We don't know if one will be finished and battle tested significantly before the other, or if one will prove to be more performant or reliable. Just have to wait and see.

curt15 · on March 20, 2024

Bcachefs has not advertised erasure coding as production ready only to renege on that claim later. So nobody has been unwittingly burned yet.

maxloh · on March 20, 2024

GPL only forbids ZFS to be distributed alongside Linux, it doesn't prevent users from installing it manually. (IANAL)

vladvasiliu · on March 20, 2024

In addition to what mustache_kimono said, there apparently seem to be issues with kernel functions moving / changing and breaking ZFS which then needs a while to catch up. For the latest reoccurrence of this, see https://github.com/openzfs/zfs/pull/15931 for Linux 6.8 compatibility.

There's also the fact that not everything is OOB compatible with ZFS. For example, newer versions of systemd have been able to use the TPM to unlock drives encrypted with LUKS. AFAIK it doesn't work with ZFS.

I use ZFS on my daily driver Linux box and mostly love it, but as long as these things happen, I can see why people may want to try to find an in-kernel solution. I personally use Arch so expect for bleeding edge updates to not work perfectly right away. But I recall seeing folks complaining about issues on Fedora, too, which I expect to be a bit more conservative than Arch.

didntcheck · on March 20, 2024

LUKS and the associated systemd hook shouldn't care about the filesystem, right? It's just block layer

But presumably you meant native ZFS encryption, which unfortunately is considered to be somewhat experimental and neglected, as I understand. Which surprised me, since I thought data at rest encryption would be pretty important for an "enterprise" filesystem

Still, apparently lots of people successfully run ZFS on LUKS. It does mean you don't get zero-trust zfs send backups, but apparently that's where a lot of the bugs have been anyway

vladvasiliu · on March 20, 2024

Yes, I was talking about ZFS native encryption.

> But presumably you meant native ZFS encryption, which unfortunately is considered to be somewhat experimental and neglected, as I understand. Which surprised me, since I thought data at rest encryption would be pretty important for an "enterprise" filesystem.

Yeah, I've happened about someone saying something similar, but I've never seen anything about that from "official" sources. Wouldn't mind a link or something if you have one on hand. But there is the fact that this encryption scheme seems limited when compared to LUKS: there's no support for multiple passphrases or using anything more convenient, like say, a U2F token.

> Still, apparently lots of people successfully run ZFS on LUKS. It does mean you don't get zero-trust zfs send backups, but apparently that's where a lot of the bugs have been anyway

I'd say I'm one of those people, never had any issue with this in ~ten years of use. And, indeed, the main reason for using this on my laptop is being able to send the snapshots around without having to deal with another tool for managing encryption. Also, on my servers on which I run RAIDZ, having to configure LUKS on top of each drive is a PITA.

mustache_kimono · on March 20, 2024

> GPL only forbids ZFS to be distributed alongside Linux

But does it even do that? You might be surprised when/if you read a little more widely. Position of OpenZFS project[0] is which I find persuasive (my emphasis added):

    In the case of the Linux Kernel, this prevents us from distributing OpenZFS as part of the Linux Kernel binary. *However, there is nothing in either license that prevents distributing it in the form of a binary module* or in the form of source code.

[0]: https://openzfs.github.io/openzfs-docs/License.html

You might see also:

[1]: https://www.networkworld.com/article/836039/smb-encouraging-...

[2]: https://law.resource.org/pub/us/case/reporter/F2/977/977.F2d...

lmz · on March 20, 2024

It doesn't matter what the project thinks as long as there is code owned by Oracle in the FS.

mustache_kimono · on March 20, 2024

> It doesn't matter what the project thinks as long as there is code owned by Oracle in the FS.

Might we agree then that the only thing that really matters is the law? And we should ignore other opinions coughlike from the FSF/the SFCcough which don't make reference to the law? Or which ignore long held copyright law principles, like fair use?

Please take a look at the case law. The so far theoretical claim of OpenZFS/Linux incompatibility is especially weak re: a binary kernel module.

wtallis · on March 20, 2024

What matters in practice isn't the law, but how much trouble Oracle could cause should the lawnmower veer in that direction. Even a complete rewrite of ZFS would have some risk associated with it given Oracle's history.

mustache_kimono · on March 20, 2024

> What matters in practice isn't the law, but how much trouble Oracle could cause should the lawnmower veer in that direction.

Veer in what direction? The current state of affairs re: Canonical and Oracle is a ZFS binary kernel module shipped with the Linux kernel. Canonical has literally done the thing that which you are speculating is impermissible. And, for 8 years, Oracle has done nothing. Oracle's attorneys have even publicly disagreed with the SFC that a ZFS binary kernel module violates the GPL or the CDDL.[0]

Given this state of affairs, the level of legal certainty re: this question is far greater than the legal certainty we have re: pretty much any other open IP question in tech.

What matters is practice, is that you stop your/the SFC's/the FSF's torrent of FUD.

> Even a complete rewrite of ZFS would have some risk associated with it given Oracle's history.

I'd ask "How?", but it'd be another torrent of "What if..."s.

[0]: https://youtu.be/PFMPjt_RgXA?t=2260

wtallis · on March 21, 2024

I said that what matters is not the law, to which you responded by doubling down on your argument about what the law says, accused me of disagreeing with you on an issue I did not take a position on, and then went for the personal attacks.

In case my lawnmower reference confused you enough that you were unable to make an appropriate response to the point I actually made, I'll try to state it a bit more clearly:

It does not matter how confident you are that Oracle would eventually lose a lawsuit over using or distributing ZFS with the Linux kernel. If Oracle decides to attempt to exert control over ZFS and interfere with the use or distribution of ZFS with Linux, they have ample resources to make a lot of very expensive trouble for various users and organizations. Oracle's history—most importantly, their history vs Google re Java in Android—means it would not be much of a stretch for them to decide to start behaving like The SCO Group. I do not think this risk is large. But I do think it is a real risk that a cautious Linux distro can reasonably be worried about.

If you truly believe that Oracle's lack of action thus far against ZFS on Linux and their public statements of their beliefs about the effects of the CDDL and GPL would prevent them from starting shit, then you are simply wrong about how our legal system works, and there are plenty of examples. The things you point to to bolster your arguments about what the law says are things that would make it hard for Oracle to win a lawsuit on its merits, but the eventual judgement is hardly the only thing that matters when assessing a legal risk—especially if your pockets are not as deep as Oracle's.

mustache_kimono · on March 21, 2024

> I do not think this risk is large. But I do think it is a real risk that a cautious Linux distro can reasonably be worried about.

The 2nd or 3rd most commercially important Linux distro has been using ZFS since 2016.

> If you truly believe that Oracle's lack of action thus far against ZFS on Linux and their public statements of their beliefs about the effects of the CDDL and GPL would prevent them from starting shit

I understand the argument "Oracle might do something", plus spooky magic fingers and creepy noises, all too well. Except it's not actually an argument. It might be best described as a boogeyman, sent to frighten little children into not running ZFS.

My point was: I think it's time for you to get over sleeping with the light on.

In 2000, we called this FUD when Microsoft did this. In 2024, we should know better, even when you're fronting for the FSF or the SFC.

maxloh · on March 20, 2024

Maybe replace Oracle-owned code with a clean room implementation?

arp242 · on March 20, 2024

They did that. It's called "btrfs".

A stable clean-room ZFS with on-disk compatibility would be a huge task. How long did stable NTFS write capability take? And NTFS is a much simpler filesystem. It would also be a huge waste of time given that btrfs and bcachefs exist, and that ZFS is fine to use license-wise – it's just distribution that's a tad awkward (but not overly so).

rascul · on March 20, 2024

Interesting to note here that btrfs came from Oracle.

tadfisher · on March 20, 2024

"Chris Mason is the founding developer of btrfs, which he began working on in 2007 while working at Oracle. This leads many people to believe that btrfs is an Oracle project—it is not. The project belonged to Mason, not to his employer, and it remains a community project unencumbered by corporate ownership to this day."

https://arstechnica.com/gadgets/2021/09/examining-btrfs-linu...

chungy · on March 21, 2024

Maybe it was Mason's pet project within the company, but there is no ambiguity that Oracle owns it. It is an Oracle project.

arp242 · on March 21, 2024

A copyright line doesn't make it an "Oracle project". That implies a high level of control/involvement in the project.

chungy · on March 21, 2024

It shows belonging, at the very least. The quote from Jim Salter was "The project belonged to Mason, not to his employer" (emphasis added). The copyright line demonstrably and incontestably refutes this claim. btrfs belongs to Oracle.

yjftsjthsd-h · on March 20, 2024

I don't think that would work; all of the changes since the fork are also CDDL and they aren't owned by any one entity/person. (IANAL)

DaSHacka · on March 20, 2024

Would you not basically be starting over at that point, though?

e145bc455f1 · on March 20, 2024

When can we expect debian to ship Bcachefs?

throw0101d · on March 20, 2024

Bcachefs is in the Linux 6.7 kernel, and that is available in Debian unstable and experimental:

* https://packages.debian.org/search?keywords=linux-image-6.7

* Search "6.7.1-1~exp1": https://metadata.ftp-master.debian.org/changelogs//main/l/li...

candiddevmike · on March 20, 2024

Bcachefs is not a drop-in replacement for btrfs yet. It's still missing critical things like scrub:

https://bcachefs.org/Roadmap/

apitman · on March 20, 2024

I really wish you could use custom filesystems such as btrfs with WSL2. I don't think there's currently any way to do snapshotting, which means you can never be sure a backup taken within WSL is corrupt.

cogman10 · on March 20, 2024

You can, it's just a bit of a pain.

https://blog.bryanroessler.com/2020-12-14-btrfs-on-wsl2/

apitman · on March 20, 2024

Nice. Unfortunately for my use case I can't use physical storage devices. I need a something similar to a qcow2 that can be created and completely managed by my app.

mappu · on March 20, 2024

Hope that https://github.com/veeam/blksnap/issues/2 becomes available soon. It's on v7 of posting to linux-block and will make snapshotting available for all block devices.

ShiftControl · on March 22, 2024

Well, sadly the latest patch submission haven't got too much traction (nada, to be precise) or attention from the maintainers:

https://patchwork.kernel.org/project/linux-block/cover/20240...

apitman · on March 20, 2024

This is the first I've heard of blksnap. Looks very interesting. There's not much documentation in the repo. Am I understanding correctly that if I were to build a custom WSL kernel with those patches, I would be able to do snapshotting in WSL today?

Can you give or link to a brief description of how blksnap works?

mappu · on March 21, 2024

You could do block device snapshotting today if you build your block devices on top of LVM. You can also somewhat do it with fsfreeze and remounting your mount point on top of a dm-snapshot target.

Blksnap is better because it does not require setup in advance like LVM, and it does not require interrupting any live users like fsfreeze. It "should" just work with the live block device within the WSL distro.

LWN covered the v1 posting, which is a good start: https://lwn.net/Articles/914031/

This is all somewhat future-looking, and if you only want file-level snapshotting instead of block-level, it's probably easier to try and get btrfs/bcachefs/nilfs2 instead. My WSL2 on Windows 11 shows btrfs present inside /proc/filesystems.

ShiftControl · on March 22, 2024

To piggyback on this:

In the 2024 century the default LVM setup still does not offer to leave some spare space on the volume group. Without that unallocated space it's impossible to create snapshots for LVs.

Oh, and the change-tracking function is non existent for both LVM and dev mapper!

ShiftControl · on March 22, 2024

I think this doc is pretty concise:

https://github.com/veeam/blksnap/blob/master/doc/blksnap.md

aseipp · on March 20, 2024

Hell, even just being able to use XFS would be an improvement, because ext4 has painful degradation scenarios when you hit cases like exhausting the inode count.

(Somewhat related, but there has been a WIP 6.1 kernel for WSL2 "in preview" for a while now... I wonder why it hasn't become the default considering both it and 5.12 are LTS... For filesystems like btrfs I often want a newer kernel to pick up every bugfix.)

chasil · on March 20, 2024

Can this be used? I knew ReactOS would use it natively.

https://github.com/maharmstone/btrfs

apitman · on March 20, 2024

Maybe that could be adapted, but I don't think it would solve my problem currently. Basically I want to be able to do btrfs snapshots from within my WSL distros, so that I can run restic or similar on the snapshots.

xyzzy_plugh · on March 20, 2024

I'm so sad. I was in the btrfs corner for over a decade, and it saddens me to say that ZFS has won. But it has.

And ZFS is actually good. I'm happy with it. I don't think about it. I've moved on.

Sorry, btrfs, but I don't think it's ever going to work out between us. Maybe in a different life.

doublepg23 · on March 20, 2024

I’m personally cheering for Bcachefs now.

matheusmoreira · on March 20, 2024

What saddens me is the fact they still haven't managed to put ZFS into the Linux kernel tree because of licensing nonsense.

different_base · on March 22, 2024

It's not Linux's fault for not merging ZFS. Blame Oracle.

matheusmoreira · on March 23, 2024

I know. I was blaming Oracle, and Sun before them. Why can't they just relicense the entire thing as GPL or some other license that's actually open source and GPL-compatible such as MIT? They own the copyright, don't they? Surely they have the power to do that.

ZFS is too important to not be in Linux. Yet it's 2024 and it's still not in there. Because of licensing copyright nonsense. The blame falls solely on the copyright owners.

mardifoufs · on March 21, 2024

> I'm so sad. I was in the btrfs corner for over a decade, and it saddens me to say that ZFS has won. But it has.

> And ZFS is actually good. I'm happy with it. I don't think about it. I've moved on.

> Sorry, btrfs, but I don't think it's ever going to work out between us. Maybe in a different life.

ZFS has had a worse bug not even 4 months ago. They are both very good filesystems, bugs happen

streb-lo · on March 20, 2024

As someone who uses a rolling release, I use btrfs because I don't want to deal with keeping ZFS up to date.

It's been really good for me. And btrbk is the best backup solution I've had on Linux, btrfs send/receive is a lot faster than rsync even when sending non-incremental snapshots.

kccqzy · on March 20, 2024

Same here: I use a rolling release and btrfs. Personally I really enjoy btrfs's snapshot feature. Most of the time when I need backups it's not because of a hardware failure but because of a fat finger mistake where I rm'ed a file I need. Periodic snapshots completely solved that problem for me.

(Of course, backing up to another disk is absolutely still needed, but you probably need it less than you think.)