I think preventing alarm fatigue is a very good reason to fix issues. But 5% fre...

j1elo · 2025-04-19T13:52:40 1745070760

5% of my 500 GB is 25 GB, which is already a lot of space but understandable. Not many things would fit in there nowadays.

But 5% of a 5 TB volume is 250 GB, that's the size of my whole system disk! Probably not so understandable by the lay person.

kbolino · 2025-04-19T14:39:01 1745073541

This is partly why SSDs just lie nowadays and tell you they only have 75-90% of the capacity that is actually built into them. You can't directly access that excess capacity but the drive controller can when it needs to (primarily to extend the life of the drive).

Some filesystems do stake out a reservation but I don't think any claim one as large as 5% (not counting the effect of fixed-size reservations on very small volumes). Maybe they ought to, as a way of managing expectations better.

For people who used computers when the disks were a lot smaller, or who primarily deal in files much much smaller than the volumes they're stored on, the absolute size of a percentage reservation can seem quite large. And, in certain cases, for certain workloads, the absolute size may actually be more important than the relative size.

But most file systems are designed for general use and, across a variety of different workloads, spare capacity and the impact of (not) keeping it open is more about relative than absolute sizes. Besides fragmentation, there's also bookkeeping issues, like adding one more file to a directory cascading into a complete rearrangement of the internal data structures.

fc417fc802 · 2025-04-22T08:55:59 1745312159

> sare capacity and the impact of (not) keeping it open is more about relative than absolute sizes

I don't think this is correct. At least btrfs works with slabs in the 1 GB range IIRC.

One of my current filesystmes is upwards of 20 TB. Reserving 5% of that would mean reserving 1 TB. I'll likely double it in the near future, at which point it would mean reserving 2 TB. At least for my use case those numbers are completely absurd.

kbolino · 2025-04-22T14:47:37 1745333257

We're not talking about optical discs or backup tapes which usually get written in full in a single session. Hard drive storage in general use is constantly changing.

As such, fragmentation is always there; absolute disk sizes don't change the propensity for typical workloads to produce fragmentation. A modern file system is not merely a bucket of files, it is a database that manages directories, metadata, files, and free space. If you mix small and large directories, small and large files, creation and deletion of files, appending to or truncating from existing files, etc., you will get fragmentation. When you get close to full, everything gets slower. Files written early in the volume's life and which haven't been altered may remain fast to access, but creating new files will be slower, and reading those files afterward will be slower too. Large directories follow the same rules as larger files, they can easily get fragmented (or, if they must be kept compact, then there will be time spent on defragmentation). If your free space is spread across the volume in small chunks, and at 95% full it almost certainly will be, then the fact that the sum of it is 1 TB confers no benefit by dint of absolute size.

Even if you had SSDs accessed with NVMe, fragmentation would still be an issue, since the file system must still store lists or trees of all the fragments, and accessing those data structures still takes more time as they grow. But most NAS setups are still using conventional spinning-platter hard drives, where the effects of fragmentation are massively amplified. A 7200 RPM drive takes 8.33 ms to complete one rotation. No improvements in technology have any effect on this number (though there used to be faster-spinning drives on the market). The denser storage of modern drives improves throughput when reading sequential data, but not random seek times. Fragmentation increases the frequency of random seeks relative to sequential access. Capacity issues tend to manifest as performance cliffs, whereby operations which used to take e.g. 5 ms suddenly take 500 or 5000. Everything can seem fine one day and then not the next, or fine on some operations but terrible on others.

Of course, you should be free to (ab)use the things you own as much as you wish. But make no mistake, 5% free is deep into abuse territory.

Also, as a bit of an aside, a 20 TB volume split into 1 GB slabs means there's 20,000 slabs. That's about the same as the number of 512-byte sectors in a 10 MB hard drive, which was the size of the first commercially available consumer hard drive for the IBM PC in the late 1980s. That's just a coincidence of course, but I find it funny that the numbers are so close.

Now, I assume the slabs are allocated from the start of the volume forward, which means external slab fragmentation is nonexistent (unless slabs can also be freed). But unless you plan to create no more than 20,000 files, each exactly 1 GB in size, in the root directory only, and never change anything on the volume ever again, then internal slab fragmentation will occur all the same.

fc417fc802 · 2025-04-22T21:14:42 1745356482

Yes thank you I am aware of what fragmentation is.

There are two sorts of fragmentation that can occur with btrfs. Free space and file data. File data is significantly more difficult to deal with but it "only" degrades read performance. It's honestly a pretty big weakness of btrfs. You can't realistically defragment file data if you have a lot of deduplication going on because (at least last I checked) the tooling breaks the deduplication.

> If your free space is spread across the volume in small chunks, and at 95% full it almost certainly will be

Only if you failed to perform basic maintenance. Free space fragmentation is a non-issue as long as you run the relevant tooling when necessary. Chunks get compacted when you rebalance.

Where it gets dicey is that the btrfs tooling is pretty bad at handling the situation where you have a small absolute number of chunks available. Even if you theoretically have enough chunks to play musical chairs and perform a rebalance the tooling will happily back itself into a corner through a series of utterly idiotic decisions. I've been bitten by this before but in my experience it doesn't happen until you're somewhere under 100 GB of remaining space regardless of the total filesystem size.

kbolino · 2025-04-22T21:51:04 1745358664

If compaction (= defragmentation) runs continuously or near-continuously, it results in write amplification of 2x or more. For a home/small-office NAS (the topic at hand) that's also lightly used with a read-heavy workload, it should be fine to rely on compaction to keep things running smoothly, since you won't need it to run that often and you have cycles and IOPS to spare.

If, under those conditions, 100 GB has proven to be enough for a lot of users, then it might make sense to add more flexible alarms. However, this workload is not universal, and setting such a low limit (0.5% of 20 TB) in general will not reflect the diverse demands that different people put on their storage.

kimixa · 2025-04-22T06:12:12 1745302332

Also Synology use btrfs, a copy-on-write filesystem - that means there are operations that you might not expect that require allocation of new blocks - like any write, even if overwriting an existing file's data.

And "unexpected" failure paths like that are often poorly tested in apps.

leptons · 2025-04-22T08:35:06 1745310906

No matter how many TB of online HD storage I have, hard disks are just a temporary buffer for my tape drives.

buserror · 2025-04-22T14:00:41 1745330441

At home I have a 48xLTO5 changer with 4 drives (I picked for a song a while back! I actually don't need it but heck, it has a ROBOT ARM), and at work I'm currently provisioning a 96 LTO 9 tape drive dual-rack. With 640 tapes available :-)

I'm a STRONG believer in tapes!

Even LTO 5 gives you a very cheap 1.5TB of clean, pretty much bulletproof storage.. You can pick a drive (with a SAS HBA card) for less than $200, there is zero driver issue (SCSI, baby); the linux tape changer code is stable since 1997 (with a port to VMS!).

Tape FTW :-)

leptons · 2025-04-23T16:57:26 1745427446

I don't have one but I'd definitely take a tape changer if it weren't too expensive. It would be amazing to have 72TB of storage just waiting to be filled, without needing to go out into my garage to load a tape up.

LTO tapes have really changed my life, or at least my mental health. Easy and robust backup has been elusive. DVD-R was just not doing it for me. Hard drives are too expensive and lacked robustness. My wife is a pro photographer so the never-ending data dumps had filled up all our hard drives, and spending hundreds of dollars more on another 2-disk mirror RAID, and then another, and another was just stupid. Most of the data will only need to be accessed rarely, but we still want to keep it. I lost sleep over the mountains of data we were hoarding on hard drives. I've had too many hard drives just die, including RAIDs being corrupted. LTO tape changed all of that. It's relatively cheap, and pretty easy and fast compared to all the other solutions. It's no wonder it's still being used in data centers. I love all the data center hand-me-downs that flood eBay.

And I do love hearing the tapes whir, it makes me smile.

worthless-trash · 2025-04-22T10:19:50 1745317190

This is an area that i'm quickly growing into, what are you curently using and what should I stay away from ?

leptons · 2025-04-22T19:09:13 1745348953

I got a used internal LTO5 tape drive on eBay for about $150, and then an HBA card to connect it to for about $25 or $30. I bought some LTO5 tapes, and typically I pay about $3.50/TB on eBay for new/used tapes. Many sellers charge far more for tapes, but occasionally I find a good deal. Most tapes are not used very much and have lots of life left in them (they have a chip inside the tape that tracks usage).

Then I scored another 3 used LTO5 tape drives on eBay for about $100, they all worked. I mainly use 1 tape drive. I have it running on an Intel i5 system with an 8-drive RAID10 array (cheap used drives, with a $50 9260-8i hardware RAID card), which acts as my "offsite" backup out in my detached garage - it's off most of the time (cold storage?) unless I'm running a backup. I can loose up to 2 drives without losing any data, and it's been running really well for years. I have 3 of these RAID setups in 3 different systems, they work great with the cheapest used drives from Amazon. I'm not looking for high performance, I just need redundancy. I've had to replace maybe 3 drives across all 3 systems due to failure over the last 7 years.

On Windows the tape drive with LTFS was not working well, I think due to Windows Defender trying to test the files as it was writing them, causing a lot of "shoeshining" of the tape, but I think Windows Defender can be disabled. But I bought tape backup software from https://www.iperiusbackup.com - it just works and makes backups simple to set up and run. I always verify the backup. If something is really important I'll back up to at least 2 tapes. Some really important stuff I will generate parity files (win WinPar) and put those on tape too. Non-encrypted the drive runs at the full 140MB/s, but with encryption it runs at about 60MB/s, because I guess the tape drive is doing the encryption.

I love it, it has changed my data-hoarding life. At $3.50/TB and 140MB/s and 1.5TB per tape, it can't be beat by DVD-R or hard drives for backup. Used LTO5 is really in a sweet spot right now on eBay, but LTO6 is looking good too recently (2.5TB/tape). LTO6 drives can read LTO5 tapes, so there's a pretty easy upgrade path. I also love that there is a physical write-protect switch on the tapes, which hard drives don't have. If you plug in a hard drive to an infected system, that hard drive could easily be compromised if you don't know your system is infected.

worthless-trash · 2025-04-24T14:18:52 1745504332

Thank you for the detailed write up, you've started me on the path!

runamok · 2025-04-19T14:49:16 1745074156

100%. Those disks are likely working much harder moving the head all over the place to find those empty spaces when it writes.

gambiting · 2025-04-22T05:39:23 1745300363

....do you think the drive doesn't know where the empty space actually is?

MrDrMcCoy · 2025-04-22T07:45:15 1745307915

Drives blindly store and retrieve blocks wherever you tell them, with no awareness of how or if they relate to one another. It's a filesystem's job to keep track of what's where. Filesystems get fragmented over time, and especially as they get full. The more full they get, the more seeking and shuffling they have to do to find a place to write stuff. This will be the case even after the last spinning drive rusts out, as even flash eventually has to contend with fragmentation. Heck, even RAM has to deal with fragmentation. See the discussion from the last few weeks about the ongoing work to figure out a contiguous memory allocator in Linux. It's one of the great unsolved problems in general comparing that you and your descendants would be set for life if you could solve.

akx · 2025-04-22T10:50:19 1745319019

Not quite, AFAIK? Drive controllers may internally remap blocks to physical disk blocks (e.g. when a bad sector is detected; see the SMART attribute Reallocated Sector Count).

kbolino · 2025-04-22T14:09:45 1745330985

Logical Block Addressing (LBA) by its very nature provides no hard guarantees about where the blocks are located. However, the convention that both sides (file systems and drive controllers) recognize is that runs of consecutive LBAs generally refer to physically contiguous regions of the underlying storage (and this is true for both conventional spinning-platter HDDs as well as most flash-based SSDs). The protocols that bridge the two sides (like ATA, SCSI, and NVMe) use LBA runs as the basic unit of accessing storage.

So while block remapping can occur, and the physical storage has limits on its contiguity (you'll eventually reach the end of a track on a platter or an erasable page in a flash chip), the optimal way to use the storage is to put related things together in a run of consecutive LBAs as much as possible.

MrDrMcCoy · 2025-04-22T17:19:40 1745342380

Sure, but bad block tracking and error correction are pretty different from the implied file/volume awareness I was responding to.

kbolino · 2025-04-22T17:54:46 1745344486

Yes, to be clear, the drive controller generally (*) has no concept of volumes or files, and presents itself to the rest of the computer as a flat, linear collection of fixed-size logical blocks. Any additional structure comes from software running outside the drive, which the drive isn't aware of. The conventional bias that adjacent logical blocks are probably also adjacent physical blocks merely allows the abstraction to be maintained while also giving the file system some ability to encourage locality of related data.

* = There are some exceptions to this, e.g. some older flash controllers were made that could "speak" FAT16/32 and actually know if blocks were free or not. This particular use was supplanted by TRIM support.

genewitch · 2025-04-22T07:15:49 1745306149

I think you'll find that the word "find" doesn't mean "has to search", like one can find their nose in the middle of their face, if one desires.

Change the word to "seek" and it may make more sense.

fc417fc802 · 2025-04-22T08:59:47 1745312387

It makes more sense but it's not true for the modern CoW filesystems that I'm familiar with. Those allocate free space in slabs that they write to sequentially.

kbolino · 2025-04-22T17:34:21 1745343261

Also, CoW isn't some kind of magic. There are two meanings I can think of here:

A) When you modify a file, everything including the parts you didn't change is copied to a new location. I don't think this is how btrfs works.

B) Allocated storage is never overwritten, but modifying parts of a file won't copy the unchanged parts. A file's content is composed of a sequence (list or tree) of extents (contiguous, variable-length runs of 1 or more blocks) and if you change part of the file, you first create a new disconnected extent somewhere and write to that. Then, when you're done writing, the file's existing extent limits are resized so that the portion you changed is carved out, and finally the sequence of extents is set to {old part before your change}, {your change}, {old part after your change}. This leaves behind an orphaned extent, containing the old content of the part you changed, which is now free. From what evidence I can quickly gather, this is how btrfs works.

Compared to an ordinary file system, where changes that don't increase the size of a file are written directly to the original blocks, it should be fairly obvious that strategy (B) results in more fragmentation, since both appending to and simply modifying a file causes a new allocation, and the latter leaves a new hole behind.

While strategy (A) with contiguous allocation could eliminate internal (file) fragmentation, it would also be much more sensitive to external (free space) fragmentation, requiring lots of spare capacity and/or frequent defrag.

Either way, the use of CoW means you need more spare capacity, not less. It's designed to allow more work to be done in parallel, as fits modern hardware and software better, under the assumption that there's also ample amounts of extra space to work with. Denying it that extra space is going to make it suffer worse than a non-CoW file system would.

fc417fc802 · 2025-04-22T21:22:03 1745356923

Which is exactly why you periodically do maintenance to compact the free space. Thus it isn't an issue in practice unless you have a very specific workload in which case you should probably be using a specialized solution. (Although I've read that apparently you can even get a workload like postgres working reasonably well on zfs which surprises me.)

If things get to the point where there's over 1 TB of fragmented free space on a filesystem that is entirely the fault of the operator.

kbolino · 2025-04-22T22:21:35 1745360495

What argument are you driving at here? The smaller the free space, the harder it is to run compaction. The larger the free space, the easier it is. There are some confounding forces in certain workloads, but the general principle stands.

"Your free space shouldn't be very fragmented when you have such large amounts free!" is exactly why you should keep large amounts free.

kbolino · 2025-04-22T15:04:19 1745334259

If you delete files, or append to existing files, then the promises of the initial allocation strategy go out the window.