More

man8alexd · 2026-02-07T07:01:50 1770447710

I have updated my collection of falsehoods people believe about Linux swap and OOM https://alexeydemidov.com/2025/05/15/falsehoods-people-and-L...

man8alexd · 2026-02-06T04:42:03 1770352923

> There is no penalty for giving a system too much swap (apart from disk space)

There is a huge penalty for having too much swap - swap thrashing. When the active working set exceeds physical memory, performance degrades so much that the system becomes unresponsive instead of triggering OOM.

> Monitor it occasionally, particularly if your system slows down.

Swap doesn't slow down the system. It either improves performance by freeing unused memory, or it is a completely unresponsive system when you run out of memory. Gradual performance degradation never happens.

> give your system so much swap you are sure it exceeds the size of stuff that's running but not used. 4Gb is probably fine for a desktop.

Don't do this. Unless hibernation is used, you only need a few hundred megabytes of free swap space.

rstuart4133 · 2026-02-07T22:39:29 1770503969

> There is a huge penalty for having too much swap - swap thrashing.

Thrashing is the penality for using too much swap. I was saying there is no penality for having a lot of swap available, but unused.

Although trashing is not something you want happening, if your system is thrashing with swap the alternative without having it available is the OMM killer laying waste to the system. Out of those two choices I prefer the system running slowly.

> Gradual performance degradation never happens.

Where on earth did you get that from? It's wrong most of the time. The subject was very well researched in the late 1960's and 1970's. If load ramps up gradually you get a gradual slowdown until the working set is badly exceeded, then it falls off a cliff. This is a modern example, but there are lots of papers from that era showing the usual gradual response, followed by falling off a cliff: https://yeet.cx/r/ayNHrp5oL0. A seminal paper on the subject: https://dl.acm.org/doi/pdf/10.1145/362342.362356

The underlying driver for that behaviour is the disk system being overwhelmed. Say you have 100 web workers that that spend a fair chunk of their time waiting for networked database requests. If they all fit in memory the response is as fast as it can be. Once swapping starts latency increases gradually as more and more workers are swapped in and out while they wait for clients and the database. Eventually the increasing swapping hits the disk's IOPS limit, active memory is swapped out and performance crashes.

The only reason I can think the gradual slow down is not obvious to you is that modern SSD's are so fast, the initial degradation it's not noticeable to desktop user.

> Don't do this. Unless hibernation is used, you only need a few hundred megabytes of free swap space.

A you seem to recognise having lots of swap on hand and unused, even it it's terabytes of it does not effect performance. The question then becomes: what would you prefer to happen in those rare times when swap usage exceeds the optimal few hundred megabytes? Your options are get your desktop app randomly killed by the OOM killer and perhaps lose your work, or the system slows to a crawl and you take corrective action like closing the offending app. When that happens it seems it's popular to blame the swap system for slowing their system down because they temporarily exceeded the capacity of their computer.

man8alexd · 2026-02-08T06:47:12 1770533232

> Thrashing is the penality for using too much swap. I was saying there is no penality for having a lot of swap available, but unused.

Unless you overprovision memory on a machine or have carefully set cgroup limits for all workloads, you are going to have a memory leak and your large unused swap is going to be used, leading to swap thrashing.

> the OMM killer laying waste to the system. Out of those two choices I prefer the system running slowly.

In a swap thrashing event, the system isn't just running slowly but totally unresponsive, with an unknown chance of recovery. The majority of people prefer OOM killer to an unresponsive system. That's why we got OOM killer in the first place.

> If load ramps up gradually you get a gradual slowdown until the working set is badly exceeded, then it falls off a cliff.

Random access latency difference between RAM and SSD is 10^3. When the active working set spills out into swap, liner increase of swap utilization leads to exponential performance degradation. Assuming random access, simple math gives that 0.1% excess causes a 2x degradation, 1% - 10x degradation, 10% - 100x degradation.

> A seminal paper on the subject: https://dl.acm.org/doi/pdf/10.1145/362342.362356

This paper discusses measuring stable working sets and says nothing about performance degradation when your working set increases.

> https://yeet.cx/r/ayNHrp5oL0.

WTF is this graph supposed to demonstrate? Some workload went from 0% to 100% of swap utilization in 30 seconds and got OOM-killed. This is not going to happen with a large swap.

> Once swapping starts latency increases gradually as more and more workers are swapped in and out while they wait for clients and the database

In practice, you never see constant or gradually increasing swap I/O in such systems. You either see zero swap I/O with occasional spikes due to incoming traffic or total I/O saturation from swap thrashing.

> Your options are get your desktop app randomly killed by the OOM killer and perhaps lose your work, or the system slows to a crawl and you take corrective action like closing the offending app.

You seem to be unaware that swap thrashing events are frequently unrecoverable, especially with a large swap. It is better to have a typical culprit like Chrome OOM-killed than to press the reset button and risk filesystem corruption.

rstuart4133 · 2026-02-08T12:01:06 1770552066

> Unless you overprovision memory on a machine or have carefully set cgroup limits for all workloads, you are going to have a memory leak and your large unused swap is going to be used, leading to swap thrashing.

You seem to be very certain about that inevitable memory leak. I guess people can make their own judgements about how inevitable they are. I can't say I've seen a lot of them myself.

But the next bit is total rubbish. A memory leak does not lead to thrashing. By definition if you have a leak the memory isn't used, so it goes to swap and stays there. It doesn't thrash. What actually happens if the leak continues is swap eventually fills up, and then the OOM killer comes out to play. Fortunately it will likely kill the process that is leaking memory.

I've used this behaviour to find which process had a slow leak (it had to be running for months). This has only happened once in decades mind you - these leaks aren't that common. You allocate a lot of swap, and gradually it is filled by the process that has the leak. Because swap is so large once the process leaking memory fills it, it stands out like dogs balls because it's memory consumption is huge.

You notice all of this because, like all good sysadmins, you monitor swap usage and receive alerts when it gets beyond what is normal. But you have time - the swap is large, the system slows down during peaks but recovers when they are over. It's annoying, but not a huge issue.

> In a swap thrashing event, the system isn't just running slowly but totally unresponsive

Again, you are seem to be very certain about this. Which is odd, because I've logged into systems that were thrashing which means they didn't meet my definition of "totally unresponsive". In fact I could only log in because the OOM killer had freed some memory. The first couple of times the OOM killer took out sshd and I had to each for the reset button, but I got lucky one day and could log in. The system was so slow it was unusable for most purposes - but not for the one thing I needed, which was to find out why it had run out of memory. Maybe we have different definitions of "totally", but to me that isn't "totally". In fact if you catch it before the OOM killer fires up and kills god knows what, these "totally unresponsive systems" are salvageable without a reboot.

> This paper discusses measuring stable working sets and says nothing about performance degradation when your working set increases.

Fair enough. Neither link was good.

> You seem to be unaware that swap thrashing events are frequently unrecoverable, especially with a large swap.

Perhaps some of them are, but for me it wasn't the swapping that did the system in. It is always the OOM killer.

> It is better to have a typical culprit like Chrome OOM-killed than to press the reset button and risk filesystem corruption.

The OOM killer on the other hand leaves the system in some undefined state. Some things are dead. Maybe you got lucky and it was just Chrome that was killed, but maybe your sound, bluetooth, or DNS daemons have gone AWOL and things just behave weirdly. Despite what you say, the reset button won't corrupt modern journaled filesystems as they are pretty well debugged. But applications are a different story. If they get hit by a reset or the OOM killer while they are saving your data and aren't using sqlite as their "fopen()", they can wipe the file you are working on. You don't just lose the changes. The entire document is gone. This has happened to me.

I'd take the system taking a few minutes to respond to my request to kill a misbehaving application over the OOM killer any day.

man8alexd · 2026-02-08T13:30:35 1770557435

> You seem to be very certain about that inevitable memory leak.

It is fashionable to disable swap nowadays because everyone has been bitten by a swap thrashing event. Read other comments.

> A memory leak does not lead to thrashing. By definition if you have a leak the memory isn't used, so it goes to swap and stays there.

You assume that leaked memory is inactive and goes to swap. This is not true. Chrome, Gnome, whatever modern Linux desktop apps leak a lot, and it stays in RSS, pushing everything else into swap.

> if the leak continues is swap eventually fills up, and then the OOM killer comes out to play

You assume that the OOM killer comes out to play in time. The larger the swap, the longer it takes for the OOM killer to trigger, if ever, because the kernel OOM-killer is unreliable, so we have a collection of other tools like earlyoom, Facebook oomd and systemd-oomd.

> I've logged into systems that were thrashing

It means that the system wasn't out of memory yet. When it is unresponsive, you won't be able to enter commands into an already open shell. See other comments here for examples.

> The OOM killer on the other hand leaves the system in some undefined state. Some things are dead. Maybe you got lucky and it was just Chrome that was killed, but maybe your sound, bluetooth, or DNS daemons have gone AWOL and things just behave weirdly.

This is not true. By default, the kernel OOM-killer selects one single largest (measured by its RSS+swap) process in the system. By default, systemd, ssh and other socket-activated systemd units are protected from OOM.

rstuart4133 · 2026-02-08T21:49:41 1770587381

> It is fashionable to disable swap nowadays because everyone has been bitten by a swap thrashing event.

If they disable swap they will get hit by the OOM killer. You seem to prefer it over slowing down. I guess that's a personal preference. However, I think it is misleading to say people are being bitten by a swap thrashing event. The "event" was them running out of RAM. Unpleasant things will happen as a consequence. Blaming thrashing or the OOM killer for the unpleasant things is misleading.

> You assume that leaked memory is inactive and goes to swap. This is not true.

At best, you can say "it's not always true". It's definitely gone to swap in every case I've come across.

> It means that the system wasn't out of memory yet.

Of course it wasn't out of memory. It had lots of swap. That's the whole point of providing that swap - so you can rescue it!

> When it is unresponsive, you won't be able to enter commands into an already open shell.

Again that's just plain wrong. I have entered commands into a system is trashing. It must work eventually if thrashing is the only thing going on, because when the system thrashes the CPU utilization doesn't go to 0. The CPU is just waiting for disk I/O after all, and disk I/O is happening at a furious pace. There's also a finite amount of pending disk I/O. Provided no new work is arriving (time for a cup of coffee?) it will get done, and the thrashing will end.

If the system does die other things have happened. Most likely the OOM killer if they follow your advice, but network timeouts killing ssh and networked shares are also a thing. If you are using Windows or MacOS, the swap file can grow to fill most of free disk space, so you end up with a double whammy.

Which brings me to another observation. In desktop OS's, the default is to provide it, and lots of it. In Windows swap will grow to 3 times RAM. This is pretty universal - even Debian will give you twice RAM for small systems. The people who decided on that design choice aren't following some folk law on they read in some internet echo chamber. They've used real data, they've observed when swapping starts being used systems do slow down giving the user some advance warning, when thrashing starts systems can recover rather than die which gives the user opportunity to save work. It is the right design tradeoff IMO.

> By default, the kernel OOM-killer selects one single largest (measured by its RSS+swap) process in the system.

Yes, it does. And if it is a single large process hogging memory you are in luck - the OOM killer will likely do the right thing. But Chrome (and now Firefox) is not a single large process. Worse if the out of memory is caused by say someone creating zillions of logins, they are so small they are the last thing the OOM killer chooses. Shells, daemons, all sorts of critical things go first. The "largest" process first is just a heuristic, one which can be and in my case has been wrong. Badly wrong.

man8alexd · 2026-02-05T14:17:25 1770301045

There is cgroup memory.min

man8alexd · 2026-02-05T14:14:18 1770300858

There is no actual swapping in the modern kernels. Nowadays, it is paging, when the kernel pages out individual unused memory pages, not entire processes, so it keeps all non-blocked processes running, but only necessary memory pages in the memory.

man8alexd · 2026-02-05T12:42:49 1770295369

> However in almost all normal cases, this grossly overestimates the required memory and thus leads to swapping when technically it is not needed.

This is not true. Disabling overcommit doesn't change reclaim and swapping behaviour and doesn't lead to unnecessary swapping.

magicalhippo · 2026-02-05T20:59:10 1770325150

> This is not true.

Yeah that wasn't correct. It will however cause the kernel to refuse memory allocations[1] which could have been allowed, and a lot of programs don't handle that gracefully.

[1]: https://www.kernel.org/doc/html/v6.13/mm/overcommit-accounti...

man8alexd · 2026-02-05T12:01:38 1770292898

When you set cgroup limits, you tell the kernel how to determine when a process is misbehaving and needs to be OOM-killed.

man8alexd · 2026-02-05T11:53:32 1770292412

swap causes thrashing if you have too large swap and no cgroup limits.

man8alexd · 2026-02-05T11:53:14 1770292394

It is possible to measure process memory utilitsation and set appropriate cgroup limits.

man8alexd · 2026-02-05T11:23:53 1770290633

Swap is not a replacement for RAM. It is not just slow. It is very-very-very slow. Even SSDs are 10^3 slower at random access with small 4K blocks. Swap is for allocated but unused memory. If the system tries to use swap as active memory, it is going to become unresponsive very quickly - 0.1% memory excess causes a 2x degradation, 1% - 10x degradation, 10% - 100x degradation.

AtlasBarfed · 2026-02-05T15:14:29 1770304469

What is allocated but unused memory? That sounds like memory that will be used in the near future and we are scheduling in an annoying disk load when it is needed

You are of course highlighting the problem that virtual addressing was intended to over abstract memory resource usage, but it provides poor facilities for power users to finely prioritize memory usage.

The example of this is game consoles, which didn't have this layer. Game writers had to reserve parts of ram fur specific uses.

You can't do this easily in Linux afaik, because it is forcing the model upon you.

man8alexd · 2026-02-05T15:33:31 1770305611

Unused or Inactive memory is memory that hasn't been accessed recently. The kernel maintains LRU (least recently used) lists for most of its memory pages. The kernel memory management works on the assumption that the least recently used pages are least likely to be accessed soon. Under memory pressure, when the kernel needs to free some memory pages, it swaps out pages at the tail of the inactive anonymous LRU.

Cgroup limits and OOM scores allow to prioritize memory usage on a per-process and per-process group basis. madvise(2) syscall allows to prioritize memory usage within a process.

man8alexd · 2026-02-05T11:11:36 1770289896

Binaries and libraries are not paged out. Being read-only, they are simply discarded from the memory. And I'll repeat, actively used executable pages are explicitly excluded from reclaim and never discarded.