
Engineering is a compromise. The article shows most gains come from specialising the memory allocator. The thing to remember is that some projects are multithreaded, and allocate in one thread, use the data in another, and maybe deallocate in a third. The allocator needs to handle this. So a speedup for one project may be a crash in another.
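
In code, the pattern looks like this (a minimal pthreads sketch of my own, not from the article):

    /* malloc() in one thread, use in a second, free() in a third.
       Any general-purpose allocator must make this safe; a naive
       thread-local-only allocator would corrupt its heap here. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    static void *producer(void *arg) {
        (void)arg;
        char *buf = malloc(1024);           /* allocated in thread A */
        if (buf) strcpy(buf, "frame data");
        return buf;
    }

    static void *consumer(void *arg) {
        free(arg);                          /* deallocated in thread B */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        void *buf;
        pthread_create(&a, NULL, producer, NULL);
        pthread_join(a, &buf);              /* main thread uses the data */
        pthread_create(&b, NULL, consumer, buf);
        pthread_join(b, NULL);
        return 0;
    }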

Also, what about reallocation strategy? Some programs preallocate and never touch malloc again; others constantly release and acquire. How well do they handle fragmentation? What is the uptime (10 seconds or 10 years)? Sometimes the choice of allocator is the difference between long-term stability and short-term speed.

I experimented with different allocators while developing a video editor that caches frames, testing with 4K video. At 32 MB per frame and 60 fps, that's almost 2 GB per second per track. You quickly hit allocator limitations, and realise that the vanilla glibc allocator at least offers the best long-term stability. But for short-running benchmarks it's the slowest.

As already pointed out, engineering is a compromise.




Mimalloc is a general-purpose allocator like jemalloc / tcmalloc. glibc is known to have a pretty bad allocator that modern allocators like mimalloc and the latest tcmalloc (not the one available by default in Ubuntu) run laps around. While of course the speedup will vary, the benchmarks generally show an across-the-board speedup (whether that matters for any given application is something else entirely). As for crashes, these are all general-purpose multithreaded allocators and behave no differently from glibc (modulo bugs, which can exist equally in glibc).


Agreed for most short-running apps. I updated my comment to reflect issues with apps that are constantly reallocating and running for longer than 60 seconds. You are absolutely correct for most short-running apps; in 99% of cases it's recommended to replace glibc. However, there is an app or two where only glibc's stability avoids triggering pathological use cases, and you have no choice. Hence why it's the default: there are fewer crashes in pathological cases, and the devs are exhausted dealing with crashing bugs which can be eliminated by using the slower allocator.


I looked at your updated post, and it seems you’re operating under wildly incorrect assumptions.

1. Fragmentation: mimalloc and the newest tcmalloc definitely handle this better than glibc. This is well established in many many many benchmarks.

2. In terms of process lifetime, mimalloc (Microsoft Cloud) and tcmalloc (Google Cloud) are designed to run in massive long-lived services that continually allocate and deallocate over long periods of time. Indeed, they have much better system behavior: allocating a bunch of objects and then freeing them eventually releases the memory back to the OS (something glibc does not do).
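
You can observe this yourself with a rough Linux-only harness (my sketch, not anything from these projects): run it under each allocator, e.g. via LD_PRELOAD, and compare whether the resident set drops back after the frees.

    /* Allocate ~640 MB, touch it, free it all, and print VmRSS before
       and after, to see whether the allocator returned pages to the OS. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void print_rss(const char *label) {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");  /* Linux-specific */
        if (!f) return;
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "VmRSS:", 6) == 0)
                printf("%s %s", label, line);
        fclose(f);
    }

    int main(void) {
        enum { N = 10000, SZ = 64 * 1024 };
        static char *p[N];
        for (int i = 0; i < N; i++) {
            p[i] = malloc(SZ);
            if (p[i]) memset(p[i], 1, SZ);  /* fault pages in so they count */
        }
        print_rss("after alloc:");
        for (int i = 0; i < N; i++)
            free(p[i]);
        print_rss("after free: ");
        return 0;
    }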

> However, there is an app or two where only glibc's stability avoids triggering pathological use cases, and you have no choice.

I’m going to challenge you to please produce an example with mimalloc or the latest tcmalloc (or heck, even any real data point from some other popular allocator, versus vague anecdotes). This simply is not something these allocators suffer from; such crashes would be major bugs the projects would fix.


I tried to use mimalloc and snmalloc and in both cases got crashes I don't get with glibc when interoperating with other libraries (libusb, jack, one that I suspect to be in the Nvidia driver) :(


If you are not properly overriding the allocator consistently for everything within an executable, that’s entirely possible (e.g. linking against one allocator and then linking with a dynamic library that’s using a different one). Without a specific repro it’s hard to distinguish PEBCAK from a legit bug. Also it certainly can’t be the Nvidia driver since that’s not running anything in your process.


The explanation here is usually simpler. Any major change like this can lead to crashes.

Say I have three libraries, biz, bar, and bif, and each has bugs. You've used biz. When the system was unstable, you debugged until it worked (either by finding the real bug, or by making an innocuous change that masked it).

If you switch libraries, the bugs which go away are the ones which aren't affecting you, since your software works. On the other hand, you have a whole new set of bugs.

This comes up when upgrading libraries too. More bugs are usually fixed than introduced, but there's often a debug cycle.


> Also it certainly can’t be the Nvidia driver since that’s not running anything in your process.

A huge chunk of a modern GPU driver is part of the calling process, loaded like a regular library. Just spot-checking Chrome's GPU process, there are dozens of threads created by a single 80+ MB Nvidia DLL. And this isn't unusual; every GPU driver has massive libraries loaded into the app using the GPU, often including entire copies of LLVM for things like shader compilers.


Challenge yourself to produce some numbers first. If there are many many many benchmarks it shouldn't be too difficult to link one. Just saying something is "well established" doesn't really help without some other context.



Oddly enough I've written about those results before:

https://www.dropbox.com/scl/fi/evnn6yoornh9p6l7nq1t9/Irrepro...


But we can all agree that glibc is trash, right? :)

I’d be careful extrapolating from just one benchmark but generally if I had to choose I’d pick the new tcmalloc if I could. It seems to be a higher quality codebase.


The problem with the current tcmalloc, which I agree is very good, is the difficulty of integrating it with random builds. It works well for large systems and especially those that are already bazelized, but I couldn't whip up a quick way to link it to `jq` in this gist.


Thanks for that. I have an innate skepticism of benchmarks from a company that wants to show its solution is best, and it seems their latest results are from 2021, but I couldn't find a better comparison myself either.

I do note that rpmalloc (old), Hermes (not public? doesn't compare with mimalloc) and also snmalloc (also Microsoft) have benchmarks of their own showing themselves to be best in some circumstances.

https://github.com/mjansson/rpmalloc-benchmark

https://arxiv.org/abs/2109.02922

https://github.com/SchrodingerZhu/bench_suite/blob/master/ou...


More than that: based on the basic architecture of jemalloc, mimalloc, and tcmalloc, you should always expect their fragmentation behavior to be better for long-running software than the glibc malloc's (at the expense of consuming substantially more memory for very small programs with only a few allocations of a given size). The glibc malloc has nearly pessimal fragmentation behavior; you are very confused here.


> you are very confused here

Is this addressed to me? If so would you do me the kindness of attempting to disabuse me of my confusion?


Wait, no, sorry. That reply was supposed to go to a different post.


> This simply is not something these allocators suffer from; such crashes would be major bugs the projects would fix.

Not OP, but the following logic shows why this claim is bogus. In short: If two non-garbage-collecting memory allocators do anything differently -- other than behave as perfect "mirror images" of each other, so that whenever one allocates byte i, the other allocates byte totalMem-i -- then there exists a program that crashes on one but not the other, and vice versa.

In detail:

If two allocators do not allocate exactly the same blocks of memory for the same underlying sequence of malloc() and free() calls, then there exists a program which, if built twice (once against each allocator) and run on the same input, will after some time produce different patterns of memory fragmentation.

The first time this difference appears -- let's say, after the first n calls to either malloc() or free() -- the two executables will have the same total number of bytes allocated, but the specific ranges of allocated bytes will be different. The nth such call must be a malloc() call (since if it were a free() call, and allocated ranges were identical after the first n-1 such calls, they would still be identical after the first n, contradicting our assumption that they are different). Then for each executable, this nth malloc() call either allocates a block at or some distance past the end, or it subdivides some existing free block. We can remove the latter possibility (and simplify the proof) by assuming that there is no more memory available past the end of the highest byte thus far allocated (this is allowed, since a computer with that amount of memory could exist).

Now have both programs call free() on every allocated block except the one allocated in operation n. Let the resulting free range at the start of memory (before the sole remaining allocated block) have total length s1 in executable 1 and s2 in executable 2, and let the resulting free range at the end of memory (after that sole remaining allocated block) have length e1 in executable 1 and e2 in executable 2. By assumption, s1≠s2 and e1≠e2. Now have both executables call malloc() twice, with sizes s1 and e1, larger request first. Then, unless s1=e2, executable 1 can satisfy both malloc()s, but executable 2 can satisfy at most one of them (the two free ranges total s1+e1 bytes in each executable, but only in executable 1 do they have exactly those two sizes). Similarly, calling malloc() with sizes s2 and e2, larger first, will succeed in executable 2 but not in executable 1, again unless s1=e2 holds.

What if s1=e2 does hold, though? This occurs when, say, one executable allocates the block 100 bytes from the start of memory, while the other allocates it 100 bytes from the end. In this case, all we need is to keep some second, symmetry-breaking block around at the end in addition to the block allocated by operation n -- that is, a block for which it does not hold that one allocator allocates the mirror-image memory range of the other. (If no such block exists, then the two allocators are perfect mirror images of each other.)
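
If the abstract argument is hard to follow, here is a toy concretisation (mine, not the construction above verbatim): a fixed 500-byte pool managed by first-fit in one "executable" and best-fit in the other. The same call sequence leaves gaps of 150+150 in one and 250+50 in the other, so a final malloc(200) succeeds only under best-fit (and a pair of malloc(150)s would do the reverse).

    /* Two placement strategies over the same fixed pool diverge, after
       which one can satisfy a request the other cannot. */
    #include <stdio.h>

    #define POOL 500

    /* Byte-map heap: used[i] != 0 means byte i is allocated. */
    static char heap_ff[POOL], heap_bf[POOL];

    /* best = 0: first-fit (lowest gap that fits).
       best = 1: best-fit (smallest gap that fits).
       Returns the block's offset, or -1 if no gap is big enough. */
    static int toy_alloc(char *used, int len, int best) {
        int pick = -1, pick_len = 0;
        for (int i = 0; i < POOL; ) {
            if (used[i]) { i++; continue; }
            int j = i;
            while (j < POOL && !used[j]) j++;   /* free run [i, j) */
            if (j - i >= len) {
                if (!best) { pick = i; break; }
                if (pick < 0 || j - i < pick_len) { pick = i; pick_len = j - i; }
            }
            i = j;
        }
        for (int k = 0; pick >= 0 && k < len; k++) used[pick + k] = 1;
        return pick;
    }

    static void toy_free(char *used, int off, int len) {
        for (int k = 0; k < len; k++) used[off + k] = 0;
    }

    int main(void) {
        /* Identical call sequence on both heaps. */
        int a_ff = toy_alloc(heap_ff, 250, 0), a_bf = toy_alloc(heap_bf, 250, 1);
        toy_alloc(heap_ff, 100, 0); toy_alloc(heap_bf, 100, 1);
        toy_free(heap_ff, a_ff, 250); toy_free(heap_bf, a_bf, 250);
        /* Divergence: both now see gaps of 250 (front) and 150 (back).
           First-fit puts the next block in the front gap, best-fit in
           the back one, leaving gaps of 150+150 vs 250+50. */
        toy_alloc(heap_ff, 100, 0); toy_alloc(heap_bf, 100, 1);
        printf("malloc(200): first-fit=%d best-fit=%d\n",
               toy_alloc(heap_ff, 200, 0), toy_alloc(heap_bf, 200, 1));
        /* Prints: malloc(200): first-fit=-1 best-fit=0 */
        return 0;
    }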


I really have no idea what you’re getting at here. This isn’t embedded where there’s a fixed pool to allocate out of. If there’s insufficient space it’ll get more virtual memory from the OS.

Also, nothing you’ve said actually says that the other allocator will be the worse one. Indeed, glibc is known to hold onto memory longer and fragment more than allocators like mimalloc and tcmalloc, so I’m still at a loss to understand how, even if what you wrote is correct (which I don’t believe it is), it follows that glibc is the one that won’t crash. If you’re confident in your proof by construction, please post a repro that we can all take a look at.


> If there’s insufficient space it’ll get more virtual memory from the OS.

Swap space is finite too.

> Also, nothing you’ve said actually says that the other allocator will be the worse one.

I'm not claiming that either is worse. I'm showing mathematically that for any two allocators that behave differently at all (with the one tiny exception of a pair of allocators that are perfect mirror images of each other), it's possible to craft a program that succeeds on one but fails on the other.

I didn't say so explicitly as I thought it was obvious, but the upshot is: it's never completely safe to just change the allocator. Even if 99% of the time one works better than the other, there's provably a corner case where it will fail and the other will not.


I should have been more explicit about the assumptions I make about an allocator:

1. If malloc() is called when there exists a contiguous block of free memory with size >= the argument to malloc(), the call will succeed. I think you'll agree that this is reasonable.

2. Bookkeeping (needed at least for tracking the free list, plus any indexes on top) uses the same amount of malloc()able memory in each allocator. I.e., if malloc(x) at some point in time reduces the number of bytes that are available to future malloc() calls by y >= x bytes under allocator 1, it must reduce the number of bytes available to future malloc() calls by y under allocator 2 if called at the same point in time as well. This may not hold exactly in practice, but it's a very good approximation -- it's possible to store the free list "for free" by using the first few bytes as next and prev pointers in a doubly linked list.
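
For illustration, that "free" bookkeeping looks roughly like this (a sketch of the classic trick, not any particular allocator's code):

    #include <stddef.h>

    /* A freed range of at least sizeof(struct free_block) bytes is
       reinterpreted as this header and linked into a doubly linked
       free list, so tracking free memory costs no allocatable bytes. */
    struct free_block {
        size_t             size;  /* length of this free range */
        struct free_block *next;  /* next free range            */
        struct free_block *prev;  /* previous free range        */
    };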

To head another possible objection off at the pass: if the OS allocates lazily (i.e., it doesn't commit backing store at malloc() time, instead waiting until there is an actual access to the page, as Linux does), this doesn't change anything: address space (even 64-bit address space) is still finite, and it is still being allocated eagerly. In practice, you could make the differentially crashing program crash much faster by calling memset() immediately on every freshly malloc()ed block to render the lazy commit ineffective -- then you would only need to exhaust physical RAM + swap, rather than the complete 64-bit virtual address space.
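
E.g. (a sketch):

    #include <stdlib.h>
    #include <string.h>

    /* malloc() wrapper that faults in every page immediately, so lazy
       commit can't hide the allocation: each call consumes real
       RAM/swap, not just address space. */
    static void *malloc_touched(size_t n) {
        void *p = malloc(n);
        if (p) memset(p, 0xff, n);
        return p;
    }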


Swapping out allocators won't cause some programs to crash and others not to, unless your program is already using up all available RAM, or your program has a bug and the new allocation pattern happens to be more likely to turn it into an invalid memory access that crashes.

This allocation-pattern idea is unlikely to show up in any real application except at the absolute limit, where you're exhausting RAM and the OOM killer gets involved. Even then, I think you're not going to see the allocator be much of a differentiating factor.


Funny, because the situations where I've had to replace glibc have always been long-running servers that allocate often. glibc: ballooning memory, eventually a crash. jemalloc: stable as a rock.


> I experimented with different allocators while developing a video editor that caches frames, testing with 4K video. At 32 MB per frame and 60 fps, that's almost 2 GB per second per track. You quickly hit allocator limitations, and realise that the vanilla glibc allocator at least offers the best long-term stability. But for short-running benchmarks it's the slowest.

I also work with large (8K) video frames [1]. If you're talking about the frames themselves, 60 allocations per second is nothing. In glibc's case, it's slow for just one reason: each allocation exceeds DEFAULT_MMAP_THRESHOLD_MAX (= 32 MiB on 64-bit platforms), so (as documented in the mallopt manpage) you cannot convince glibc to cache it. It directly requests the memory from the kernel with mmap and returns it with munmap each time. Those system calls are a little slow, and faulting in each page of memory on first touch is, in my case, slow enough that it's impossible to meet my performance goals.
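
For the curious, the failed tuning attempt looks like this (a sketch; the 32 MiB cap is the one documented in the mallopt manpage, per the above):

    #include <malloc.h>   /* glibc-specific: mallopt(), M_MMAP_THRESHOLD */
    #include <stdlib.h>

    int main(void) {
        /* Ask glibc to serve 32 MB frames from the heap rather than mmap.
           Values above DEFAULT_MMAP_THRESHOLD_MAX (32 MiB on 64-bit) are
           rejected, so this has no effect: frames keep hitting the
           mmap/munmap and page-fault paths on every cycle. */
        mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024);

        for (int i = 0; i < 600; i++) {   /* ~10 s of frames at 60 fps */
            void *frame = malloc(32u << 20);
            /* ...decode into frame, hand off, display... */
            free(frame);                  /* munmap, every single time */
        }
        return 0;
    }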

The solution is really simple: use your own freelist (on top of the general-purpose allocator or mmap, whatever) for just the video frames. It's a really steady number of allocations that are exactly the same size, so this works fine.
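
Something like this (a sketch with made-up names; single-threaded for brevity, so add a mutex for cross-thread use):

    #include <stdlib.h>

    #define FRAME_SIZE (32u << 20)        /* one cached 4K frame */

    /* Store the free-list link inside the (huge) frame buffer itself. */
    typedef union frame {
        union frame *next;
        char         bytes[FRAME_SIZE];
    } Frame;

    static Frame *pool;                   /* frames ready for reuse */

    static void *frame_alloc(void) {
        if (pool) {                       /* hot path: pop a cached frame */
            Frame *f = pool;
            pool = f->next;
            return f;
        }
        return malloc(sizeof(Frame));     /* cold path: go to malloc/mmap */
    }

    static void frame_free(void *p) {     /* push back onto the pool;   */
        Frame *f = p;                     /* nothing is returned to the */
        f->next = pool;                   /* general-purpose allocator  */
        pool = f;
    }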

[1] in UYVY format, this is slightly under 64 MiB; in I420 format, this is slightly under 48 MiB.


Buffers of that size are also in tcmalloc's "whatever" class, right? It just does a smarter[1] job of not unmapping them.

1: from a certain point of view


I’d be shocked if jemalloc or tcmalloc had issues with that workload.

Do you have a minimal reproducing example?


Sure, in plain C because why not. <https://gist.github.com/scottlamb/459a3ce6230be67bf4ceb1f1a8...>

There's one other element I didn't mention in my previous comment, which is a thread handoff. It may be significant because it trashes any thread-specific arena and/or because it introduces a little bit of variability over a single malloc at a time.

For whatever reason the absolute rate on my test machine is much higher than in my actual program (my actual program does other things with a more complex threading setup, has multiple video streams, etc.) but you can see the same effect of hitting the mmap, munmap, and page fault paths that really need not ever be exercised after program start.

In my actual (Rust-based) program, adding like 20 lines of code for the pooling was a totally satisfactory solution and took me less time than switching general-purpose allocator, so I didn't try others. (Also, my program supports aarch64 and iirc the vendored jemalloc in the tikv-jemallocator crate doesn't compile cleanly there.)


Sorry, I'm struggling to make sense of this comment. I don't know C or C compilers very well at all, but I read the full gist and felt I learned a bunch of stuff and got a lot of value from it.

But then when I read this top comment, it makes me concerned I've completely misunderstood the article. From the tone of this comment, I assume that I shouldn't ever do what's talked about in this gist and it's a terrible suggestion that overlooks all these complexities that you understand and have referenced with rhetorical-looking questions.

Any chance you could help me understand if the original gist is good, makes any legitimate points, or has any value at all? Because I thought it did until I saw this was the top comment, and it made me realise I'm not smart enough to be able to tell. You sound like you're smart enough to tell, and you're telling me only bad things.


I'll have a go at explaining: The process described in the article isn't a simple recipe that you can apply to any program to achieve similar results.

`jq` is a command-line program that fires up to do one job, and then dies. For such a program, the only property we really want to optimise is execution speed. We don't care about memory leaks, or how much memory the process uses (within reason). `jq` could probably avoid freeing memory completely, and that would be fine. So using a super-stupid allocator is a big win for `jq`. You could probably write your own and make it run even faster.
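
For instance, the "never free" strategy is only a dozen lines (a hypothetical sketch, not what the article or jq actually does):

    #include <stdlib.h>

    #define CHUNK (1u << 20)              /* grab memory 1 MiB at a time */

    static char  *arena;                  /* current chunk's bump pointer */
    static size_t arena_left;             /* bytes remaining in the chunk */

    /* Bump allocation: no headers, no free list, no locks. */
    static void *bump_alloc(size_t n) {
        n = (n + 15) & ~(size_t)15;       /* keep 16-byte alignment */
        if (n > arena_left) {             /* chunk exhausted: grab another
                                             (the old chunk's tail is
                                             simply abandoned) */
            size_t sz = n > CHUNK ? n : CHUNK;
            arena = malloc(sz);
            if (!arena) return NULL;
            arena_left = sz;
        }
        void *p = arena;
        arena += n;
        arena_left -= n;
        return p;
    }

    /* free() is a no-op; the OS reclaims everything at process exit. */
    static void bump_free(void *p) { (void)p; }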

But for a program with different runtime characteristics, the results might be radically different. A long-lived server program might need to avoid memory bloat more than it needs to run fast. Or it might need to ensure stability more than speed or size. Or maybe speed does matter, but it's throughput speed rather than latency. Each of those cases need to be measured differently, and may respond better to different optimisation strategies.

The comment that confused you is just trying to speak a word of caution about applying the article's recipe in a simplistic way. In the real world, optimisation can be quite an involved job.


> The comment that confused you is just trying to speak a word of caution about applying the article's recipe in a simplistic way. In the real world, optimisation can be quite an involved job.

I think that's what confused and irritated me. There's a lot of value and learning in the gist - I've used jq regularly in previous jobs; this is the real world, and valuable to many. But the top comment (at the time I responded) largely trashes the submission rhetorically, based purely on the title.

I get that the gist won't make _everything_ faster, but I struggle to believe that any HN reader would genuinely think that's either true or a point the author is trying to make. The literal first sentence of the submission clarifies that the discussion is purely about jq.

Anyone can read a submission, ignore any legitimate value in it, pick some cases the submission wasn't trying to address, and then use those cases to rhetorically talk it down. I'm struggling to understand why/how that bubbles to the top in a place of intellectual curiosity like HN.

Edit: I should practice what I preach. Purely cautionary or negative conversation and feedback isn't a world that anyone really wants! Thanks for the response, I really appreciated it :) It was helpful in confirming my understanding that this submission does genuinely improve jq on Ubuntu. Caution is often beneficial and necessary, and I think the original comment I responded to could have made a better world with a single sentence confirming that this gist is actually valuable in the context it defines.


That’s why it’s a bad idea to use one allocator for everything in existence. It’s terrible that everyone pays the cost of thread safety even for single-threaded applications - or even multithreaded applications with disciplined resource management.


While I do agree with the general sentiment, I think the default should be the safer allocators that can handle multithreaded usage. If someone wants to use a bad allocator like glibc's that doesn't handle concurrency well, they should certainly be free to switch.


So now you’re using std::vector and it’s taking a lock for no reason, even though push_back on the same vector across threads isn’t thread-safe to begin with.

Haphazard multithreading is not a sane default.

I understand a million decisions have been made so that we can’t go flip that switch back off, but we’ve got to learn these lessons for the future.


Whatever. Allocation is slow enough that I don't care about an uncontended lock. Make things work by default, and if you want to gain performance by disallowing multithreading, that should be possible and easy. But safety first; in most cases it really doesn't matter. When it comes to mainstream general-purpose allocators it isn't really a tradeoff anyhow, as all of them are nominally thread-safe.

Even glibc claims to be thread-safe, even if it tends not to return or reuse all freed memory.


This is a common pain point.

Write in a language that makes sense for the project. Then people tell you that you should have used this other language, for reasons.

Use a compression algo that makes sense for your data. Then people will tell you why you are stupid and should have used this other algo.

My most recent memory of this was needing to compress specific long json strings to fit in Dynamo. I exhaustively tested every popular algo, and Brotli came out far ahead. But that didn't stop every passerby from telling me that zlib is better.

It's rather exhausting at times...


note: in the above, I meant 'zstd', not 'zlib'.



