I wonder how this compares to jemalloc, mimalloc, snmalloc?

pizlonator · on June 1, 2022

I never got a chance to compare it to those, since I was most interested in beating bmalloc. And I mainly wanted to beat it on Safari workloads.

I believe bmalloc was previously compared against jemalloc and tcmalloc, also using Safari workloads, and bmalloc was significantly faster at the time.

IshKebab · on June 1, 2022

For other people who have never heard of bmalloc: it's a custom allocator used only by WebKit. I guess it's not surprising they added one, since the Mac system allocator is extremely slow. Custom allocators are pretty much a free 20% speedup on Mac (depending on your workload), but I found they made no difference on Linux. I haven't tried on Windows.

pizlonator · on June 1, 2022

Just some credit where credit is due.

I measured the system malloc crushing all other mallocs on some workloads and they were the kind of workloads that some folks run every day.

I measured the system malloc crushing most other mallocs on memory efficiency on most workloads. System malloc is really good at reusing memory and has very mature decommit policies that easily rival what I came up with.

So there’s that.

meisel · on June 1, 2022

What sort of workloads are those? I have yet to see a workload where macOS's system allocator is not substantially slower than alternatives like jemalloc.

pcwalton · on June 1, 2022

macOS malloc was incredibly slow for Pathfinder, far behind every other OS. Everything became bottlenecked on it. It was a free 2x speedup if not more to switch to jemalloc.

I suspect this is because Pathfinder's CPU portion is a multicore workload and the macOS allocator performs poorly under heavy multithreaded contention. It probably just isn't the kind of workload the macOS allocator was tuned for.

pizlonator · on June 1, 2022

My understanding is that the system malloc is excellent under contention, but has a high baseline cost for every malloc/free call.

I don't remember exactly which workloads it performed really well on (and if I did I dunno if I could say), but I do remember they were parallel, and the speed-ups got bigger the more cores you added.

Everything else about your experience matches mine. Libpas is much faster than system malloc in WebKit and JSC. The difference isn't 2x on my preferred benchmarks (which are large and do lots of things that don't rely on malloc), but it is easily more than 2x on smaller benchmarks. So your 2x result sounds about right.
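
The gap between large and small benchmarks is what Amdahl's law predicts: the overall speedup depends on what fraction of runtime is spent in malloc/free. A rough sketch of that arithmetic follows. The function name, the malloc shares, and the 4x allocator speedup are all made-up illustrative values, not measurements from WebKit, JSC, or Pathfinder.

```python
def overall_speedup(malloc_fraction: float, malloc_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only the malloc share gets faster.

    malloc_fraction: share of total runtime spent in malloc/free (0.0 to 1.0).
    malloc_speedup:  how many times faster the new allocator makes that share.
    """
    if not 0.0 <= malloc_fraction <= 1.0:
        raise ValueError("malloc_fraction must be between 0 and 1")
    if malloc_speedup <= 0.0:
        raise ValueError("malloc_speedup must be positive")
    # Time outside malloc is unchanged; time inside malloc shrinks by malloc_speedup.
    return 1.0 / ((1.0 - malloc_fraction) + malloc_fraction / malloc_speedup)


# Hypothetical malloc shares: a big benchmark that mostly does other work,
# a mixed one, and a small malloc-heavy one. The 4x figure is an assumption.
for fraction in (0.1, 0.5, 0.8):
    print(f"malloc share {fraction:.0%}: {overall_speedup(fraction, 4.0):.2f}x overall")
```

Under those assumed numbers, a program spending 10% of its time in malloc gains only about 1.08x, while one spending 80% there gains 2.5x, so the same allocator swap can look minor on one benchmark and like "more than 2x" on another.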

astrange · on June 1, 2022

Alas, many people think performance engineering is only about the wall clock time of their program and not what impact it has on anything else.

malloc performance depends heavily on how large the allocations are, though.

KerrAvon · on June 1, 2022

What's the workload where you found that to be the case? The system allocator is heavily tuned for general use, and should be very difficult to beat -- generally.

IshKebab · on June 1, 2022

It was a non-traditional compiler written in C++.

> The system allocator is heavily tuned

Perhaps but it probably has more constraints than a custom allocator too, e.g. backwards compatibility with software that unwittingly depends on its exact behaviour.

I found a discussion of this where people reported various speedups (10%, 100%, 300%) on Mac by switching to a custom allocator: https://news.ycombinator.com/item?id=29068828

I'm sure there are better benchmarks if you Google it.

kookamamie · on June 2, 2022

tcmalloc and mimalloc are the ones I'd be interested in seeing comparisons to. The former is the fastest in my case, which is essentially many-core HPC.

mjp41 · on June 2, 2022

libpas is now in mimalloc-bench (https://github.com/daanx/mimalloc-bench), thanks to Julien Voisin (https://dustri.org). That makes about 20 allocators that can be compared on a collection of workloads.