All About Libpas, Phil's Super Fast Malloc

RcouF1uZ4gsC · on June 1, 2022

> Consequently, passing a function pointer (or struct of function pointers), where the pointer points to an always_inline function and the callee is always_inline results in specialization akin to template monomorphization. This works to any depth; the compiler won't be satisfied until there are no more always_inline function calls. This fortuitous development in compilers allowed me to write very nice template code in C. Libpas achieves templates in C using config structs that contain function pointers -- sometimes to always_inline functions (when we want specialization and inlining) and sometimes to out-of-line functions (when we want specialization but not inlining). Additionally, the C template style allows us to have true polymorphic functions. Lots of libpas slow paths are huge and not at all hot. We don't want that code specialized for every config. Luckily, this works just fine in C templates -- those polymorphic functions just pass around a pointer to the config they are using, and dynamically load and call things in that config, almost exactly the same way that the specialized code would do. This saves a lot of code size versus C++ templates.

I thought this was super interesting. Seems like a very nice technique for when you use function pointers in C for stuff like sort comparison. There was an earlier thread about PostgreSQL that discussed how they created multiple sort functions to avoid the overhead of a indirect function pointer call when sorting things like integers where the function call dwarfs the comparison cost. This seems like a technique that could have been used to avoid duplicating the sorting functions while still getting the benefits.

thrtythreeforty · on June 1, 2022

I still don't quite understand how the technique saves code size versus C++ templates, although I am quite eager to understand it since template instantiation is a thorn in my side when writing code for a microcontroller with 32KB of code space.

Is the idea that the struct contains pointers to always_inline functions, and that the compiler has visibility into this at the top of the always_inline call stack, so the always_inline gets monomorphized even through multiple layers of pointer indirection? If I'm understanding it right, that's a pretty nice technique. Would this allow you to mark a "class" (or individual members of that class) as "hot, please monomorphize" by making it always_inline, or "cold, please use regular pointers" by omitting always_inline?

C++ templates would force you to monomorphize every variant, or use vtables (function pointers) for every variant, without the ability to choose the approach for each subclass. The compiler is also spectacularly bad at recognizing deduplication opportunities between template specializations.

pizlonator · on June 1, 2022

Say you have:

    template<typename T>
    void foo(T& value) { value += 42; }

There's no easy way to tell the C++ compiler, "please make only one version of foo and use callbacks to figure out what it means to += on T".

In libpas, this would be done by first creating:

    always_inline void foo(void* value, void (*add_int)(void* value, int addend))
    {
        add_int(value, 42);
    }

And then you could also create:

    never_inline void foo_outline(void* value, void (*add_int)(void* value, int addend)) { foo(value, add_int); }

Now, if you want monomorphization like C++, call foo() and pass a literal for add_int. But if you don't, then call foo_outline.

Say you have 1000 callsites to foo(), they all use different types, and 5 of those callsites are hot while the other ones are hardly ever called (maybe they only run in unusual-but-necessary situations). So, the 5 hot callsites can call foo(), and the remaining 995 cold ones cal call foo_outline().

thrtythreeforty · on June 1, 2022

> There's no easy way to tell the C++ compiler, "please make only one version of foo and use callbacks to figure out what it means to += on T".

Partial counterpoint: you could use class polymorphism, if T is always a type you control. But you're right in general; C++ doesn't have typeclasses or some other way to create an ad-hoc vtable for, say, int.

> Now, if you want monomorphization like C++, call foo() and pass a literal for add_int. But if you don't, then call foo_outline.

Does this work through structs of function pointers? Is that the reason it's so powerful?

For example, a my_class constructor create_my_class makes the class point to foo_outline, unless it's known to be a hot type, then it points to foo. When you call create_my_class, would the compiler see the values of the pointers, and start inlining foo calls, including if you pass the my_class struct into foo as "this", continuing the inlining?

pizlonator · on June 1, 2022

Yes, it works with structs of function pointers. And it works recursively.

This works (this is slightly shorthand C, fill in the blanks yourself):

    struct config {
        void (*foo)(things bar);
        stuff (*bar)(int baz);
    };
    always_inline stuff doit(config c)
    {
        c.foo(whatever);
        return c.bar(42);
    }
    always_inline void my_foo(...) { ... }
    always_inline stuff my_bar(...) { ... }
    stuff dostuff(void)
    {
        config c;
        c.foo = my_foo; 
        c.bar = my_bar;
        return doit(c);
    }

This'll all get inlined and specialized to death.

thrtythreeforty · on June 1, 2022

Got it. That is definitely a cool insight. Thanks for sharing!

verdagon · on June 1, 2022

I think that's right.

C++ binaries tend to be pretty large because they use templates/monomorphization a lot, even when there aren't really meaningful performance gains. For example, it often doesn't make sense to have 20 instantiations of std::vector, when only two of them are used in the critical path. Rust can be even further in this unfortunate direction because of the borrow checker's restrictions around polymorphism and abstraction.

In comparison, C, Java, and Javascript tend to rely on dynamic dispatch more often (not that C has any built-in dynamic dispatch features, but in practice, C programs tend to use function pointers manually).

A few languages like C# have a really nice balance, and monomorphize in the JIT at run-time.

With this technique they mention, C is the only mainstream language I know of that lets you control it in a decoupled way, which is really cool.

RcouF1uZ4gsC · on June 1, 2022

> Would this allow you to mark a "class" (or individual members of that class) as "hot, please monomorphize" by making it always_inline, or "cold, please use regular pointers" by omitting always_inline?

From my understanding of the technique, this is correct.

verdagon · on June 1, 2022

IIRC, the branch predictor can work with function pointers as well. It can sometimes mitigate problems like this, though not as completely as just monomorphizing away the run-time branching/calling like they did.

ridiculous_fish · on June 1, 2022

Hey Phil, thanks for the write up! Curious why this is only used for JSC's JIT, and not to back its heap? Did the design of the JSC GC influence libpas at all?

pizlonator · on June 1, 2022

It's used for JSC's JIT and for all heap allocations in JavaScriptCore and WebKit that aren't GC allocations.

Riptide (the JSC GC) has a similar design and some of the key ideas were first pioneered there:

- The use of fast bitvector searches to find an eligible page (Riptide calls it first eligible block, I think).

- The use of bitvector simd to change object states in bulk.

- Riptide uses a mark bitvector that is indexed by minalign atom (16 bytes) and a block size of 16KB. Libpas's default "small segregated" configuration is 16 byte minalign, 16KB block, and an alloc bitvector indexed by minalign.

I have since thought about how Riptide could be wired up to use libpas. It's possible to do it. Already today you could create a libpas page_config that leaves room for mark bits in the page header, and makes them use the same indexing as the alloc bits. Riptide has a "newly allocated" bitvector that serves almost the same purpose as the alloc bits (kinda, if you squint really hard) - so you'd go from Riptides newly_allocated and mark bitvectors, to libpas's alloc bitvector and a custom mark bitvector stuffed into page headers using an exotic page_config. If JSC did this, it would buy:

- Better decommit policies than Riptide. I did a lot of tuning in libpas to make decommit Just Right (TM).

- Faster allocation of medium-sized objects.

- Possibly better memory efficiency.

But, Riptide is already very nicely tuned, and it gets some benefit from libpas's decommit rules because Riptide calls fastMalloc to get memory and sometimes uses fastFree to return it. fastFree calls bmalloc::api::free, which calls libpas's bmalloc_deallocate, which then goes to pas_deallocate with the BMALLOC_HEAP_CONFIG. So, Riptide already gets some of the goodness of libpas by sitting on top of it.

ryanianian · on June 1, 2022

Are there any projects that surgically augment the memory APIs to be more cooperative? Sure `malloc` implementations will make various tradeoffs, but what if the malloc/free APIs were expanded to expose more programmer-intent or cooperative defrag, etc?

- Perhaps `malloc` could ask for an intended lifecycle (think GC generation) that could swap arenas or other algo internals. This opens us up to meta-allocators.

- Perhaps program lifecycle hooks that give allocators more context into how already-allocated memory intends to be accessed. ("Done bootstrapping the server, so now new memory will be accessed by single threads and only live for one request", etc)

- Perhaps the programmer could periodically call `remalloc` to allow objects to be moved or promoted between arenas/generations.

- Perhaps mallocs/frees could be more intelligently batched (where explicit structs/arrays don't normally make sense) for surprise-free defrag or OS allocation.

- Intelligent pairing of malloc/free with explicit notions of intended re-use; some indication of "I will free this and ask for the same memory again in the same call" or auto-struct-ing mallocs which know that we tend stack malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual OS/memory involvement whatsoever.

- Perhaps `identmalloc` could handle common flyweight patterns of instantiate-unless-exists; e.g. intelligently turn shared state into offsets into a contiguous arenas. This may be more useful in C++ with existing conventions around copy-by-value constructs.

Some of these are indeed components of existing mallocs/allocators (e.g. libpas is type-aware), but my point is that allocators have to jump through hoops to derive heuristics, and these heuristics often amount to "the programmer often intends to use memory like X." Performance engineers then pattern-match for X rather than cooperating with the compiler to expose X intrinsically from the beginning. The above ideas impose slightly different programming paradigms, so there's still the white wale of a perfect drop-in heuristic, but it seems less subject to temporary or local maxima.

So: Are there compelling (albeit slightly less general-purpose) alternatives where the allocator and programmer communicate additional hints or other API contracts that allow the allocator to move away from heuristics and into proven patterns? Or is this basically akin to "use c++/smart pointers or just get a gc"?

pizlonator · on June 1, 2022

I've thought about this a lot. I think that overall, malloc/free/new/delete are already so hard to use that if you added more stuff, it would create too much cognitive load for the programmer. The result would be that the hints the programmer gave you would be more wrong than the malloc's best guess.

Let me go through these point by point and offer some thoughts.

- Perhaps `malloc` could ask for an intended lifecycle (think GC generation) that could swap arenas or other algo internals. This opens us up to meta-allocators.

It could, and if that information was accurate, then it would be profitable. But if that information was inaccurate, then you'd get worse perf and worse memory usage than if you let malloc just do whatever it wants. Also, anything like this creates more fragmentation since it creates a new opportunity for the programmer to force the allocator to pick something other than the first fit.

- Perhaps program lifecycle hooks that give allocators more context into how already-allocated memory intends to be accessed. ("Done bootstrapping the server, so now new memory will be accessed by single threads and only live for one request", etc)

If you convey this information after you've already allocated the object then there isn't much the malloc could do. If the malloc uses this information to decide where the next malloc call puts memory then you've created a fragmentation risk.

- Perhaps the programmer could periodically call `remalloc` to allow objects to be moved or promoted between arenas/generations.

Yes, they could, and this is a great idea. I believe that calling realloc (or just mallocing a new object, memcpying or copy-constructing, and freeing the old one) relieves fragmentation because it gives the malloc an opportunity to put the object in a more profitable location.

- Perhaps mallocs/frees could be more intelligently batched (where explicit structs/arrays don't normally make sense) for surprise-free defrag or OS allocation.

libpas batches allocation and deallocation. These strategies help performance. They have only a small effect on memory usage (and that effect is that you use a bit more memory by batching).

- Intelligent pairing of malloc/free with explicit notions of intended re-use; some indication of "I will free this and ask for the same memory again in the same call" or auto-struct-ing mallocs which know that we tend stack malloc(foo);malloc(bar);free(bar);free(foo). Avoid any actual OS/memory involvement whatsoever.

Allocators like libpas (and mimalloc and many others) have very cheap malloc/free calls that perform great in these situations. They don't do any locking or call into the OS in the common case when you do this.

- Perhaps `identmalloc` could handle common flyweight patterns of instantiate-unless-exists; e.g. intelligently turn shared state into offsets into a contiguous arenas. This may be more useful in C++ with existing conventions around copy-by-value constructs.

I think that most fast mallocs like libpas (and others) make this situation cheap enough that you don't need anything else.

- Some of these are indeed components of existing mallocs/allocators (e.g. libpas is type-aware), but my point is that allocators have to jump through hoops to derive heuristics, and these heuristics often amount to "the programmer often intends to use memory like X." Performance engineers then pattern-match for X rather than cooperating with the compiler to expose X intrinsically from the beginning. The above ideas impose slightly different programming paradigms, so there's still the white wale of a perfect drop-in heuristic, but it seems less subject to temporary or local maxima.

If the program is simple enough that you understand the object lifetimes completely, or you have enough resources to throw at the problem that you can completely understand lifetimes even for a complex program, then the best path forward is to just not call malloc. Lay out your memory a priori or write a highly customized "allocator" (that may not even have an API that looks anything like malloc).

If the program is too complex for such reasoning, then I believe there's no way for the programmer to tell the malloc anything it can't deduce with heuristics. The programmer can say some stuff, but when you have a large sophisticated heap, then the programmer will not be able to predict what it is about the heap that creates pain for the malloc. In most experiments where I've tried to do this kind of tuning, I end up with something that runs slower or uses more memory, and the investigation usually leads me to learn that my hypothesis about the malloc's pain point was totally wrong.

- So: Are there compelling (albeit slightly less general-purpose) alternatives where the allocator and programmer communicate additional hints or other API contracts that allow the allocator to move away from heuristics and into proven patterns? Or is this basically akin to "use c++/smart pointers or just get a gc"?

Maybe! I'm not proving a negative here. But everything I've tried in this area has been a negative result for me.

twoodfin · on June 1, 2022

I've thought about this a lot. I think that overall, malloc/free/new/delete are already so hard to use that if you added more stuff, it would create too much cognitive load for the programmer. The result would be that the hints the programmer gave you would be more wrong than the malloc's best guess.

I think this is a generally applicable lesson. Programmers overestimate their ability to statically optimize the dynamic behavior of their code, especially relative to a runtime that can access and act on dynamic state.

This comes up in the database world as application developers desiring to "pin" certain tables in memory. Experience has shown this to be an anti-feature: As the schema and workload of a database system evolves, these pinning decisions can become obsolete, starving more relevant tables of buffer space. Dynamic decisions, even made by simple heuristics like LRU, are an empirically winning strategy.

The failure of Itanium/EPIC and the "sufficiently smart compiler" to beat traditional OoO execution is another example from a different computing domain.

titzer · on June 3, 2022

I'm even more bullish on computers assisting humans in memory allocation than ever. I think we as humans should just write programs and assume an unboundedly intelligent compiler will study our code, determine what if any lifetimes are clear in the code from usage (e.g. basically type-inferring ownership, rather than type-checking it), do trial region allocation, with dynamic profiling and adaptive optimization, and GC the rest with the best collector based on dynamic tuning.

Basically, just write assuming GC, then whereever a hint could possibly exist, throw computers at it. If anything, such hints should live outside the source code and have no semantic impact (i.e. program works fine if you throw it out).

irogers · on June 2, 2022

Not having cognitive load motivates something like the LLAMA work: https://research.google/pubs/pub49008/

dragontamer · on June 1, 2022

General purpose memory-allocation is a lie. Everything has use cases where they're faster than other libraries.

I think that's why we keep seeing newer malloc schemes pop up, because the performance of the heap 100% depends on the use-case, and different people have different use cases.

Still, studying everyone else's heaps (and garbage collectors, a closely related discussion) is probably good for high-performance programmers.

pizlonator · on June 1, 2022

Yeah, you're totally right.

Libpas beats other mallocs in WebKit. I also had benchmarks involving non-WebKit workloads and the results were all over the place (sometimes slower than other mallocs by a lot, sometimes faster by a lot - same thing with memory, sometimes more efficient, sometimes less). This didn't surprise me; I've seen this before when writing memory management code.

I think it makes sense for large software projects that use malloc a lot and care about perf to eventually either write their own malloc or to take an existing one and then tune it a lot.

xenadu02 · on June 2, 2022

We should have libmetamalloc that tracks a history of program invocations using the actual workload on the actual machine. Cycle through different malloc implementations for each execve(). After gathering enough statistical data select the optimal implementation. The next step would be a basic ML model that looked at a few variables like time of day, args, etc to determine when to switch allocators.

If an OS used such a thing by default it would figure out that it should use libpas on the programs that were faster in your tests. Since most programs have zero effort put into optimizing allocators (or much of anything else) it would likely be a win even given the complexity. Many things are branch predictors if you squint!

Note: Not even I can tell if I'm joking or serious with this comment.

cmrdporcupine · on June 1, 2022

100%, and also I think there's a tendency among people who compare 'higher up the stack' in GC'd languages to manual heap allocation in C/C++ etc. in a way that implies the the latter is almost "free" from a performance POV when in fact underneath the covers in these allocators there's still a lot of the same kind of walking of datastructures that also happens inside a tracing GC.

You can get unpredictable pauses from malloc/free, too.

asveikau · on June 1, 2022

I think in either case, GC or not, you write something intuitively and then when it becomes an actual problem, you study the patterns and improve them, but most of the time you can leave it alone and the general purpose thing is good enough.

Or if you are in a niche like audio/video or something, you avoid allocations all together during the bulk of the code.

cmrdporcupine · on June 1, 2022

Oh for sure. Premature optimizing is usually bad form.

What is tricky about performance issues in allocation is that they can be hard to profile. There are tools for analyzing GC performance in GC'd languages, but sometimes malloc/free can just be a big black box.

djmips · on June 1, 2022

Premature optimization is not bad form when it's re-framed as good architecture. So it's not 'usually bad form' to architect something from the outset, using your experience, and that's something everyone understands. This pervasive disdain for premature optimization leads to bad architecture that often leads to expensive rewrites. So because the word optimization is so overloaded and treated with disdain it feels like we need a new language to talk about what's really meant by 'premature optimization' ( the bad kind)

asveikau · on June 1, 2022

That's one reason I point out a few comments above that certain niches will need to take it into account ahead of time. Eg. An a/v application will typically allocate all buffers up front and re-use them frequently rather than return them to the allocator. A lot of server applications will want to keep per-client memory usage low. For general purposes, there's the general purpose allocator.

djmips · on June 2, 2022

Very true. Also video games, and particularly rendering.

latenightcoding · on June 1, 2022

I've seen a few opensource projects archive their custom allocators because they were not beating the system's one or jemalloc. So if you have the skill and time then yeah go for it, else stick with general purpose ones.

dragontamer · on June 1, 2022

Its very easy to beat the general purpose ones if you know your exact use case.

Ex: 16-bit pointers is a 65536-sized heap. Assume 8-bytes per element, that's 512KB of space. A bit small, but large enough to so a lot of things.

65536 elements can be represented as a bitmask. The bitmask only takes up 8192-bytes (8KB), which fits inside of 16 AVX512 registers (Intel offers 32x AVX512/ZMM registers btw). Or it fits inside of GPU __shared__ memory.

If you need multithreaded, you perform atomic AND and atomic OR to clear, and set, the bitmask as appropriate.

-------

Do you know the exact size of your heap? The size of the pointer? The amount of parallelism involved? What about the size of the elements? Access pattern? (Bump-alloc'd Linked Lists are a sequential traversal over RAM btw, so that's very efficient), etc. etc.

The 64-bit pointer is overkill for most people's purposes. 32-bits represents 4-billion objects, and if each object is 16-bytes long that's 64GBs of RAM. Right here right now, a 32-bit pointer / custom allocator already offers many benefits over the 64-bit pointer. (half-sized pointers, more data in cache, etc. etc.)

thesz · on June 1, 2022

512 in AVX512 is the number of bits per register. You are off by factor of 8.

dragontamer · on June 2, 2022

Agreed. Thanks for pointing out the mistake.

mananaysiempre · on June 1, 2022

There are a couple of approaches that might legitimately be simpler than using a general-purpose allocator: in a single-pass batch process, just throw things on the floor and let the OS sort it out on exit[1] (I think the D compiler does this); in a multiple-pass batch process, make each pass litter its own room (arena, obstack, etc.) then demolish said room when done (GCC does this or at least did in the past).

On the other hand, these may require rearranging the logic somewhat to fit them, so the question of how much rearrangement to tolerate still remains. And, well[4],

  /*
   * We divy out chunks of memory rather than call malloc each time so
   * we don't have to worry about leaking memory.  It's probably
   * not a big deal if all this memory was wasted but if this ever
   * goes into a library that would probably not be a good idea.
   *
   * XXX - this *is* in a library....
   */

[1] I am rather dismayed by how Raymond Chen advocates for this for memory[2] but insists it could not possibly be a good idea for Windows GDI handles, no, you stupid sloppy programmer[3]. Maybe because he was involved in GDI from the other side? (Of course, for file descriptors on Unix it’s still the standard practice.)

[2] http://bytepointer.com/resources/old_new_thing/20120105_006_...

[3] http://bytepointer.com/resources/old_new_thing/20051014_305_...

[4] https://github.com/the-tcpdump-group/libpcap/blob/4c1e516dd2...

jstimpfle · on June 1, 2022

This "[2] vs [3]" is a good find. I have several possible explanations. It could be the difference of 7 years between those posts. It could also be that Windows Handles can't be cleaned up in userspace. Not how malloc chunks up the "handles" (i.e. pages / regions) that it got from Windows. This is, AFAIK, different than HANDLE's that have to be requested from the system one-by-one and can't be chunked up. (Or am I wrong? I actually don't know a lot about how this works under the hood).

tptacek · on June 1, 2022

This is a pretty well-known CS truism, isn't it? One of the first papers I ever fell in love with covers it in detail: https://users.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pd...

jeffbee · on June 1, 2022

Well the other browser's design doc is not anywhere near this detailed, but if you want to compare and contrast the overall design here is Chromium's PartitionAlloc.

https://chromium.googlesource.com/chromium/src/+/master/base...

exikyut · on June 1, 2022

A weekend project idea I just thought of, but wouldn't be able to commentate on as well as someone more experienced in memory management:

Run a bunch of memory allocator demo programs (trivial and complex) under blinkenlights (https://justine.lol/blinkenlights/), a program execution visualizer that shows how programs interact with the stack and heap as they run.

For bonus points, proceed to explain why their visualized performance is different. :)

jibcage · on June 1, 2022

Fun fact: I believe PAS stood for “Phil’s awesome shit.” Whether that’s subject to NDA is anyone’s guess :D

pizlonator · on June 1, 2022

I can neither confirm nor deny this awesome conjecture.

lidavidm · on June 1, 2022

I wonder how this compares to jemalloc, mimalloc, snmalloc?

pizlonator · on June 1, 2022

I never got a chance to compare it to those, since I was most interested in beating bmalloc. And I mainly wanted to beat it on Safari workloads.

I believe bmalloc was previously compared against jemalloc and tcmalloc, also using Safari workloads, and bmalloc was significantly faster at the time.

IshKebab · on June 1, 2022

For other people who have never heard of bmalloc - it's a custom allocator used only by WebKit. I guess it's not surprising they added one since the Mac system allocator is extremely slow. A custom allocator is pretty much a free 20% speed up on Mac (depending on your workload) but I found they made no difference on Linux. Haven't tried on Windows.

pizlonator · on June 1, 2022

Just some credit where credit is due.

I measured the system malloc crushing all other mallocs on some workloads and they were the kind of workloads that some folks run every day.

I measured the system malloc crushing most other mallocs on memory efficiency on most workloads. System malloc is really good at reusing memory and has very mature decommit policies that easily rival what I came up with.

So there’s that.

meisel · on June 1, 2022

What sort of workloads are those? I have yet to see a workload where macOS's system allocator is not substantially slower than alternatives like jemalloc.

pcwalton · on June 1, 2022

macOS malloc was incredibly slow for Pathfinder, far behind every other OS. Everything became bottlenecked on it. It was a free 2x speedup if not more to switch to jemalloc.

I suspect this is because Pathfinder's CPU portion is a multicore workload and macOS allocator performs poorly when under heavy multithreaded contention. It probably just isn't the kind of workload that macOS allocator was tuned for.

pizlonator · on June 1, 2022

My understanding is that the system malloc is excellent under contention, but has a high baseline cost for every malloc/free call.

I don't remember exactly what workloads it performed really great at (and if I did I dunno if I could say), but I do remember they were parallel, and the speed-ups got bigger the more cores you added.

Everything else about your experience matches mine. Libpas is much faster than system malloc in WebKit and JSC. The difference isn't 2x on my preferred benchmarks (which are large and do lots of things that don't rely on malloc), but it is easily more than 2x on smaller benchmarks. So your 2x result sounds about right.

astrange · on June 1, 2022

Alas, many people think performance engineering is only about the wall clock time of their program and not what impact it has on anything else.

malloc performance highly depends on how large the allocations are though.

KerrAvon · on June 1, 2022

What's the workload where you found that to be the case? The system allocator is heavily tuned for general use, and should be very difficult to beat -- generally.

IshKebab · on June 1, 2022

It was a non-traditional compiler written in C++.

> The system allocator is heavily tuned

Perhaps but it probably has more constraints than a custom allocator too, e.g. backwards compatibility with software that unwittingly depends on its exact behaviour.

I found a discussion of this where people reported various speedups (10%, 100%, 300%) on Mac by switching to a custom allocator: https://news.ycombinator.com/item?id=29068828

I'm sure there are better benchmarks if you Google it.

kookamamie · on June 2, 2022

tcmalloc and mimalloc are the ones I'd be interested seeing comparisons to. The former is the fastest in my case, which is essentially many-core HPC.

mjp41 · on June 2, 2022

libpas is now in mimalloc-bench (https://github.com/daanx/mimalloc-bench). Thanks to Julien Voisin (https://dustri.org), which makes about 20 allocators that can be compared on a collection of workloads.

carlsborg · on June 1, 2022

Trying to understand the license for bmalloc, and webkit in general. The WebCore folder has an Apple license and two LGPL licenses. The root folder has no license file. Can anyone comment?

pizlonator · on June 1, 2022

libpas is BSD 2-clause. so is bmalloc. libpas is in the bmalloc directory in WebKit, but is really a separate piece of code.

each file has its own license header.

donfuzius · on June 1, 2022

This looks awesome and I really like the tone of the documentation. Great work! At the same time I'm sort of worried, that such a deep change in the memory management of my day-to-day browser opens up a significant security risk for a considerable time, until new bugs are found and closed. I'm sure this got more than one pair of eyes at apple, but I'd really welcome a specific bug bounty on such a change...

pizlonator · on June 1, 2022

That's a fair point. Code churn creates bugs. That's a good argument for slowing down the pace of browser development overall, except maybe the kind of development that focuses on security hardening.

Libpas has some security features that bmalloc didn't have. It goes further than bmalloc in avoiding inline metadata (bmalloc had an IsoHeap<> thing that used scrambled inline freelists). Also, libpas opens the door to isoheaping all allocations in WebKit (see https://bugs.webkit.org/show_bug.cgi?id=231938), which would be a big security win.

Also, libpas has been fuzzed to death. In test_pas, there's a bunch of fuzz tests I call "chaos" and I've run those for extended periods of time (days, weeks) after major changes. Of course it still has bugs, but considering how insane the heap organization is, libpas also makes attackers' lives much harder in the short term.

The combination of libpas's enhanced security features and the fact that it substantially changes heap layout might be enough to cancel out the fact that it has increased bug tail. And... if you're worried about bug tail from major changes introducing security vulns then there are probably much worse major changes that someone could do.

meisel · on June 1, 2022

Will this have support for macOS by inserting itself, as jemalloc does, as the default malloc zone? If so, does that mean it will be very fast at determining if a given pointer was allocated by libpas?

pizlonator · on June 1, 2022

libpas can do that, though some of the code to do it is not in the WebKit repo (but the enumerator support - a complex component that is necessary to fully replace system malloc - is in the repo).

libpas can already determine if it owns a pointer in O(1) time and no locking. It uses megapages for this purpose. Though, right now, it's not configured to fully leverage this (if you pass an unknown ptr to free then as a last step it'll lock the heap_lock even though it would be a small change to just use megapages there).

mattgreenrocks · on June 1, 2022

I admit I'm a sucker for metacircularity, but have to ask:

> The bootstrap heap has hacks to allow itself to allocate that array out of itself.

Not quite following that, can someone elaborate?

pizlonator · on June 1, 2022

The bootstrap heap has a freelist that is just an array of entries that tell you where the free memory is.

That array has to be allocated somewhere.

All pas_simple_large_free_heaps allocate that array from the bootstrap heap. Including the bootstrap heap. So, the pas_simple_large_free_heap has hacks that go something like: "if I'm allocating or freeing something and I need to allocate or free the free list, then very carefully also perform that allocation/deallocation in tandem with the one I'm doing". The main hack there is just that there's some statically allocated "slop" for the boostrap free list, so if you want to add to it in the middle of an allocation, you add it to the slop array, and then after you're doing you reallocate the free list and copy stuff from slop into it.

mattgreenrocks · on June 2, 2022

Ahhhhhh, makes sense now. Thank you!

convolvatron · on June 1, 2022

I suspect its that - generally when you write a heap it needs to keep some metadata. if you have a multi-heap system (really handy), then its generally easier to host that metadata on another heap. sometimes you really want that, because the target heap might be specialized for different kinds of data. so I assume this is carefully constructing the heap state so that it can be used for its own metadata as its being constructed

rurban · on June 2, 2022

pas stands for "Phil's Awesome System" though, not Phil's Superfast AS

pizlonator · on June 2, 2022

At some point I started telling people that the acronym stood for a different thing every time I talked about it. There is a place where I called it “Phil’s Awesome System”.

sydthrowaway · on June 1, 2022

Did the author leak this IP when he left Apple?

pizlonator · on June 1, 2022

lol

WebKit is open source. Libpas was always designed with WebKit in mind. It landed in WebKit half a year before I left Apple.

rq1 · on June 1, 2022

> Consequently, passing a function pointer (or struct of function pointers), where the pointer points to an always_inline function and the callee is always_inline results in specialization akin to template monomorphization. This works to any depth; the compiler won't be satisfied until there are no more always_inline function calls. This fortuitous development in compilers allowed me to write very nice template code in C. Libpas achieves templates in C using config structs that contain function pointers -- sometimes to always_inline functions (when we want specialization and inlining) and sometimes to out-of-line functions (when we want specialization but not inlining). Additionally, the C template style allows us to have true polymorphic functions. Lots of libpas slow paths are huge and not at all hot. We don't want that code specialized for every config. Luckily, this works just fine in C templates -- those polymorphic functions just pass around a pointer to the config they are using, and dynamically load and call things in that config, almost exactly the same way that the specialized code would do. This saves a lot of code size versus C++ templates.

This document is incredible!

photochemsyn · on June 1, 2022

Seems like a real tour de force of low-level memory management. Prior to seeing this my go-to for understanding custom low-level (for embedded) memory managers was this (explains the basics of stack, block, bitmap and thread-local allocation) but now I have something far more complicated to get confused about.

https://www.embedded-software-engineering.de/dynamic-memory-...

pizlonator · on June 1, 2022

Thank you! :-)

I always wanted to write a malloc, and this is what I came up with after 3.5 years or so.

adamnemecek · on June 1, 2022

I'm always suspicious of things that are named after the maker.

thrtythreeforty · on June 1, 2022

Like Linux, or gcc.godbolt.org?

I kid, but I don't see it as a downside for software. Naming a scientific phenomenon after yourself is a red flag though, see also the Crackpot Index [1].

[1]: https://math.ucr.edu/home/baez/crackpot.html

tialaramex · on June 1, 2022

Godbolt says Compiler Explorer right there, the fact people call it "Godbolt" is because that's the memorable name rather than because Matt Godbolt tried to name it after himself. He calls it Compiler Explorer.

You can't control what people call things. My favourite example is Blackfriars Bridge in London. When that bridge was built, it was formally named after William Pitt. But like, who cares? In what way is this bridge, Pitt's bridge? People knew where it was, the Blackfriars area of London, so too bad William Pitt, everybody called it Blackfriars Bridge. And so, that's what its name is.

The priory which is why "Blackfriars" got the name hadn't existed for two hundred years by the time the bridge was proposed. Nobody living when the bridge was finished would have ever known anybody who'd met any of the friars during their lifetime. Nevertheless that part of London had "always" been called Blackfriars, and so a bridge and a railway station built many years later are named that too.

bool3max · on June 1, 2022

People call it Godbolt because you access it via godbolt.org,

pizlonator · on June 1, 2022

It's called libpas, and pas stands for whatever you like.

fmajid · on June 1, 2022

Did you know the entire US Air Traffic Control system runs on long-discontinued computers and written in JOVIAL, "Jules' Own Version of the International Algebraic Language" (i.e. ALGOL), named after Jules Schwartz?

https://en.wikipedia.org/wiki/JOVIAL

The FAA's been trying to replce it for ages, but AFAIK it's still ongoing.

throwaway744678 · on June 1, 2022

From the top of my head: Linux, Bourne shell...

adamnemecek · on June 1, 2022

[flagged]

kbenson · on June 1, 2022

Sometimes things are named after the maker ex post facto because they need a way to distinguish it. The Bourne shell, mentioned here, is likely called that for the same reason the very first shell was called the Thompson shell, in that they were written by a single person originally and not really given a name (the original being from Ken Thompson), since they were just different (or the original) implementations of the system shell. Eventually you're going to need a way to distinguish one from the other though, right? Noting who wrote it is as good (probably better) than any other option.

Linux was probably named differently, but also probably doesn't really match whatever criteria you're using to be suspicious. It started as a hobby project to make a free and open source UNIX system, of which there were few (none?), written by one guy. There was no expectation that other people would use it for anything other than something interesting to play around with, and using a portion of your own name for your own little hobby project isn't something I would look askance at, personally, especially in an age where there was no expectation that it might grow to anything, because that was definitely not the norm at the time.

ncmncm · on June 1, 2022

"Linux" was forced on him.

astrange · on June 1, 2022

A better rule is to be suspicious of anything with “fast” or synonyms in the name. It either won’t be fast or it’ll be insecure.

akrymski · on June 1, 2022

Now if only I could limit the amount of memory a browser would consume! The fact that desktop browsers grind a 16GB machine to a halt, while on mobile devices having 1000s of tabs is a non-issue is plain wrong. Since I can't limit a process' memory on Mac OS (any OS?), why not give me a setting in preferences to do so? After all the browser is really just a VM at this stage. I'm quite certain I don't need 100 tabs in active memory at once.

While at it, why can't we hibernate tabs? Write the JSC heap and DOM state to disk, the way VMs do it?

What's taking up the majority of my memory anyway - the graphics / gpu side or the JS VM?

There's a whole spectrum of startups trying to solve the webkit memory consumption issues crudely, such as https://www.mightyapp.com

pizlonator · on June 1, 2022

I think that desktop browser users expect that their tabs generally won't get reloaded. On mobile, users have generally accepted that their tabs will get reloaded. Maybe browsers should just reload tabs if anything fishy happens or you're not using them.

Hibernating tabs to disk is hard because the JS heap (and the DOM heap) will contain lots of stuff that is tied to handles you got from the OS. It's not impossible. But it would require a lot of work to build specialized serialization functions for stuff that has native state attached to it that the browser doesn't control or have good visibility into.

I think that some websites use a lot of graphics memory while other websites use a lot of JS memory. And there are other categories of memory usage, too.

Maybe if they put the cloud browsers on the blockchain using machine learning, and then write them in Rust and compile them to wasm, then it will be even better.

akrymski · on June 1, 2022

> I think that desktop browser users expect that their tabs generally won't get reloaded

But they do get reloaded. There's absolutely no guarantee provided, and browser tabs will get swapped out of memory as necessary (eg opening lots of new tabs or apps).

This is easily solvable though: let me pin tabs that I really need to persist.

pizlonator · on June 1, 2022

Sure they do, but much less aggressively than mobile browsers. I think that desktop Safari will try to keep your tab live so long as nothing fishy happens. Based on my experiences using Chrome, it seems to do the same thing.

tptacek · on June 1, 2022

This is a thread about an allocator design, not really about Webkit.

akrymski · on June 1, 2022

It's also a thread about memory allocation in Webkit tho.