The man page explains several of the differences (http://www.canonware.com/download/jemalloc/jemalloc-latest/d...). These include a different API (flags you can pass to explicitly control alignment, zero-initialization, and so on), reduced use of sbrk in favor of mmap, multiple arenas for allocations of different sizes, per-CPU arenas to eliminate lock contention, etc.
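For a concrete taste of that API, here's a minimal sketch using jemalloc's *allocx entry points (mallocx, dallocx, and the MALLOCX_ALIGN/MALLOCX_ZERO flags from that man page); it assumes jemalloc is installed and you link with -ljemalloc:

    /* Sketch: request aligned, zeroed memory in a single call. */
    #include <jemalloc/jemalloc.h>

    int main(void) {
        /* 4096 bytes, 64-byte aligned and zero-initialized, instead of
           posix_memalign() followed by memset(). */
        void *p = mallocx(4096, MALLOCX_ALIGN(64) | MALLOCX_ZERO);
        if (p != NULL)
            dallocx(p, 0);  /* flags-aware free */
        return 0;
    }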
Mozilla has been using jemalloc since the release of Firefox 3 in 2008, primarily because having multiple arenas for allocations of different sizes means that you end up with less fragmentation over time. Fragmentation happens when you deallocate ten 100-byte objects that are scattered around in memory, then need to allocate one 1000-byte object. You can't reuse those ten 100-byte empty spots, because the 1000-byte allocation has to be contiguous. This means that your total memory usage has to grow. jemalloc puts small allocations into buckets where they'll be closer together, making it more likely that new allocations will be able to fill the holes left by deallocations, and that freeing small objects will eventually empty a memory page, allowing you to return the page to the OS.
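To make the bucketing concrete, here's a toy sketch of size-class rounding (the classes below are made up for illustration, not jemalloc's real tables). Because a freed 100-byte slot and a new 100-byte request round to the same class, new allocations can drop straight into old holes:

    #include <stddef.h>

    static const size_t size_classes[] = { 8, 16, 32, 48, 64, 96, 128, 192, 256 };

    /* Round a request up to its size class; all blocks in a class are
       interchangeable, so freed slots can actually be reused. */
    static size_t round_to_class(size_t n) {
        for (size_t i = 0; i < sizeof(size_classes) / sizeof(size_classes[0]); i++)
            if (n <= size_classes[i])
                return size_classes[i];
        return 0;  /* larger requests go to a separate large-object path */
    }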
I remember that we were all pretty chuffed at the improvements in Firefox's memory usage; it was quite significant and would surely reduce the general perception that Firefox was a memory hog. Just using jemalloc reduced memory usage by 22% (http://blog.pavlov.net/2008/03/11/firefox-3-memory-usage/ has a really amazing graph comparing FF2 and FF3 to IE7).
Rust uses jemalloc for the same reasons; there's just no technical reason not to use it.
No - it's not the job of the compiler to choose your memory allocator. There is talk, however, of replacing the current glibc allocator with jemalloc[1], which is the more appropriate place for that decision to be made.
To me, it seems like the most sensible thing would be to make the choice of memory allocator a global OS configuration parameter—like enabling DEP on Windows. It'd still only be the program (or language runtime) deciding for itself whether to obey the allocator suggestion—but having one system-level place for this kind of policy information to be stored, where glibc et al could look for it, would make a lot more sense.
No thanks; that'd be worse. Then you'd have to test every release of your program with every allocator in combination with every other configuration change that could be made.
Besides, different applications work best with different allocators so it's definitely a choice the author of a program needs to make.
Probably the fact that it eagerly coalesces newly freed blocks, where high-performance allocators like tcmalloc and jemalloc use slabs of fixed-size blocks which are never coalesced. Coalescing pulls adjacent blocks into cache, and it probably implies doubly linked lists where slab allocators can get away with cache-friendlier single links or bitmaps, etc. (because you probably have to remove a block from the middle of a free list if you've just coalesced it with a newly freed block). It turns out cache footprint is key for speed and scalability in allocators, so that's a big problem.
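To illustrate (a toy sketch, not tcmalloc's or jemalloc's actual code): a slab free list can push and pop while touching only the block itself, which is exactly what keeps the cache footprint small:

    #include <stddef.h>

    struct slab {
        void  *free_head;   /* singly linked list threaded through free blocks */
        size_t block_size;  /* every block in this slab is the same size */
    };

    static void slab_free(struct slab *s, void *block) {
        /* No coalescing: just push. Only this block's first word is
           written, so no neighboring headers are pulled into cache. */
        *(void **)block = s->free_head;
        s->free_head = block;
    }

    static void *slab_alloc(struct slab *s) {
        void *block = s->free_head;
        if (block != NULL)
            s->free_head = *(void **)block;  /* O(1) pop, one cache line */
        return block;
    }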
I'm embarrassed to link this twice in a row now, but at one point I played around with the perf of a couple of allocators, including glibc's ptmalloc. I wound up answering exactly this question, at least for some simple use cases:
http://www.andrew.cmu.edu/user/apodolsk/418/finalreport.html
This stuff is out of the scope of my knowledge, but I do know that allocators such as OpenBSD's are slower due to security hardening. jemalloc is not known for its security; instead it seems to be focused on performance. Security features such as ASLR, not leaving old memory around, zeroing out memory on release, etc., are not things that improve an allocator's performance.
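A rough sketch of where that cost comes from: a hypothetical hardened_free() (my name, not any real allocator's API) that scrubs a block before recycling it pays a full pass over the memory that a performance-focused allocator simply skips:

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical: scrub-on-free in the style of hardened allocators. */
    static void hardened_free(void *p, size_t size) {
        if (p == NULL)
            return;
        explicit_bzero(p, size);  /* don't leave old data for the next owner */
        free(p);
    }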
Good point. I wonder if Rust's allocator needs fewer security features because of heavy compile-time heap safety checking. For example, I don't believe use-after-free or double-free are possible in Rust, thanks to Rust's ownership and lifetime system (barring use of unsafe code blocks).
I'm not sure the point about ASLR is entirely correct; ASLR does randomize the addresses returned by mmap (slightly), and jemalloc primarily uses mmap to allocate its arenas.
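This is easy to check: the program below asks for an anonymous mapping the way jemalloc does for its arenas, and with ASLR enabled the printed address changes from run to run (Linux/BSD):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* Anonymous 1 MiB mapping, like an allocator grabbing arena space. */
        void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        printf("mmap returned %p\n", p);
        if (p != MAP_FAILED)
            munmap(p, 1 << 20);
        return 0;
    }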
No good reason why it hasn't changed, except that the underlying single-threaded allocator (Doug Lea's malloc) is really good. The multithreaded version, which is a "per-thread" (not really) wrapper around the Lea allocator, has serious problems that have been known for a long time. I wrote a paper outlining and demonstrating the problems in 2000. The inertia is difficult to explain.
The glibc allocator is based on Doug Lea's malloc, which was developed before multithreading was common. Optimizations for multiple threads and arenas were retrofitted on top of the existing code, whereas jemalloc was designed from scratch more recently for multi-threaded workloads and was able to take advantage of recent research.