I measured the system malloc crushing all other mallocs on some workloads and they were the kind of workloads that some folks run every day.
I measured the system malloc crushing most other mallocs on memory efficiency on most workloads. System malloc is really good at reusing memory and has very mature decommit policies that easily rival what I came up with.
What sort of workloads are those? I have yet to see a workload where macOS's system allocator is not substantially slower than alternatives like jemalloc.
macOS malloc was incredibly slow for Pathfinder, far behind every other OS. Everything became bottlenecked on it. It was a free 2x speedup if not more to switch to jemalloc.
I suspect this is because Pathfinder's CPU portion is a multicore workload and the macOS allocator performs poorly under heavy multithreaded contention. It probably just isn't the kind of workload that the macOS allocator was tuned for.
My understanding is that the system malloc is excellent under contention, but has a high baseline cost for every malloc/free call.
I don't remember exactly which workloads it did really well on (and if I did, I dunno if I could say), but I do remember they were parallel, and the speed-ups got bigger the more cores you added.
Everything else about your experience matches mine. Libpas is much faster than system malloc in WebKit and JSC. The difference isn't 2x on my preferred benchmarks (which are large and do lots of things that don't rely on malloc), but it is easily more than 2x on smaller benchmarks. So your 2x result sounds about right.
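For anyone who wants to poke at the "good under contention, high baseline cost per call" claim themselves, here is a rough sketch of a microbenchmark. It is not what anyone in this thread actually ran; `bench_malloc`, `worker`, and `MAX_THREADS` are made-up names, and a tight malloc/free loop of one fixed size is a big simplification of a real workload. It reports wall-clock nanoseconds per malloc/free pair as seen by each thread (thread start-up time included), so a flat number as the thread count grows suggests the allocator is scaling, and the single-thread number approximates the baseline cost.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define MAX_THREADS 64 /* arbitrary cap for this sketch */

typedef struct {
    size_t iterations;
    size_t alloc_size;
    uint64_t checksum; /* written so the loop has an observable result */
} worker_arg;

/* Each thread does `iterations` malloc/free pairs of one fixed size. */
static void *worker(void *p) {
    worker_arg *arg = p;
    uint64_t sum = 0;
    for (size_t i = 0; i < arg->iterations; i++) {
        /* volatile access keeps the compiler from eliding the pair */
        volatile unsigned char *block = malloc(arg->alloc_size);
        if (block == NULL) abort();
        block[0] = (unsigned char)i;
        sum += block[0];
        free((void *)block);
    }
    arg->checksum = sum;
    return NULL;
}

/* Returns wall-clock ns per malloc/free pair per thread,
   or a negative value on invalid input or thread-creation failure. */
double bench_malloc(int threads, size_t iterations, size_t alloc_size) {
    if (threads < 1 || threads > MAX_THREADS || iterations == 0) return -1.0;

    pthread_t tids[MAX_THREADS];
    worker_arg args[MAX_THREADS];
    struct timespec start, end;
    int created = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int t = 0; t < threads; t++) {
        args[t] = (worker_arg){iterations, alloc_size, 0};
        if (pthread_create(&tids[t], NULL, worker, &args[t]) != 0) break;
        created++;
    }
    for (int t = 0; t < created; t++) pthread_join(tids[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (created != threads) return -1.0;

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling something like `bench_malloc(1, 1000000, 64)` and then `bench_malloc(8, 1000000, 64)`, once against the system allocator and once with another allocator linked or preloaded, gives a crude comparison. Real programs mix sizes, hold allocations for varying lifetimes, and free on different threads than they allocate on, so treat the numbers as a starting point rather than a verdict.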
So there’s that.