macOS malloc was incredibly slow for Pathfinder, far behind every other OS. Everything became bottlenecked on it. It was a free 2x speedup if not more to switch to jemalloc.
I suspect this is because Pathfinder's CPU portion is a multicore workload and the macOS allocator performs poorly under heavy multithreaded contention. It probably just isn't the kind of workload that the macOS allocator was tuned for.
My understanding is that the system malloc is excellent under contention, but has a high baseline cost for every malloc/free call.
I don't remember exactly which workloads it performed really well on (and if I did I dunno if I could say), but I do remember they were parallel, and the speed-ups got bigger the more cores you added.
Everything else about your experience matches mine. Libpas is much faster than system malloc in WebKit and JSC. The difference isn't 2x on my preferred benchmarks (which are large and do lots of things that don't rely on malloc), but it is easily more than 2x on smaller benchmarks. So your 2x result sounds about right.
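If anyone wants to tell the two explanations apart (high baseline cost per malloc/free call vs. poor behavior under contention), a rough microbenchmark is one way to do it. The sketch below is my own illustration, not anything from Pathfinder or WebKit: the `churn` helper, the 64-byte size, and the thread counts are all arbitrary assumptions. If the per-call time is already high with 1 thread and stays roughly flat as threads are added, that points at baseline cost; if it climbs with thread count, that points at contention. It only exercises small same-thread alloc/free pairs, so treat the numbers as a hint rather than a verdict.

```rust
use std::hint::black_box;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: each of `threads` threads performs `iters`
/// allocate/free pairs of `size` bytes using the global allocator.
/// Returns the wall-clock time until all threads have finished.
fn churn(threads: usize, iters: usize, size: usize) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..iters {
                    // black_box discourages the optimizer from eliding the allocation.
                    let v: Vec<u8> = black_box(Vec::with_capacity(size));
                    drop(v);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    let iters = 1_000_000;
    for threads in [1, 2, 4, 8] {
        let elapsed = churn(threads, iters, 64);
        // Every thread does `iters` pairs, so this is the per-thread cost per pair.
        let ns_per_pair = elapsed.as_nanos() as f64 / iters as f64;
        println!("{threads} threads: {ns_per_pair:.1} ns per alloc/free pair");
    }
}
```

Build with `--release`, and run it once with the system allocator and once with jemalloc installed as the `#[global_allocator]` to compare the two curves.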