This is caused by hyperthreading. It's not an actual inefficiency, but an artifact of the way CPU time is counted.
The HT cores aren't real CPU cores. They're just an opportunistic reuse of hardware cores when another thread is waiting for RAM (RAM is so slow relative to the CPU that threads wait a lot, for a long time).
So code on the HT "core" doesn't run all the time, only when the other thread is blocked. But the time HT threads spend waiting for their turn is included in wall-clock time, and that makes them look slow.
Yeah, the earliest attempts weren't good, but I haven't heard of any HT problems post Pentium 4 (apart from Spectre-like vulnerabilities).
I assume OSes have since developed proper support for scheduling and pre-empting hyperthreaded threads. Also, the gap between RAM and CPU speed has only gotten worse, and CPUs have grown a wider variety of internal compute units, so there's even more idle hardware to throw HT threads at.
To be honest, I don't know. My understanding of 'user' time is that it represents the sum of all CPU time spent in "user mode" (as opposed to "kernel mode"). In theory, given that understanding and perfect scaling, the user time of a multi-threaded task should roughly match the user time of a single-threaded task. Of course, "perfect" scaling is unlikely to be real, but still, you'd expect better scaling here.
If I had to guess at what's happening, it's that there's some thread pool, and at some point, near the end of compilation, only one or two of those threads are busy doing anything while the other threads sit idle. Now whether and how that "idling" gets interpreted as "CPU being actively used in user mode" isn't quite clear to me. (It may not, in which case my guess is bunk.)
Perhaps someone more familiar with what 'user' time actually means and how it interplays with multi-threaded programs will be able to chime in.
(I do not think faults have anything to do with it. The number of faults reported here is quite small, and if I re-run the build, the number can change quite a bit---including going to zero---and the overall time remains unaffected.)
Idle time doesn't count as user-time unless it's a spinlock (please don't do those in user-mode).
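To illustrate that split, here's a minimal CPython sketch (the `sleeper`/`spinner` helpers are made up for the demo): a thread that sleeps accrues no user time, while a thread that busy-waits for the same wall-clock interval does.

```python
import os
import threading
import time

def sleeper():
    time.sleep(0.5)  # blocks in the kernel: accrues no user time

def spinner(deadline):
    while time.monotonic() < deadline:
        pass         # busy-waits in user mode: user time accrues

start = os.times()
threads = [
    threading.Thread(target=sleeper),
    threading.Thread(target=spinner, args=(time.monotonic() + 0.5,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
user_burned = os.times().user - start.user
print(f"user time burned: {user_burned:.2f}s")  # roughly 0.5s, all from the spinner
```

Both threads "wait" for half a second of wall time, but only the spinner's wait is billed as user time.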
I suspect the answer is: Perfect scaling doesn't happen on real CPUs.
Turboboost lets a single thread go to higher frequencies than a fully loaded CPU. So you would expect "sum of user times" to increase even if "sum of user clock cycles" is scaling perfectly.
Hyperthreading is the next issue: multiple threads are not running independently, but might be fighting for resources on a single CPU core.
In a pure number-crunching algorithm limited by functional units, these first two points alone mean that using $(nproc) threads instead of 1 thread should be expected to more than double the user time!
Compilers of course are rarely limited by functional units: they do a decent bit of pointer-chasing, branching, etc. and are stalled a good bit of the time. (While OS-level blocking doesn't count as user time, the OS isn't aware of these CPU-level stalls, so they do count as user time!)
This is what makes hyperthreading actually helpful.
But compilers also tend to be memory/cache-limited. L1 is shared between the hyperthreads, and other caches are shared between multiple/all cores. So running multiple threads that compile different parts of the program in parallel means each thread of computation gets to work with a smaller portion of the cache -- the effective cache size decreases. That's another reason for the user time to go up.
And once you have a significant number of cache misses from a bunch of cores, you might be limited on memory bandwidth. At that point, also putting the last few remaining idle cores to work will not be able to speed up the real-time runtime anymore -- but it will make "user time" tick up faster.
In particularly unlucky combinations of working set size vs. cache size, adding another thread (bringing along another working set) may even increase the real time. Putting more cores to work isn't always good!
That said, compilers are more limited by memory/cache latency than bandwidth, so adding cores is usually pretty good. But it's not perfect scaling even if the compiler has "perfect parallelism" without any locks.
> Turboboost lets a single thread go to higher frequencies than a fully loaded CPU. So you would expect "sum of user times" to increase even if "sum of user clock cycles" is scaling perfectly.
Ah yes, this is a good one! I did not account for this. Mental model updated.
Your other points are good too. I considered some of them as well, but maybe not enough in the context of competition making many things just a bit slower. Makes sense.
User time is the amount of CPU time spent actually doing things. Unless you're using spinlocks, it won't include time spent waiting on locks or otherwise sleeping -- though it will include time spent setting up for locks, reloading cache lines and such.
Extremely parallel programs can improve on this, but it's perfectly normal to see 2x overhead for fine-grained parallelism.
I'd say there's still a gap in my mental model. I agree that it's normal to observe this, definitely. I see it in other tools that utilize parallelism too. I just can't square the 2x overhead part of it in a workload like Cargo's, which I assume is not fine-grained. I see the same increase in user time with ripgrep too, and its parallelism is maybe more fine-grained than Cargo's, but is still at the level of a single file, so it isn't that fine-grained.
But maybe for Cargo, parallelism is more fine-grained than I think it is. Perhaps because of codegen-units. And similarly for ripgrep, if it's searching a lot of tiny files, that might result in fine-grained parallelism in practice.
Yes, I know it's fine. I just don't understand the full details of why hyperthreading slows things down that much. There are more experiments that could be done to confirm or deny this explanation, e.g., disabling hyperthreading, and playing with the thread count a bit more.
Hyperthreading only duplicates the frontend of the CPU.
That's really it. That's the entire explanation. It's useful if and only if there are unused resources behind it, due to pipeline stalls or because the siblings are doing different things. It's virtually impossible to fully utilize a CPU core with a single thread; having two threads therefore boosts performance, but only to the degree that the first thread is incapable of using the whole thing.
I know all of that. There's still a gap because it doesn't explain in full detail how contended resources lead to the specific slowdown seen here. Hell, nobody in this thread has done the experiments necessary to confirm that HT is even the cause in the first place.
Spinlocks are normal userspace code issuing machine instructions in a loop that do memory operations. It is counted in user time, unless the platform is unusual and for some reason enters the kernel to spin on the lock. Spinning is the opposite of sleeping.
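For instance, here's a toy CPython sketch of that (a non-blocking try-lock stands in for the atomic test-and-set a real spinlock would use; `spin_acquire` and `holder` are made-up names): the waiter never sleeps in the kernel, so its entire wait shows up as user time.

```python
import os
import threading
import time

lock = threading.Lock()

def spin_acquire(lk):
    # Busy-wait in user mode: retry a non-blocking try-lock instead of
    # sleeping in the kernel. Each failed attempt is ordinary user-mode work.
    while not lk.acquire(blocking=False):
        pass

def holder():
    with lock:
        time.sleep(0.4)  # hold the lock while the main thread spins

before = os.times().user
h = threading.Thread(target=holder)
h.start()
time.sleep(0.05)         # give the holder a head start to grab the lock
spin_acquire(lock)       # spins in user mode until the holder releases
lock.release()
h.join()
spin_user = os.times().user - before
print(f"user time spent spinning: {spin_user:.2f}s")  # roughly the hold time
```

Swap `spin_acquire(lock)` for a plain blocking `lock.acquire()` and the user time collapses to near zero, because the blocked thread sleeps in the kernel instead of spinning.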
User time is the amount of CPU time spent in user mode.
It is aggregated across threads. If you have 8 threads running at 100% in user mode for 1 second, that gives you 8s of user time.
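You can see that aggregation directly (a minimal Unix/CPython sketch; `burn` and `measure` are made-up names, and processes are used instead of threads to sidestep the GIL): each worker does a fixed chunk of user-mode work, and the children's user times add up even though the wall time doesn't.

```python
import multiprocessing as mp
import resource
import time

def burn():
    # A fixed amount of pure user-mode work per process.
    x = 0
    for i in range(10_000_000):
        x += i

def measure(nprocs=4):
    ctx = mp.get_context("fork")  # assumes a Unix host for this sketch
    procs = [ctx.Process(target=burn) for _ in range(nprocs)]
    t0 = time.monotonic()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    wall = time.monotonic() - t0
    # ru_utime of waited-for children sums user time across all workers.
    total_user = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return wall, total_user

wall, total_user = measure()
print(f"wall {wall:.2f}s, summed child user time {total_user:.2f}s")
```

On a machine with 4+ free cores, the summed user time comes out around 4x the wall time; on a loaded or smaller machine the ratio shrinks, but the sum never goes below the actual work done.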
Total CPU time in user mode will normally increase when you add more threads, unless you're getting perfect or better-than-perfect scaling.
There are hardware reasons even if you leave any software scaling inefficiency to the side. For tasks that can use lots of threads, modern hardware trades off per-thread performance for getting more overall throughput from a given amount of silicon.
When you max out parallelism, you're using 1) hardware threads which "split" a physical core and (ideally) each run at a bit more than half the CPU's single-thread speed, and 2) the small "efficiency" cores on newer Intel and Apple chips. Also, single-threaded runs can feed a ton of watts to the one active core since it doesn't have to share much power/cooling budget with the others, letting it run at a higher clock rate.
All these tricks improve the throughput, or you wouldn't see that wall-time reduction and chipmakers wouldn't want to ship them, but they do increase how long it takes each thread to get a unit of work done in a very multithreaded context, which contributes to the total CPU time being higher than it is in a single-threaded run.