One caveat is that calling sched_yield is a fantastic throughput optimization for the spinning portion of an adaptive mutex. If you have parallel code that sporadically contends for short critical sections then my data has always said this:
- something like a futex lock is optimal
- you want barging, not fairness
- spinning the CPU is a questionable idea even if you do it for a short time. The pause instruction does something like jack and shit. Busy waiting of any kind seems great for microbenchmarks sometimes but it’s pretty bad for anything complex.
- spinning 40 times while calling sched_yield prior to futex_wait (or whatever) appears optimal on many schedulers and benchmarks. It’s hard to tell why. Best I can tell it’s because in the large parallel apps I’ve worked on and benchmarked, there are often threads that are ready to do some useful work right now, so yielding from a thread that can’t get a lock to one of those is a system throughput optimization. If there are no such threads then it’s a bit like spinning the CPU.
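The strategy in the bullets above can be sketched roughly like this, assuming Linux futexes. This is a minimal illustration, not production code: the names (`mutex_lock` etc.) are made up, and the state machine is the well-known simplified three-state futex mutex, with the yield loop from the discussion bolted on front.

```c
// Sketch of a barging adaptive mutex: spin 40 times with sched_yield,
// then park in the kernel with futex_wait. Linux-only; names illustrative.
#define _GNU_SOURCE
#include <stdatomic.h>
#include <sched.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_LIMIT 40 /* the empirically sticky constant from this thread */

typedef struct {
    atomic_int state; /* 0 = free, 1 = locked, 2 = locked with waiters */
} mutex_t;

static void futex_wait(atomic_int *addr, int expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void futex_wake_one(atomic_int *addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

void mutex_lock(mutex_t *m) {
    /* Phase 1: try to barge in, donating the timeslice between attempts
       instead of burning it on pause instructions. */
    for (int i = 0; i < SPIN_LIMIT; i++) {
        int expected = 0;
        if (atomic_compare_exchange_weak(&m->state, &expected, 1))
            return;
        sched_yield();
    }
    /* Phase 2: mark the lock as contended and sleep until woken. */
    while (atomic_exchange(&m->state, 2) != 0)
        futex_wait(&m->state, 2);
}

void mutex_unlock(mutex_t *m) {
    if (atomic_exchange(&m->state, 0) == 2)
        futex_wake_one(&m->state); /* barging: wake one, no handoff */
}
```

Note that this is barging by construction: the woken thread races with any newly arriving thread for the lock, and whoever gets there first wins.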
The 40 yields thing has a funny story for me. I learned of it from a comment in the JikesRVM code that was more than a decade old claiming it was optimal on their custom scheduler on a 12 core POWER box running AIX. The benchmark cited was something I’d never heard of. When I rewrote how JikesRVM did thread scheduling, I was sure that this would need tuning. Nope. It’s optimal on Linux and Darwin on x86 on basically any Java benchmark that has lock contention. I later implemented the same heuristic in WebKit’s locks. It’s still optimal.
To be sure, there is a broad plateau of goodness around 40. Like, 50 is about as good. But if you go order of magnitude in either direction, performance tanks. I’m superstitious so I’ve stayed with exactly 40 whenever I have to write this code.
Interesting. A Win32 CriticalSection is also a two-step primitive that spins for a short amount of time and then yields to the system scheduler.
It would be interesting to see the distribution of wait times for different locks. If such a two step approach performs well, a hypothesis might be that there are three kinds of locks:
- mostly uncontended: lock succeeds without spinning
- held for exceptionally short amounts of time: the lock usually gets released again before a waiting thread on a different CPU stops spinning. Wait time is lower than the time it takes to yield to the scheduler and/or reschedule.
- held for a really long time: wait time is much longer than scheduling another thread and re-scheduling the initial one.
And what you are saying seems to indicate that optimizing for the first and second kind is worth it because the overhead is negligible for the third; that is, there are few locks held for an amount of time where spinning adds noticeable overhead.
Do you have a reference for this? AFAIK the win32 critical section (like any modern mutex implementation) first uses atomic instructions to check if anyone is already in the critical section so it's really fast if no one is, and otherwise falls back to the OS synchronization objects.
The documentation for InitializeCriticalSectionAndSpinCount and SetCriticalSectionSpinCount describes the behavior. IIRC, the default spin count used to be pretty high (1000 loops or so). Not sure if that has changed.
Interesting. There's also that anecdote about the heap manager using a spin count of 4000. I wasn't aware this happened by default (I didn't see any mention of this in InitializeCriticalSection). I guess it's all down to the probability of contention vs. the amount of time the mutex is held.
I don’t have as much experience with the Win32 scheduler. But I remember getting conflicting data about the profitability of yielding. In particular, lots of threads yielding on Win32 can cost you a lot of overall system perf, if I remember right. Not so on Linux or Darwin, on the same workloads.
All of these little details vary dramatically depending on the exact CPU and workload. I've developed a wide variety of scheduling strategies and have used neural networks to predict when a given strategy will be better. Scheduling is a giant non-deterministic mess with no ideal answers.
Ok but my data disagrees with you. Specifically: when apps get complex enough, the differences between CPUs and schedulers wash out in the chaos.
I’ve tested this over the course of a decade on multiple Linuxes, multiple Darwins, multiple x86 flavors (Intel, AMD, and lots of core counts and topologies), POWER, and various ARMs, and on many large benchmarks in two very different languages (Java and C/C++). In Java I tested it in two very different VMs (JikesRVM and FijiVM). I think the key is that a typical benchmark for me is >million lines of code with very heterogeneous and chaotic locking behavior, stemming from the fact that there are hundreds (at least) of different hot critical sections of varying lengths and subtle relationships between them. So you get a law of large numbers or maybe wisdom of the masses kind of “averaging” of differences between CPUs and schedulers.
I’d love to see some contradictory data on similarly big stuff. But if you’re just saying that some benchmark with very homogeneous lock behavior (like ~one hot critical section in the code that always runs for a predictable amount of time and never blocks on weird OS stuff) experiences wild differences between CPUs and schedulers, then sure. But that just means there are no ideal answers for that scenario, not that there aren’t ideal answers for anyone.
I wrote a lot more about this stuff once: https://webkit.org/blog/6161/locking-in-webkit/