One caveat is that calling sched_yield is a fantastic throughput optimization for the spinning portion of an adaptive mutex. If you have parallel code that sporadically contends for short critical sections then my data has always said this:
- something like a futex lock is optimal
- you want barging, not fairness
- spinning the CPU is a questionable idea even if you do it for a short time. The pause instruction does something like jack and shit. Busy waiting of any kind seems great for microbenchmarks sometimes but it’s pretty bad for anything complex.
- spinning 40 times while calling sched_yield prior to futex_wait (or whatever) appears optimal on many schedulers and benchmarks. It’s hard to tell why. Best I can tell it’s because in the large parallel apps I’ve worked on and benchmarked, there are often threads that are ready to do some useful work right now, so yielding from a thread that can’t get a lock to one of those is a system throughput optimization. If there are no such threads then it’s a bit like spinning the CPU.
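The strategy in the bullets above can be sketched roughly like this, assuming Linux futexes. This is a minimal illustration, not production code: the names (`mutex_lock` etc.) are made up, and the state machine is the well-known simplified three-state futex mutex, with the yield loop from the discussion bolted on front.

```c
// Sketch of a barging adaptive mutex: spin 40 times with sched_yield,
// then park in the kernel with futex_wait. Linux-only; names illustrative.
#define _GNU_SOURCE
#include <stdatomic.h>
#include <sched.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_LIMIT 40 /* the empirically sticky constant from this thread */

typedef struct {
    atomic_int state; /* 0 = free, 1 = locked, 2 = locked with waiters */
} mutex_t;

static void futex_wait(atomic_int *addr, int expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void futex_wake_one(atomic_int *addr) {
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

void mutex_lock(mutex_t *m) {
    /* Phase 1: try to barge in, donating the timeslice between attempts
       instead of burning it on pause instructions. */
    for (int i = 0; i < SPIN_LIMIT; i++) {
        int expected = 0;
        if (atomic_compare_exchange_weak(&m->state, &expected, 1))
            return;
        sched_yield();
    }
    /* Phase 2: mark the lock as contended and sleep until woken. */
    while (atomic_exchange(&m->state, 2) != 0)
        futex_wait(&m->state, 2);
}

void mutex_unlock(mutex_t *m) {
    if (atomic_exchange(&m->state, 0) == 2)
        futex_wake_one(&m->state); /* barging: wake one, no handoff */
}
```

Note that this is barging by construction: the woken thread races with any newly arriving thread for the lock, and whoever gets there first wins.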
The 40 yields thing has a funny story for me. I learned of it from a comment in the JikesRVM code that was more than a decade old claiming it was optimal on their custom scheduler on a 12 core POWER box running AIX. The benchmark cited was something I’d never heard of. When I rewrote how JikesRVM did thread scheduling, I was sure that this would need tuning. Nope. It’s optimal on Linux and Darwin on x86 on basically any Java benchmark that has lock contention. I later implemented the same heuristic in WebKit’s locks. It’s still optimal.
To be sure, there is a broad plateau of goodness around 40. Like, 50 is about as good. But if you go order of magnitude in either direction, performance tanks. I’m superstitious so I’ve stayed with exactly 40 whenever I have to write this code.
Interesting. A Win32 CriticalSection is also a two-step primitive that spins for a short amount of time and then yields to the system scheduler.
It would be interesting to see the distribution of wait times for different locks. If such a two step approach performs well, a hypothesis might be that there are three kinds of locks:
- mostly uncontended: lock succeeds without spinning
- held for exceptionally short amounts of time: the lock usually gets released again before a waiting thread on a different CPU stops spinning. Wait time is lower than the time it takes to yield to the scheduler and/or reschedule.
- held for a really long time: wait time is much longer than scheduling another thread and re-scheduling the initial one.
And what you are saying seems to indicate that optimizing for the first and second kind is worth it because the overhead is negligible for the third; that is, there are few locks held for an amount of time where spinning adds noticeable overhead.
Do you have a reference for this? AFAIK the win32 critical section (like any modern mutex implementation) first uses atomic instructions to check if anyone is already in the critical section so it's really fast if no one is, and otherwise falls back to the OS synchronization objects.
The documentation for InitializeCriticalSectionAndSpinCount and SetCriticalSectionSpinCount describes the behavior. IIRC, the default spin count used to be pretty high (1000 loops or so). Not sure if that has changed.
Interesting. There's also that anecdote about the heap manager using a spin count of 4000. I wasn't aware this happened by default (I didn't see any mention of this in InitializeCriticalSection). I guess it's all down to the probability of contention vs. the amount of time the mutex is held.
I don’t have as much experience with the Win32 scheduler. But I remember getting conflicting data about the profitability of yielding. In particular, lots of threads yielding on Win32 can cost you a lot of overall system perf, if I remember right. Not so on Linux or Darwin, on the same workloads.
All of these little details vary dramatically depending on the exact CPU and workload. I've developed a wide variety of scheduling strategies and have used neural networks to predict when a given strategy will be better. Scheduling is a giant non-deterministic mess with no ideal answers.
Ok but my data disagrees with you. Specifically: when apps get complex enough, the differences between CPUs and schedulers wash out in the chaos.
I’ve tested this over the course of a decade on multiple Linuxes, multiple Darwins, multiple x86 flavors (Intel, AMD, and lots of core counts and topologies), POWER, and various ARMs, and on many large benchmarks in two very different languages (Java and C/C++). In Java I tested it in two very different VMs (JikesRVM and FijiVM). I think the key is that a typical benchmark for me is >million lines of code with very heterogeneous and chaotic locking behavior, stemming from the fact that there are hundreds (at least) of different hot critical sections of varying lengths and subtle relationships between them. So you get a law of large numbers or maybe wisdom of the masses kind of “averaging” of differences between CPUs and schedulers.
I’d love to see some contradictory data on similarly big stuff. But if you’re just saying that some benchmark with very homogeneous lock behavior (like ~one hot critical section in the code that always runs for a predictable amount of time and never blocks on weird OS stuff) experiences wild differences between CPUs and schedulers, then sure. But that just means there are no ideal answers for that scenario, not that there aren’t ideal answers for anyone.
I wrote a lot more about this stuff once: https://webkit.org/blog/6161/locking-in-webkit/