The author has an implicit definition of "faster" which it is important to be aware of.
The main use of spinlocks that I'm aware of is minimising latency in inter-processor communication.
That is, if you have a worker task which is waiting for a supervisor task to tell it to do something, then to minimise the time between the supervisor giving the order and the worker getting to work, use a spinlock.
For this to really work, you need to implement both tasks as threads pinned to dedicated cores, so they won't be preempted. You will burn a huge amount of CPU doing this, and so it won't be "faster" from a throughput point of view. But the latency will be as low as it's possible to go.
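A minimal sketch of that pattern on Linux (assuming the libc crate; the core numbers and the pin_to_core helper are mine, for illustration only): the supervisor flips an atomic flag, and the pinned worker busy-waits on it with a PAUSE hint so the handoff never goes through the scheduler.

    // Sketch only: pinned supervisor/worker handoff via a spin-wait flag.
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::{hint, thread};

    fn pin_to_core(core: usize) {
        // Linux-specific; uses the libc crate.
        unsafe {
            let mut set: libc::cpu_set_t = std::mem::zeroed();
            libc::CPU_ZERO(&mut set);
            libc::CPU_SET(core, &mut set);
            // pid 0 = the calling thread
            libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set);
        }
    }

    static GO: AtomicBool = AtomicBool::new(false);

    fn main() {
        let worker = thread::spawn(|| {
            pin_to_core(2); // dedicated core, nothing else scheduled here
            while !GO.load(Ordering::Acquire) {
                hint::spin_loop(); // PAUSE: stay hot so the handoff latency is tiny
            }
            // ... do the work ...
        });

        pin_to_core(3);
        // supervisor decides it's time:
        GO.store(true, Ordering::Release);
        worker.join().unwrap();
    }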
>The main use of spinlocks that I'm aware of is minimising latency in inter-processor communication.
The main use of spinlocks that I'm aware of is dealing with interrupt handlers in driver code. In this situation you generally can't go to sleep (sleeping with interrupts disabled is generally a good way of never waking up) so calling "mutex_lock" is simply out of the question. That's probably a niche use but that's literally the only situation I've ever used actual spin locks instead of mutexes, mainly for the reasons outlined by TFA.
I did some algorithmic / high-frequency trading and we were only spinning waiting for work, never, ever releasing to the OS. But then the entire server was dedicated to one job, and the job consisted of reacting to a single stream of messages.
Most of the time I would say if lock performance impacts your application you are likely doing something wrong. There are many ways of solving typical synchronization problems without locks and if you need locks there are many ways to amortize costs.
Mutexes are not faster than spin locks the same way slower cars are not faster than fast cars. You might still crash in a fast car and be slower to the end of the race (to the supermarket) but that is just your failure to use the power you were given responsibly.
I personally go with "The Fine Article" in my mind when I write it, although obviously all these alternatives are possible. I first encountered this initialism on slashdot back when it was still relevant, although I don't know if it was coined there.
And yeah, I believe that it derives from "RTFA" (read the fucking article) which would be what you told people who obviously commented without reading the article.
And this only makes sense in a multi-core system (where the spinlock is held on another core). In a single core system if you are trying to take a contended lock in an interrupt you are stuck and can only fail to take it and hope to handle that gracefully.
Y'all should consider using atomic increment on separate cache lines instead of spinlocks. If you want to minimize latency to the bare minimum, atomic increment gives you a measurable improvement of about two orders of magnitude over locks. https://lmax-exchange.github.io/disruptor/files/Disruptor-1....
They solve different problems. A lock-free data structure can be much faster than one with locks, but it depends on exactly what operations you can do fast and atomically (for example, the LMAX queue length is limited to a power of two because there's no way to atomically increment an integer that wraps at arbitrary values, and the modulo operator is too expensive for non-power-of-two values). A lock is a primitive which allows you to make arbitrary operations 'atomic' (though in these benchmarks a simple increment is generally used). A lock-free queue may be a good replacement for a queue using locks, but that is only one application of locks.
Also, the disruptor design is based on a system where there is a 1:1 correspondence between threads and cores. If this is not the case, it will likely still interact somewhat poorly with the OS's scheduler even without any blocking at all, for the same reason as spinlocks (in fact, if you use the library you can customize this behaviour).
Compare-and-swap isn't quite the same as atomic increment. An atomic increment can't fail; it always increments. Whereas CAS will fail if a different thread has modified the value.
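For example (a tiny illustrative Rust snippet, not from the paper above; the names are mine):

    use std::sync::atomic::{AtomicU64, Ordering};

    static COUNTER: AtomicU64 = AtomicU64::new(0);

    fn demo() {
        // fetch_add can't fail: every caller gets a unique previous value,
        // even if many threads hit it at the same time.
        let my_slot = COUNTER.fetch_add(1, Ordering::Relaxed);

        // compare_exchange only succeeds if nobody raced us; otherwise we
        // have to loop and retry (the usual CAS pattern behind a spinlock).
        let mut cur = COUNTER.load(Ordering::Relaxed);
        loop {
            match COUNTER.compare_exchange(cur, cur + 1, Ordering::Relaxed, Ordering::Relaxed) {
                Ok(_) => break,
                Err(actual) => cur = actual, // lost the race, try again
            }
        }
        let _ = my_slot;
    }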
The highest performance design is to use a ringbuffer. To write to the ringbuffer, you atomic-increment the "claim" counter, giving you an index. Now take index modulo ringbuffer size. That's the slot to write to. After you're done writing to it, set your "publish" counter to the index.
Each writer has a "publish" counter, and the "claim" counter isn't allowed to move beyond the lowest "publish" counter modulo ringbuffer size.
Each reader uses a similar strategy: the reader has a current counter. You find the minimum of all publish counter values, then process each slot up to that value, and set your counter to that value. The "claim" counter isn't allowed to move past the minimum of all reader counters.
Hence, everyone is reading from / writing to separate cache lines, and there are no locks at all. The sole primitive is atomic increment, which can never fail. The only failure condition is "one of the slots hasn't been processed yet" (i.e. the publish counter is <= claim counter minus ringbuffer size), at which point everybody waits for the slot to be processed.
You can wait using multiple strategies: a spin loop (which isn't the same as a spinlock because you're only reading a value in memory, not calling any atomic primitives), yielding CPU time via sched_yield(), or calling sleep. All strategies have tradeoffs. Calling sleep is the least CPU intensive, but highest latency. Vice-versa for spin loop.
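A stripped-down single-producer / single-consumer sketch of the claim/publish scheme in Rust (the real Disruptor generalizes this to multiple writers with per-writer publish counters and pluggable wait strategies; all names here are mine, not LMAX's):

    use std::hint;
    use std::sync::atomic::{AtomicU64, Ordering};

    const SIZE: u64 = 1024; // power of two, so `& MASK` replaces the modulo
    const MASK: u64 = SIZE - 1;

    pub struct Ring {
        slots: Vec<AtomicU64>,
        claim: AtomicU64,   // next sequence a writer will take
        publish: AtomicU64, // everything below this has been written
        read: AtomicU64,    // everything below this has been consumed
    }

    impl Ring {
        pub fn new() -> Self {
            Ring {
                slots: (0..SIZE).map(|_| AtomicU64::new(0)).collect(),
                claim: AtomicU64::new(0),
                publish: AtomicU64::new(0),
                read: AtomicU64::new(0),
            }
        }

        // Single producer: claim a sequence, wait for space, write, then publish.
        pub fn push(&self, value: u64) {
            let seq = self.claim.fetch_add(1, Ordering::Relaxed); // never fails
            while seq >= self.read.load(Ordering::Acquire) + SIZE {
                hint::spin_loop(); // ring full: wait for the reader to catch up
            }
            self.slots[(seq & MASK) as usize].store(value, Ordering::Relaxed);
            self.publish.store(seq + 1, Ordering::Release);
        }

        // Single consumer: one Acquire load of `publish` covers every slot
        // published so far, so a whole batch is drained without further RMWs.
        pub fn drain(&self, mut f: impl FnMut(u64)) {
            let from = self.read.load(Ordering::Relaxed);
            let to = self.publish.load(Ordering::Acquire);
            for seq in from..to {
                f(self.slots[(seq & MASK) as usize].load(Ordering::Relaxed));
            }
            self.read.store(to, Ordering::Release);
        }
    }

The only read-modify-write is the fetch_add on the claim counter; everything else is acquire/release loads and stores on counters that each side owns.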
From the architecture PoV it is exactly the same thing if you think of it as implemented by an LL/SC pair. This is how it is presented in most of the literature and university courses. Then there is the purely practical issue of x86 not exposing LL/SC and instead having a somewhat strict memory model and various lock-prefixed high-level instructions with wildly varying performance characteristics.
I'd be interested to see benchmarks showing that spinlocks with CAS have similar throughput to a ringbuffer with atomic increment.
Note that with the ringbuffer approach, each reader can process many slots at once, since you're taking the minimum of all published slot numbers. If you last processed slot 3, and the minimum publish count is 9, you can process slot 4 through 9 without doing any atomic operations at all. The design guarantees safety with no extra work.
It's not a minor trick; it's one of the main reasons throughput is orders of magnitude higher.
Beyond that, the ringbuffer approach also solves the problem you'll always run into: if you use a queue, you have to decide the max size of the queue. It's very hard to use all available resources without allocating too much and swapping to disk, which murders performance. Real systems with queues have memory usage patterns that tend to fluctuate by tens of GB, whereas a fixed-size ringbuffer avoids the problem.
I do not think x86 atomics are implemented as LL/SC internally. At a minimum they always guarantee forward progress: as soon as the cache line is acquired in exclusive mode (and the coherency protocol guarantees that happens in finite time), the load-op-write always happens and cannot be interrupted.
Also, as far as I'm aware, at least on Intel all atomic operations take pretty much exactly the same number of clock cycles (except for CAS and DCAS, which are ~10% and 50% more expensive, IIRC).
That is exactly my point. Any x86 SMP platform since the Pentium is built on the assumption that truly exclusive access is possible. For the shared-FSB platforms that is trivially implemented by the global LOCK# signal, and K8/QPI simply has to somehow simulate the same behavior on top of some switched NUMA fabric (and this is one of the reasons why x86 NUMA coherency protocols are somewhat wasteful, think global broadcasts, and incredibly complex).
For context: before the Pentium, with its glueless 2x2 SMP/redundancy support, there were various approaches to shared-memory x86 multiprocessors with wildly different memory coherence models. (And some of the “let's design a board with eight 80386s” systems are the reason why Intel designed the i586 to be glueless; such systems are probably still used to this day, although unsupported.)
No, what is needed to implement x86 atomic semantics is the guarantee that a single cache line can be held in exclusive mode for a minimum length of time.
As forward progress is a pretty basic requirement, in practice even LL/SC platforms provide it, but instead of having a single instruction with guaranteed forward progress you have to use a few special (but sometimes underspecified) sequences of instructions between the LL/SC pairs.
FWIW RISC-V guarantees forward progress for reasonable uses:
> We mandate that LR/SC sequences of bounded length (16 consecutive static instructions) will eventually succeed, provided they contain only base ISA instructions other than loads, stores, and taken branches.
What happens if those 16 instructions touch 16 different cache lines? I'm not a hardware expert (and even less so on coherency protocols), but I think it would be extremely hard to make sure livelocking is avoided in all cases, short of having some extremely expensive and heavy-handed global 'bus' lock fallback.
Reading and writing memory are excluded from the guarantee, aside from the LR/SC instructions that bookend a transaction. Inside the transaction you're basically limited to register-register ALU and aborting branches.
Also, if both threads are pinned to separate cores and nothing else is supposed to run on those cores, it is pointless to use anything but spinlocks, as there is no other thread that could make better use of the core (and you probably do not want the core to go to a low-power state waiting for an interrupt).
In addition to saving energy, thermals are another reason: parking a core allows it to cool down, so that when it is put back in use (milli)seconds later, it can run at a higher clock speed for longer.
That's not extremely low power compared to real low-power states. The main advantage of PAUSE is the scheduling of the other hyperthread (if it exists) and maybe not generating gratuitous L1 / MESI traffic at a crazy rate (well, if programmed correctly that should be quite cheap in lots of cases, but still...). To my knowledge this does not gate any clocks, so the power savings are going to be minimal.
IIRC the mov imm, %ecx; rep nop sequences are somewhat special-cased by modern architectures (and this fact is the only reason why you would even want to execute such code). On the other hand the energy savings are mostly negligible and it is simply an SMT-level equivalent of sched_yield().
Actually I heard that the last few generations of Intel (from Skylake on) enter a low-power state more aggressively on PAUSE, and the latency of getting out of a PAUSE went up from tens of cycles to hundreds. No first-hand testing though.
This is exactly the situation for a well-balanced parallel work queue. You want to start as many threads as there are cores and run them full tilt, pulling work off the queue until it is empty. This is very common if you're running a large-scale cluster dedicated to a particular task (e.g. servicing a special kind of query, encoding videos, rendering), or even a parallel Photoshop filter.
For this technique you generally dedicate some core(s) to those miscellaneous threads and flag the rest as unschedulable unless a thread is specifically assigned to them.
If you’re not sharing cores between VMs it’s typical to do the same at the hypervisor layer.
Yes, in practice you have to dedicate the whole machine to a specific application, but one thread per isolated core is a proven model for high-performance/low-latency applications.
I wrote some software at an automated trading firm that would do the subset of "parsing ticks and maintaining order books" sufficient to map ticks to symbols, and no more work than that. My code was pinned to one core and spun waiting for an expensive network card to DMA some ticks into memory, then when ticks we cared about were found it would load them onto one or more queues. Each queue corresponded to a different thread spin waiting on its own pinned core.
So one use case is "transforming ticks into trading decisions slightly faster than you otherwise could"
Software industrial process control, such as motors and robotics. Longer latencies make the control loop less responsive or unstable; microsecond granularity is very useful there.
A friend of mine recently told me how Windows schedules CPU-bound threads onto different cores to prevent thermal throttling, so I now wonder if we were mucking things up by running our CPUs too hot.
If you weren't pegging thermal max you shouldn't be seeing thermal limiting. Although you're probably right that cooling is something software people are more likely to overlook than SRE/hardware/lab folks.
Essentially any true HPC workload. You can go pretty far by ignoring latency and doing some series of map and reduce stages on top of a giant conceptual file with large-ish records. But there are workloads where the data for these partial parallel jobs is on the order of a machine word or two, and then you really care about communication latency between cores.
This experiment is a bit weird. If you look at https://github.com/matklad/lock-bench, this was run on a machine with 8 logical CPUs, but the test is using 32 threads. It's not that surprising that running 4x as many threads as there are CPUs doesn't make sense for spin locks.
I did a quick test on my Mac using 4 threads instead. At "heavy contention" the spin lock is actually 22% faster than parking_lot::Mutex. At "extreme contention", the spin lock is 22% slower than parking_lot::Mutex.
Heavy contention run:
$ cargo run --release 4 64 10000 100
Finished release [optimized] target(s) in 0.01s
Running `target/release/lock-bench 4 64 10000 100`
Options {
n_threads: 4,
n_locks: 64,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 2.822382ms min 1.459601ms max 3.342966ms
parking_lot::Mutex avg 1.070323ms min 760.52µs max 1.212874ms
spin::Mutex avg 879.457µs min 681.836µs max 990.38µs
AmdSpinlock avg 915.096µs min 445.494µs max 1.003548ms
std::sync::Mutex avg 2.832905ms min 2.227285ms max 3.46791ms
parking_lot::Mutex avg 1.059368ms min 507.346µs max 1.263203ms
spin::Mutex avg 873.197µs min 432.016µs max 1.062487ms
AmdSpinlock avg 916.393µs min 568.889µs max 1.024317ms
Extreme contention run:
$ cargo run --release 4 2 10000 100
Finished release [optimized] target(s) in 0.01s
Running `target/release/lock-bench 4 2 10000 100`
Options {
n_threads: 4,
n_locks: 2,
n_ops: 10000,
n_rounds: 100,
}
std::sync::Mutex avg 4.552701ms min 2.699316ms max 5.42634ms
parking_lot::Mutex avg 2.802124ms min 1.398002ms max 4.798426ms
spin::Mutex avg 3.596568ms min 1.66903ms max 4.290803ms
AmdSpinlock avg 3.470115ms min 1.707714ms max 4.118536ms
std::sync::Mutex avg 4.486896ms min 2.536907ms max 5.821404ms
parking_lot::Mutex avg 2.712171ms min 1.508037ms max 5.44592ms
spin::Mutex avg 3.563192ms min 1.700003ms max 4.264851ms
AmdSpinlock avg 3.643592ms min 2.208522ms max 4.856297ms
The top comment opens up the concept of latency versus throughput. My interpretation is that this experiment is demonstrating that optimizing only for latency has consequences elsewhere in the system. Which is not surprising at all, but then again I spend a lot of time explaining unsurprising things.
I remember a sort of sea change in my thinking on technical books during a period where I tended to keep them at work instead of at home. I noticed a curious pattern in which ones were getting borrowed and by whom. Reading material isn't only useful if it has something new to me in it. It's also useful if it presents information I already know and agree with, in a convenient format. Possibly more useful, in fact.
If you only have 4 threads it is likely that all your CPUs are sharing caches and you won't see the real downside of the spinlock. They don't really fall apart until you have several sockets.
Modern user-mode pthread mutexes use the futex kernel syscall [1] to implement the lock, which avoids the syscall in the non-contended case, so it can be very fast, acquiring the lock entirely in user-mode code. I'm not sure whether Rust's mutex API is a wrapper over the pthread mutex or calls some older kernel locking syscall directly.
Basically the user-mode mutex lock is implemented as:
// In user mode: if the lock flag is free (0), CAS it to 1 and return
while !atomic_compare_and_swap(&lock, 0, 1)
    // Lock not free: sleep until woken by the unlocking thread
    futex_wait(&lock, 1) // syscall to kernel; only sleeps if lock is still 1
When futex_wait() returns, the unlocking thread has set the flag back to 0 and the kernel has woken my thread so I can check it again. However, another thread can come in and grab the lock in the meantime, so I need to loop back and check again. The atomic CAS operation is the one that actually acquires the lock.
So it acquires the lock fully in userspace if there's no contention, it even spins a bit if there is contention, and then if that wasn't enough it lets the thread sleep with a timeout.
"This library provides implementations of Mutex, RwLock, Condvar and Once that are smaller, faster and more flexible than those in the Rust standard library"
"Uncontended lock acquisition and release is done through fast inline paths which only require a single atomic operation.
Microcontention (a contended lock with a short critical section) is efficiently handled by spinning a few times while trying to acquire a lock.
The locks are adaptive and will suspend a thread after a few failed spin attempts."
The only thing that I'm missing is how often the sleeping threads wake, as a bad constant can increase the CPU use.
No, a thread that fails to acquire the mutex sleeps until the thread that is releasing the mutex explicitly wakes it. On Linux this is achieved via FUTEX_WAIT / FUTEX_WAKE.
It doesn't attempt to determine the size of the critical section, AFAIK. I think it's saying that sleeping immediately when failing to acquire the lock would significantly hurt small critical sections, and so instead they spin a few times to avoid punishing those scenarios.
The P in pthread is the P from POSIX, which stands for portable.
Now, the "zero syscalls for uncontended pthread mutices via futex" optimization is Linux specific and may not be replicated elsewhere. Or it may. It's not Posix, but I know for instance win32 critical sections look a lot like spinlocks when not contended but do syscalls to block, which sounds a lot like a futex. So that would put that technique as dating to the 1990s at the latest. Futex landed in 2002.
Avoiding pthreads because they aren't portable and replacing them with a spinlock sounds a little bit like madness, that was my point.
What is the non-posix target you have in mind? Windows? Seems like conditionally compiling against either pthreads or win32 critical section [or anything else] is a feasible thing and reasonable action. Maybe even the spinlock as a last resort.
"So you might want to look into not the standard library implementation, but specific locking implentations for your particular needs. Which is admittedly very very annoying indeed. But don't write your own. Find somebody else that wrote one, and spent the decades actually tuning it and making it work.
Because you should never ever think that you're clever enough to write your own locking routines.. Because the likelihood is that you aren't (and by that "you" I very much include myself - we've tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).
There's a reason why you can find decades of academic papers on locking. Really. It's hard."
We've been arguing about concurrency primitives literally for decades and the 'worst' part is that for most of that time, all of the competing solutions were documented by the same individual - Tony Hoare - within a narrow period in the early 1970's. Soon that will be 50 years ago. Watching people argue is like the People's Liberation Front of Judea scene in Life of Brian. As far as I know, borrow checking may be the first real change in that arena in decades, and I bet even that is older than most of us know.
I was getting heckled recently about my allergy to trying to negotiate interprocess consensus through the filesystem. I've seen similar conversations about how hard it is to 'know' the state of files, especially in a cross platform or cross filesystem way (see also the decade old fsync bug in PostgreSQL we were talking about early last year). In our case several dozen machines all have to come to the same decision at the same time (because round robin) and I was having none of it. I eventually had to tell him, in the nicest way possible, to fuck off, I'm doing this through Consul.
The thing is that people who generally don't learn from their mistakes absolutely do not learn from their old mistakes. So for any bug they introduce (like a locking problem) that takes months or quarters to materialize, they will not internalize any lessons from that experience. Not only wasn't I gonna solve the problem the way he wanted, but if he tried to take over it, we'd have broken code at some arbitrary future date and he'd learn nothing. He could not understand why I in particular and people in general were willing to die on such a hill.
Anger is not the best tool for communication, but as someone once put it, it's the last signal you have available that your boundaries are being violated and this needs to stop. Especially if you're the one who will have to maintain a bad decision. As often as I critique Linus for the way he handles sticky problems, on some level he is not wrong.
Back around 2012 I worked with a guy, a FreeBSD kernel committer, who insisted volatile was sufficient as a thread synchronization primitive. He convinced our boss.
Volatile doesn’t even guarantee that data is written atomically in one step and not e.g. byte-wise. It also allows both the compiler and the CPU to reorder it with any read or write. I can’t think of anything it could be used for in a multithreaded environment.
There is really only a single place volatile actually works, and that is for memory-mapped hardware registers. Anybody who says it is useful for anything else is badly mistaken.
Except in MSVC, where it kinda/sorta means atomic.
It allows the compiler to reorder it with any non-volatile read or write. Of course it still doesn't indicate anything to the CPU. In single-core embedded systems where the CPU doesn't reorder anything and you know the compiler is going to emit a single instruction for a read or write it can be sufficient (for example, this is how FreeRTOS implements all of its threading primitives)
This definitely wasn't one. It didn't bite in any obvious way because Intel, and because the system already had so many other bugs. (Free advice: don't take on a C++ program written by a Java coder.)
This is a great example of how so many programmers think they are the first to discover something that is "ancient". In reality this is a problem older than Linux (linux just had the advantage of a huge and diverse open-source userbase from which to draw data). On the flipside, lots of good information takes years to go from discussion groups to books, so it can be hard to find relevant research. Applied compsci is literally the cutting edge of reality.
> Second, the uncontended case looks like
> Parking_lot::Mutex avg 6ms min 4ms max 9ms
This estimate is way too high for the uncontested mutex case. On a modern Linux/Xeon system using GCC, an uncontested mutex lock/unlock is well under 1 microsecond.
I have a lot of experience here from writing low-latency financial systems. The hot path we use is littered with uncontested mutex lock/unlock, and the whole path still runs under 20 microseconds. (With the vast majority of that time unrelated to mutex acquisition.)
The benchmark used in the blog post must be spending the vast majority of its time in some section of code that has nothing to do with lock/unlock.
You're misreading the benchmark, that's 6ms for 10,000 lock/unlocks per thread, 320,000 lock/unlocks total. In other words 0.6 microseconds per thread per lock.
That's still unreasonably high, isn't it? Even a Go sync.Mutex, not exactly a hot-rod implementation, can be acquired and released in < 50ns on the garbage hardware I have before me.
On Intel (and probably very similarly on AMD) the cost of a completely uncontended, cache-hit, simple spin lock acquisition is ~20 clock cycles, while the release is almost free.
This absolutely makes sense in userspace. The most important part of a spinlock in an OS is that you can yield to the scheduler instead of taking up CPU time on the core. But that defeats the purpose of using a spinlock when in userspace because you still have to syscall
A spinlock in the kernel typically only spins. You use it for the cases when you can't schedule... So it better will be for a short time only.
But the concept of a short time does not even exist deterministically in userspace, usually, because you can always be preempted. So don't use pure spinlocks in userspace, unless you really, really know what you are doing (and that includes knowing how your kernel works in great detail, in the context of how you use it).
MONITOR/MWAIT will get you the part where a thread pauses and the CPU can rest until a store to an address range, at which point the waiting thread is allowed to continue. You can't wait on a specific value though.
Note that all of the locks tested here are unfair, which is why they all show very high waiting variance. Until recently many mutex implementations aimed for fairness, which made them much slower than spinlocks in microbenchmarks like this.
Note that fair locking is still expensive. parking_lot achieves both speed and fairness by starting with unfair locking and falling back to fair locking.
This comes around every so often, and it isn't very interesting in that the best mutexes basically spin once or a couple of times and then fall back to blocking. It isn't a true pure-spinlock vs pure-blocking-lock (mutex/futex) fight.
I think the Linux futex can be implemented through the vDSO (can somebody correct me on this?), so that eliminates the worst of the syscall costs.
His benchmark is weird, but maybe I'm reading it wrong:
* Instead of reducing thread count to reduce contention he appears to increase the number of locks available. This is still a bad scenario for spinlocks since they will still have bad interactions with the scheduler (they will use a large amount of CPU time, get evicted from the run queue, and need to be rescheduled).
* Also, I didn't see him pin any of the threads, so all those threads will start sharing some CPUs, since the OS does need to do some work on them too.
* And Rust can be a little hard to read, but it seems he packed his locks on the same cache line? I don't see any padding in his AmdSpinlock struct. That would be a huge blow for the spinlock because of the false-sharing issues. He's still getting all the cache coherence traffic because of it.
The worst cases for the spinlock are not understanding scheduling costs and the cache thrashing that can occur.
What they call the AMD spinlock (basically just a regular sane spinlock that tries to prevent cache thrashing) has its best performance with a low number of threads, assigned to different cores under the same L3 segment.
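For reference, a cache-line-padded test-and-test-and-set spinlock along those lines might look like this (a sketch of the general technique, not the benchmark's actual AmdSpinlock):

    use std::hint;
    use std::sync::atomic::{AtomicBool, Ordering};

    // Own a whole cache line so contention on this lock doesn't
    // false-share with neighbouring data (64-byte lines assumed).
    #[repr(align(64))]
    pub struct PaddedSpinlock {
        locked: AtomicBool,
    }

    impl PaddedSpinlock {
        pub const fn new() -> Self {
            PaddedSpinlock { locked: AtomicBool::new(false) }
        }

        pub fn lock(&self) {
            loop {
                // Cheap read-only spin first: the line stays in Shared state
                // instead of being ping-ponged by failed atomic RMWs.
                while self.locked.load(Ordering::Relaxed) {
                    hint::spin_loop(); // PAUSE / be nice to the SMT sibling
                }
                // Only now attempt the expensive exclusive-ownership RMW.
                if !self.locked.swap(true, Ordering::Acquire) {
                    return;
                }
            }
        }

        pub fn unlock(&self) {
            self.locked.store(false, Ordering::Release);
        }
    }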
(Does anybody know if AMD's new microarchitecture went away from the MOESI/directory based cache towards Intel's MESIF/snoop model?)
The MOESI model might have performed better in the worst-case scenario since it doesn't need to write the cache line back and can just forward the dirty line around, as it keeps track of who owns it.
And if you run under an MESIF-based cache and you can keep your traffic local to your L3 segment, you are backstopped there and never need to go anywhere else.
A spinlock is a performance optimization and should be treated as one. You need to have intimate knowledge of the architecture you are running under and the resources being used.
(edit: answered my own question, apparently the vDSO is still pretty limited in what it exports, so no, it isn't happening at this time from what I can tell.)
The locks should be on separate cachelines, that's what the CachePadded::new is for.
futex cannot be implemented in the VDSO since it needs to call into the scheduler.
Another way to think about this: VDSO is used for syscalls that are (mostly) read-only and can avoid the kernel mode switch on a common fast path. The futex syscall is already the raw interface that is designed on the assumption that the caller only uses it in the slow path of whatever higher-level synchronization primitive they're implementing, so trying to use VDSO tricks to implement futex would be redundant.
> The locks should be on separate cachelines, that's what the CachePadded::new is for
I see that now. I was looking for it in the AmdSpinlock struct, but that kind of makes sense.
> The futex syscall is already the raw interface that is designed on the assumption that the caller only uses it in the slow path of whatever higher-level synchronization primitive they're implementing, so trying to use VDSO tricks to implement futex would be redundant.
Ah. Thanks. I didn't know how far you could get with MWAIT, but I guess you still need to deschedule. I also didn't realize futex was a direct syscall and there was no user level api going on around it.
Is he running 32 threads even in the low contention case? And not pinning? There's something about his numbers that just seems a little too high for what I would expect. I've seen this around a lot, and the reason the mutex usually wins is that it basically does a spin of 1 or more and then goes into a mutex (the pthread code appears to spin 100 times before falling back to a futex).
At work I use a spin lock on a shared memory region because it does test out to be lower latency than std::mutex and we're not under much contention. I've thought about replacing it with a light futex-based library, but it doesn't seem to be quicker.
He still seems to be getting some contention, and I'm trying to figure out how.
> Linux futex can be implemented through the vDSO (can somebody correct me on this), so that eliminates the worst of the syscall costs.
A futex wait is a request to the kernel to put the thread to sleep until another thread issues a signal for the same "key".
The trick is that the key is the address of a 32 bit memory location (provided to the kernel in both the signal and wait syscalls), which is normally the mutex address.
So a process first tries to acquire the mutex as for a normal spin lock (with a CAS, exchange, xadd or whatever works for the specific algorithm), attempting to set the lock bit to 1; if it fails (possibly after spin-trying a few times), it invokes the futex wait. As you can see, the syscall is only done in the slow path.
On a futex release, the simple solution is to always invoke the futex signal syscall after setting the lock bit to zero[1]. To fast-path the release, a waiter bit can be set on the acquire path just before calling futex wait, so that on the release path, when setting the lock bit to zero, signal is only called if the waiter bit was set (and it is then cleared together with the lock bit)[2].
As you can see, the futex syscall is already the slow path and never needs to be invoked in the uncontended case. In fact the kernel doesn't even need to know about the futex until the first contention.
[1] Futexes are edge-triggered, same as condition variables, so a signal will only wake threads blocked on a wait that happened-before the signal call. Thus there is a race condition if a lock release and signal happen between the failed acquire attempt and the wait call; to prevent this, futex wait has an additional parameter that is the expected value of the memory location: the kernel checks the futex address against this value and will only put the thread to sleep if it hasn't changed (this is done atomically).
[2] As there could be multiple waiters, a broadcast is actually needed here, which leads to a thundering herd as all waiters race to acquire the lock. The original futex paper used a wait count instead of just a bit, but there are other options.
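Putting that together, a minimal Linux-only sketch of a futex-backed mutex (assuming the libc crate; the 0 = free / 1 = locked / 2 = locked-with-waiters state plays the role of the waiter bit, roughly in the style of Drepper's "Futexes Are Tricky"):

    use std::sync::atomic::{AtomicU32, Ordering};

    // Raw futex syscall wrappers (Linux only).
    fn futex_wait(addr: &AtomicU32, expected: u32) {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                addr as *const AtomicU32,
                libc::FUTEX_WAIT,
                expected,
                std::ptr::null::<libc::timespec>(),
            );
        }
    }

    fn futex_wake_one(addr: &AtomicU32) {
        unsafe {
            libc::syscall(libc::SYS_futex, addr as *const AtomicU32, libc::FUTEX_WAKE, 1);
        }
    }

    // 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
    pub struct FutexMutex {
        state: AtomicU32,
    }

    impl FutexMutex {
        pub const fn new() -> Self {
            FutexMutex { state: AtomicU32::new(0) }
        }

        pub fn lock(&self) {
            // Fast path: uncontended, one CAS, no syscall.
            if self.state.compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed).is_ok() {
                return;
            }
            // Slow path: advertise that we are waiting, then sleep in the kernel.
            loop {
                // Mark "locked with waiters"; swap returns the old value.
                if self.state.swap(2, Ordering::Acquire) == 0 {
                    return; // it was actually free, and we now own it (as 2)
                }
                // Sleep only if the state is still 2; the kernel checks this
                // atomically, which closes the race against a concurrent unlock.
                futex_wait(&self.state, 2);
            }
        }

        pub fn unlock(&self) {
            // If the old value was 2, someone may be asleep: wake one waiter.
            if self.state.swap(0, Ordering::Release) == 2 {
                futex_wake_one(&self.state);
            }
        }
    }

A real implementation would also use FUTEX_PRIVATE_FLAG, check the syscall's return value, and usually spin a few times before the first futex_wait, as discussed elsewhere in the thread.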
Calling this 'elision' is a bit confusing, since the glibc developers use the term 'lock elision' to refer to hardware lock elision via Intel TSX extensions.
What eliminates the syscall cost is optimistic spinning in userspace, so that locking only does futex_wait in cases of contention when the lock isn't released 'soon' after the thread starts trying to acquire it.
In a spin lock, the lock state is checked in a tight loop by all waiters. This will be using some sort of memory fence. FWIK, memory fences or barriers flush the CPU cache and would initiate reading the variable (the spin lock state) for evaluation. I would expect spin-locking overheads to increase with the number of cores.
On NUMA, I think flushing is more expensive.
Hence, spin locks have the additional overhead of having to load and evaluate on every spin, as opposed to being woken up as with mutexes (more like a callback).
Yes, spin locks require atomic memory operations - as do many other OS primitives. CPUs have dedicated instructions to do this more quickly for locks than generalized coherent memory, for example cmpxchg on x86. There are microarchitectural optimizations to make these things work well.
I've been writing MP code since the early 90's; SPARC, MIPS, x86, PPC, SH-4, Alpha, and all have instructions to make this easier for you.
Spin locks are very useful for locks held for short durations, and only on MP systems, particularly in the kernel, since a spin lock there can deadlock the whole system on a single CPU machine. The very general rule of thumb is that you generally want blocking locks, unless there is a very specific reason to use a spin lock. Those very specific reasons might be waiting briefly on some device in a driver, or running a realtime system where wake latency is critical, etc. Most of the time, spin locks will burn CPU for no good reason, and you still have to implement the non-spinlock case anyway, in case you are running on a 1 CPU system. So, might as well start with blocking locks, and only add spin locks where performance would benefit from it.
It doesn’t flush the entire cache (that would be a disaster) but it does shoot down the cache line containing the lock in all cores other than the one that acquired the lock.
The real issue with spin locks is fairness. There’s no assurance that any given thread will ever make progress. A thread could starve forever. Production-ready mutexes like absl::Mutex make efforts toward fairness, even if they don’t have hard guarantees.
> but it does shoot down the cache line containing the lock in all cores other than the one that acquired the lock.
Well, as long as you do the test-CAS instead of the pure-CAS approach not every loop iteration results in cache line bouncing.
Plus intel has introduced the MWAIT[0] instruction to implement something similar to futex in hardware, i.e. the hyperthread can sleep until another core updates the cacheline in question.
That's true, though MWAIT can only be used from kernel mode. At least the docs say that you get a #UD exception if you attempt to use it from user mode.
Aren't most modern spinlock implementations ticket-based, which ensures fairness among waiters (FIFO-like)? The Linux kernel implementations most definitely are.
This actually does not have to be this way. You could have a linked list of spinlocks, one for each waiter. Each waiter spins on its own, unique spinlock. When the previous waiter is done, it unlocks the next spinlock, and so on. The implementation gets a bit complicated in non-GC languages, since there are races between insertion/removal on the linked list. If the number of threads is fixed and long-lived then it becomes easier, since instead of a linked list you can have an array of spinlocks.
Note: in some architectures (x86?) you could possibly do away with atomics, since (I believe) int updates are atomic by default. Not really sure though.
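With a fixed set of threads, the array variant might look roughly like this (an Anderson-style queue lock sketch; MAX_THREADS and the other names are mine, and at most MAX_THREADS threads may contend at once):

    use std::hint;
    use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

    const MAX_THREADS: usize = 8; // fixed, long-lived set of threads assumed

    #[repr(align(64))] // one slot per cache line, so waiters don't disturb each other
    struct Slot(AtomicBool);

    pub struct ArrayQueueLock {
        slots: [Slot; MAX_THREADS],
        tail: AtomicUsize,
    }

    impl ArrayQueueLock {
        pub fn new() -> Self {
            let lock = ArrayQueueLock {
                slots: std::array::from_fn(|_| Slot(AtomicBool::new(false))),
                tail: AtomicUsize::new(0),
            };
            lock.slots[0].0.store(true, Ordering::Relaxed); // first ticket may enter
            lock
        }

        // Returns the slot index; the caller must pass it back to unlock().
        pub fn lock(&self) -> usize {
            let my = self.tail.fetch_add(1, Ordering::Relaxed) % MAX_THREADS;
            // Spin on *my own* flag only: no shared cache line gets hammered.
            while !self.slots[my].0.load(Ordering::Acquire) {
                hint::spin_loop();
            }
            self.slots[my].0.store(false, Ordering::Relaxed); // reset for reuse
            my
        }

        pub fn unlock(&self, my: usize) {
            // Hand the lock to the next waiter, in FIFO order.
            self.slots[(my + 1) % MAX_THREADS].0.store(true, Ordering::Release);
        }
    }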
> you could possibly do away atomics, since (I believe) int updates are atomic by default.
The release can be a compiler only barrier followed by a simple store, but you do need an atomic RMW in the acquire path.
It is technically possible to implement a lock with just loads and stores (see Peterson lock[1]) even in the acquire path but it does require a #StoreLoad memory barrier even on Intel, which is as expensive as an atomic RMW so it is not worth it.
NetApp has a very interesting implementation of an RW spin lock inside the kernel. I tried optimizing it to make it more fair for writers by introducing a single variable that would be checked in each iteration.
The additional delay this added for checking the state resulted in the deadlock timer triggering a dump! Hence, checking an extra variable is very expensive.
TFA makes the point that modern "mutex" implementations actually use spinlocks first and only fall back to heavy kernel mutexes if there is contention. So the title is click-baity. Mutexes are slower than spinlocks. The "faster than spinlocks" mutexes in this article are actually spinlocks that fall back to mutexes.
Then the benchmark uses spinlocks in situations that spinlocks aren't great for. And, surprise, spinlocks are slower than spinlocks-with-mutexes.
Spinlocks are great in situations such as:
* There are far more resources than threads,
* The probability of actually having to spin is small, ideally when the time spent holding the lock is a few instructions,
* When you can't use optimistic concurrency*
*because perhaps the number of memory locations to track is too complicated for my poor brain and I can't be arsed to remember how TLA+ works
There's plenty of linked list implementations, for example, that use optimistic concurrency. At that point you've got yourself a thread-safe message queue and that might be better than mutexes, too.
Coming as a ruby developer who dabbles in async rust in my free time, these posts/threads/links have been my best read of 2020 so far. My CS curriculum barely covered real-world lock usage, much less the ties to linux/windows schedulers + modern CPU caching. Big thanks all around to these internet commentators.
As the person who added a bunch of functionality to spin-rs, making it roughly match the std API: yes, you should not use spinlocks in scheduled code.
That said, I see why Rust makes things so annoying. I want lots of code to work in no-std so we have a robust embedded and kernel ecosystem. It would be really nice to abstract over the locking implementation with type parameters, but that requires "higher-kinded types", which Rust doesn't have. The alternative is relying on carefully coordinated Cargo features, which is much more flaky and hard to audit (e.g. with the type system). Given that, I am not sure people over-use spin locks.
Thread scheduling (or waking up cores) is slow.
Because of this, mutexes will look better in dumb benchmarks, as the contending threads keep going to sleep, while the single successful owner has practically uncontended access.
There are various degrees of slow, in addition to the kernel being smarter about multiple cores and SMT siblings than your application.
The kernel can run your code on a cooler core, giving it a higher clock, for example, ultimately making it run faster.
Of course this won't show up in a benchmark where all the threads do mostly calculation rather than contention, but that's not the typical case. That mostly shows up in compute such as multithreaded video, where latency does not matter one bit.
Typically you have more of a producer/consumer pattern where the consumer sleeps, and it's beneficial to run it on a cold CPU, assuming the kernel woke it up beforehand.
Source: hit some latency issues with an ancient kernel on a nastily hacked big.LITTLE ARM machine. It liked to overheat cores and put heavy tasks on the overheated ones for alleged power saving. (Whereas running a task quicker saves power.)
That’s what virtually all battle-hardened lock libraries do: spin for a bit (but not too tightly, using a pause in the loop) then fall back to waiter lists and futex.