An Introduction to Lockless Algorithms (lwn.net)
226 points by signa11 on March 4, 2021 | 39 comments


One point that's frequently misunderstood about locks is that there's essentially no cost unless the lock is contended. This leads a lot of people to conclude they must go lockless, when a simpler lock-based design would meet their requirements.

A lot of systems fit neatly into a paradigm of a single threaded hot path, plus a lot of infrequent peripheral threads that only have a low statistical contention cost.

For example, let's say you want to process packets off the network very fast, but you also need a bounded garbage-cleanup thread running periodically in the background. Maybe the thread runs every 30 seconds and takes 30 milliseconds per cycle. A lock won't have any effect on a 99th-percentile SLA, because only 0.1% of inbound packets will experience contention.
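
To make this concrete, here's a minimal sketch of that pattern in C++ (the flow-table names and I/O calls are hypothetical placeholders):

    #include <chrono>
    #include <mutex>
    #include <thread>

    std::mutex table_mutex;  // guards the shared flow table

    void hot_path_worker() {
        for (;;) {
            // Packet p = read_packet();       // hypothetical I/O
            std::lock_guard<std::mutex> g(table_mutex);
            // update_flow_table(p);           // uncontended ~99.9% of the time
        }
    }

    void cleanup_worker() {
        for (;;) {
            std::this_thread::sleep_for(std::chrono::seconds(30));
            std::lock_guard<std::mutex> g(table_mutex);
            // evict_stale_flows();            // holds the lock ~30 ms per cycle
        }
    }

The only packets that ever wait are the ones arriving during the ~30 ms cleanup window, which is where the 0.1% figure comes from.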


> essentially no cost unless the lock is contended.

Atomics are not free. And fetching a cache line can be very expensive, even without contention.

However, lockless algorithms are often worse, because they have even more atomics and even more cache-line transfers. And they're really hard to get right. (IIRC there was a statistic somewhere about the average number of corrections to published lockless algorithms; it wasn't pretty.)

Back to uncontended locks. Usually the way to get uncontended locks is to push the locks down to smaller and smaller pieces of data. But imagine what happens when you do get some contention (e.g. due to a freak hash collision): instead of paying the cache-transfer cost once, you'll pay it many times, and that can really ruin your day (the speed of light is very limiting at nanosecond scales).
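
For illustration, a sketch of that lock pushdown as a striped hash map (the class name and stripe count are made up for the example); the point is that a stripe collision makes two threads bounce the same lock's cache line back and forth:

    #include <array>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    // "Pushing locks down": one lock per bucket group (stripe) instead of
    // one lock for the whole map.
    class StripedMap {
        static constexpr size_t kStripes = 64;
        std::array<std::mutex, kStripes> locks_;
        std::array<std::unordered_map<std::string, int>, kStripes> shards_;

        size_t stripe(const std::string& key) const {
            return std::hash<std::string>{}(key) % kStripes;
        }

    public:
        void put(const std::string& key, int value) {
            size_t s = stripe(key);
            std::lock_guard<std::mutex> g(locks_[s]);  // usually uncontended...
            shards_[s][key] = value;  // ...unless two hot keys land on one stripe
        }

        bool get(const std::string& key, int* out) {
            size_t s = stripe(key);
            std::lock_guard<std::mutex> g(locks_[s]);
            auto it = shards_[s].find(key);
            if (it == shards_[s].end()) return false;
            *out = it->second;
            return true;
        }
    };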

I wrote more about this here [1]

I am somewhat biased, but I would recommend that every time you consider something lockless, you also consider lock elision/HTM (see [2] and [3]). It often has most of the advantages with few of the downsides (but admittedly some quirks of its own).
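
For the curious, a rough sketch of hand-rolled lock elision with Intel's RTM intrinsics (this is not from the linked papers; it assumes TSX-capable hardware, a build with -mrtm, and a deliberately simplified fallback spinlock; production code would retry the transaction a few times before giving up):

    #include <immintrin.h>  // RTM intrinsics; build with -mrtm on TSX hardware
    #include <atomic>

    std::atomic<bool> lock_taken{false};  // deliberately simple fallback spinlock

    void lock_fallback() {
        while (lock_taken.exchange(true, std::memory_order_acquire))
            _mm_pause();
    }

    void unlock_fallback() {
        lock_taken.store(false, std::memory_order_release);
    }

    template <typename F>
    void with_elision(F&& critical_section) {
        if (_xbegin() == _XBEGIN_STARTED) {
            // Read the lock word inside the transaction: if another thread
            // takes the real lock, this transaction aborts instead of racing.
            if (!lock_taken.load(std::memory_order_relaxed)) {
                critical_section();
                _xend();    // commit: no write to the lock's cache line at all
                return;
            }
            _xabort(0xff);  // lock is held for real; abort and fall through
        }
        // Aborted (conflict, capacity, interrupt, held lock): take the real lock.
        lock_fallback();
        critical_section();
        unlock_fallback();
    }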

[1] http://halobates.de/applicative-mental-models.pdf
[2] http://queue.acm.org/detail.cfm?id=2579227
[3] http://www.intel.com/software/tsx


> I am somewhat biased but I would recommend that every time you consider something lockless, consider using lock elision/HTM instead

Intel shipped two generations of processors with broken TSX before getting it right, and then timing attacks seriously called into question whether it was a safe feature to use in practice. Then Intel dropped it from their first post-Skylake microarchitecture products, though it looks like they're still planning to support it on their next generation of server CPUs. Is it really worth the trouble at this point?


Yes it totally is.

How much confidence do you have in your lockless code?


It's not about confidence in the code. It's about confidence in having hardware that supports TSX, and will also support TSX next year. As far as I can tell, you still need to ensure that you have correct and reasonably efficient fallback code for platforms that don't or no longer have TSX, unless you're working on a project that can get away with very narrow system requirements.

Put simply, TSX is not mainstream, and isn't on track to be mainstream anytime soon.


> Put simply, TSX is not mainstream, and isn't on track to be mainstream anytime soon.

Being in the vast majority of servers deployed in the last decade (Intel+POWER) is not mainstream?


Intel's only been shipping server CPUs with non-defective transactional memory for about five years, and a good chunk of those servers have it disabled for security reasons, or did at some point pending better fixes. IBM has dropped transactional memory from their latest POWER processors. Even if we only look at server CPUs, this does not look like a thriving ecosystem.


I reach for lockless not when I fear contention, but when I have unbounded lock hold times. If there's no contention 98% of the time and a 500 ms max wait another 1% of the time, I would happily use locks. But if the last 1% is unbounded (depends on a hard disk, the network, user input, etc.), that's when I try to redesign for an essentially bounded wait time, or reach for a lockless implementation.


And even more misunderstood is the cost of exchanging state between CPU cores. A lockless algorithm can be just as slow as (or sometimes even slower than) a lock-based design, simply because of contention at the hardware level.


Yes, lock-free data structures are most useful in real-time contexts where you simply cannot ever block, for example real-time audio.


Is this a latency- or throughput-related point?


I've come to dread lockless algorithms. They are hard to get right. Most places I see lockless algorithms used could be replaced with asynchronous OS I/O handling and run single-threaded.


> For example let's say you want to process packets off the network very fast. But you also need some constant bounded garbage cleanup thread running periodically in the background

Sounds like a good use case for RCU


pthread_join is a blocking call, and therefore against the spirit of lockless programming. pthread_join will block until the child thread exits, which means there's a lock in there somewhere. I personally think the first example they show is a terrible example as a result.

------------

However, the smp_store_release and smp_load_acquire discussion is good... very good. But I'm not sure the article gives enough theory to explain why this is required.

The gist is: L1 caches on POWER9 or ARM are allowed to reorder loads/stores.

    thread 1                            
    --------------------------------  
    a.x = 1;
    message = &a;                    
                                      
"a.x = 1" should happens-before "message = &a". However, L1 caches do NOT always follow this logic. "message=&a" could happen first from the perspective of DDR4 RAM.

How? Why? Well, maybe the cache line holding message is written back first (say, it's least-recently-used) while the thread continues to read and write the a.x variable. So when the cache later updates DDR4 RAM, message = &a happens-before a.x = 1 (from the DDR4 RAM perspective).

This never happens on x86 (x86 cores preserve all store orderings), but it is possible on ARM or POWER9. (There are other sources of reordering too: out-of-order execution in the CPU, load/store forwarding within a core, and compiler optimizations.)

It turns out that relaxing the total-store-ordering requirement allows L1 caches and CPU cores to operate slightly faster. Thanks to aggressively multicore processors and weakly ordered architectures (like ARM or POWER9), we have to deal with this non-obvious ordering violation.

---------

The barrier forces the happens-before relationship to be preserved. You need a barrier on both sides: the write side (release) and the read side (acquire).
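
In C++11 terms (the article's smp_store_release()/smp_load_acquire() kernel macros express the same semantics), the two-thread example above becomes roughly:

    #include <atomic>
    #include <cassert>
    #include <thread>

    struct Data { int x = 0; };

    Data a;
    std::atomic<Data*> message{nullptr};

    void thread1() {
        a.x = 1;                                       // plain store
        message.store(&a, std::memory_order_release);  // store-release: everything
                                                       // above is visible first
    }

    void thread2() {
        Data* p;
        while ((p = message.load(std::memory_order_acquire)) == nullptr)
            ;                                          // load-acquire pairs with the release
        assert(p->x == 1);                             // guaranteed to see a.x = 1
    }

    int main() {
        std::thread t2(thread2), t1(thread1);
        t1.join(); t2.join();
    }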

---------

Anyway... "happens-before" is about as bloody simple as you think it is. The issue is that "happens-before" is now violated on a regular basis on modern CPUs in multithreaded environments.

Forcing "happens-before" relationships to happen in the expected order is now a requirement for multithreaded programmers.

Note: all lock() and unlock() functions innately have the proper happens-before relationships in the right places, so a typical programmer who uses mutex_lock() never has to worry about this reordering. All synchronization mechanisms (locks, semaphores, condition variables, etc.) need to be carefully written with these acquire/release barriers to work on modern architectures.

However, if you wish to step into the realm of high-performance lock-free algorithms, you must absolutely understand the modern processor, and how it violates happens-before relationships. (as well as understanding the memory barriers that RESTORE the happens-before relationship).


Bruce Dawson recently put a bit of a different spin on things with his explanation: https://randomascii.wordpress.com/2020/11/29/arm-and-lock-fr...

> If you have circa-2005 code without these memory barriers then your code is broken, and has always been broken, on all CPUs, because compilers have always been allowed to rearrange writes. With these memory barriers (implemented as needed for different compilers and platforms) your code is beautiful and portable.

> It turns out that ARM’s weak memory model really doesn’t make things any more complicated. If you are writing lock-free code and not using any sort of memory barriers then your code is potentially broken everywhere due to compiler reordering. If you are using memory barriers then it should be easy to extend them to include hardware memory barriers.


The 00s were all about this multithreaded/multi-CPU debate. For Java programmers, the solution was to establish memory ordering at the language level (which then defined which optimizations a Java compiler was, or was not, allowed to do).

C++ programmers chose performance, though it wasn't until C++11 that the memory model was fully standardized. Indeed, ARM CPU designers were still arguing for a weaker model than acquire/release: they wanted consume/release semantics instead. (Consume/release is "weaker": it allows more reorderings, and therefore more opportunities for higher-performance CPUs and/or code optimizations.)

Java's sequentially-consistent model is too strong (not enough optimizations available) for most programmers and most situations. Consume/release is in the C++ standard, but few people understand it. Acquire/release is the model that seems to be winning out after all these years.

As a result, ARM and POWER are updating their CPUs to better provide acquire/release semantics (as the consume/release model of ARMv7 is simply not as popular).


A useful memory_order_consume is essentially unimplementable, to the point where the C++ standard itself says to not use it: https://eel.is/c++draft/atomics.order#1.3


> ARM’s weak memory model really doesn’t make things any more complicated. If you are writing lock-free code and not using any sort of memory barriers then your code is potentially broken everywhere due to compiler reordering.

At the risk of mirroring dragontamer's comment, this figures. As that linked blog post explains, if you're programming in (modern) C++, you're writing for the standard C++ memory model, not the memory model of the target CPU architecture.

Perhaps a weak memory model will make things interesting for the compiler team, but that's not your concern. Perhaps there could be performance consequences, but that's a different matter than writing broken code.


> pthread_join is a blocking call, and therefore against the spirit of lockless programing. pthread_join will block until the child-thread exists, which means there's a lock in there somewhere. I personally think the first example they show is a terrible example as a result.

But they didn't use it as an example of lockless programming, did they? They were explaining what "acquire" and "release" mean, and they wanted to explain it in terms the reader should already be familiar with.

In fact, they explicitly say it's not lockless:

    In the previous paragraph we have seen how the acquire and release semantics of pthread_create() and pthread_join() allow the creator of a thread to exchange information with that thread and vice versa. We will now see how this kind of communication can happen in a lockless manner while threads run.


But isn't there also the issue that the compiler could reorder (and not just the hardware)? Edit: just noticed you also mentioned that


This is now fundamental, apparently. I was failed out of an interview because I wasn't able to write a lockless linked list using atomics.


I'm curious where this was? Unless you were interviewing for a very specific position that listed lockless programming as a requirement, or you claimed to have 'very advanced multithreading' experience, I can't see this being reasonable.


Self-driving car company


It must be noted that if self-driving cars can't rely on library implementations of locking/lockless structures, they're even further from reality than it seems.

I feel like for all these questions the "correct" answer should be "I would never waste company time by faffing around reading papers about implementing lockless structures when I could be solving an actual business problem."


When you're dealing with low-latency real-time systems (I've worked with audio programming but not embedded systems, and self-driving cars are even more safety-critical), lock-free algorithms can be useful for solving the business problem of guaranteed maximum response times. (The issue is that lock-free algorithms are easy to design/implement incorrectly, resulting in heisenbugs which could be dangerous in a self-driving car.)

Many ad-hoc queues/channels available in pre-existing libraries contain locks, and are unsuited for real-time applications. Even among lock-free data structures, there are many tradeoffs (fixed-size ring buffers vs. dynamically allocated linked lists, SPSC, MPMC, and everything in between, or even using triple buffering instead of queues). Sometimes it's worth implementing an algorithm yourself, instead of scouring the Internet for suitable libraries. Though I don't know if an embedded RTOS's compiler libraries will supply headers for lock-free algorithms, or make you find your own.
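
As an illustration of the fixed-size SPSC case mentioned above, a minimal single-producer/single-consumer ring buffer in C++ (a sketch, not production code):

    #include <atomic>
    #include <cstddef>

    // Single-producer/single-consumer ring buffer; capacity must be a
    // power of two. Neither side ever blocks: push/pop fail fast instead.
    template <typename T, size_t N>
    class SpscRing {
        static_assert((N & (N - 1)) == 0, "N must be a power of two");
        T buf_[N];
        std::atomic<size_t> head_{0};  // written only by the producer
        std::atomic<size_t> tail_{0};  // written only by the consumer

    public:
        bool push(const T& v) {        // producer side
            size_t h = head_.load(std::memory_order_relaxed);
            if (h - tail_.load(std::memory_order_acquire) == N)
                return false;                               // full
            buf_[h & (N - 1)] = v;
            head_.store(h + 1, std::memory_order_release);  // publish the slot
            return true;
        }

        bool pop(T* out) {             // consumer side (e.g. the audio thread)
            size_t t = tail_.load(std::memory_order_relaxed);
            if (t == head_.load(std::memory_order_acquire))
                return false;                               // empty
            *out = buf_[t & (N - 1)];
            tail_.store(t + 1, std::memory_order_release);  // release the slot
            return true;
        }
    };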


I mean, I can do it, I've done it... but yikes. Where did you interview at, and for what kind of position?


I'm probably not the only one wondering why this is hard. Why not just declare prev/next as atomic pointers, CAS to delete, copy-and-CAS to insert, and call it a day?

Turns out that doesn't work if you want to delete a node while another thread is inserting a node in that spot: https://en.wikipedia.org/wiki/Non-blocking_linked_list#Probl...
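
A sketch of that failure mode in C++ (node and function names are made up for the example); each CAS is individually correct, but nothing stops the two operations from interleaving:

    #include <atomic>

    struct Node {
        int value;
        std::atomic<Node*> next{nullptr};
    };

    // Naive insert after `prev`: retry the CAS until it sticks.
    void insert_after(Node* prev, Node* node) {
        Node* succ = prev->next.load();
        do {
            node->next.store(succ);
        } while (!prev->next.compare_exchange_weak(succ, node));
    }

    // Naive unlink of the node after `prev`. The CAS happily succeeds even
    // while another thread is inserting *after the victim*: that thread's
    // new node ends up hanging off the unlinked victim and is silently lost.
    void remove_after(Node* prev) {
        Node* victim = prev->next.load();
        if (victim == nullptr) return;
        Node* succ = victim->next.load();
        prev->next.compare_exchange_strong(victim, succ);
        // (Freeing `victim` is a further problem: real designs need deletion
        // marks plus hazard pointers, RCU, or similar reclamation schemes.)
    }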


It's not hard, but I've never used CAS intrinsics.


Can you say more? Lockless linked lists are unreasonably hard for an interview problem.


Thanks Paolo!


> For this to happen, the store and load must be endowed with release and acquire semantics respectively. To that end, we have the "store-release" and "load-acquire" operations.

I suppose release and acquire semantics are baked into Rust as the language's "ownership" concept. A variable is moved (release) out of the calling scope and the function takes ownership (acquire) at the function call, which synchronizes the symmetric operations. In the pthread_create/pthread_join example, thread 2 is borrowing the variable "s".


Sort of. Rust actually has special marker traits called Send and Sync which tell the compiler whether a value can be passed between threads and, if so, whether it can be accessed concurrently. It is the job of a type which implements Send/Sync to guarantee that any access to that type satisfies what the trait implies, as well as the rest of Rust's ownership and borrow semantics.

For example, the difference between Rc and Arc is precisely this marker trait. Rc just holds a box with a reference count and some other object in it, while Arc actually enforces atomic access to the reference count. You can't hold an Rc in multiple threads because it doesn't Sync accesses, and you can't Send it across threads because someone else might hold the same box. Meanwhile, neither type provides mutable access to its contents at all: you need another locking primitive to ensure the contents of the box follow ownership and borrow semantics as the compiler intended, if only by punting the check to runtime.

(And yes, that means Rust technically has single-threaded locks. They're still useful, albeit extremely painful to work with.)


Acquire/release semantics restrict reordering of other loads and stores. That is, in `a=1; b.store(2, release);` the release operation on b affects the previous store to a. Rust's ownership model cannot express this, as it treats separate variables as independent.

Your analogy to borrowing is correct in the pthread case, because it has no race. But lockless programming typically has races, and I don't think it's useful to think of it in terms of ownership.


Acquire/release semantics are an invention. They did not exist in ARMv7. I don't know the exact history of when or where that model became popular.

When I was in college in the mid 00s, reorderings were largely discussed in terms of "load/load", "load/store", "store/load" and "store/store" reorderings.

Turns out that thinking of things in terms of "load/load", "load/store", "store/load" and "store/store" is really hard and kind of unhelpful.

----------

I know that by C++11, acquire/release was finally formalized as a concept (maybe the C++11 committee borrowed the idea from somewhere else). I know the C++11 committee was heavily influenced by Java's memory model.

From my perspective, "acquire/release" semantics became popular BECAUSE RAII is popular in C++. RAII in C++ became ownership in Rust.

So in a sense, acquire/release is a cousin of ownership and RAII. But acquire/release was invented to solve a problem: a way to discuss reorderings without having to think about the awful load/load or store/store model that was popular in the 00s and 90s.

--------

Look at the memory fence instructions that were written long before C++11.

https://www.felixcloutier.com/x86/lfence

> Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

------

Now compare it to memory fence instructions written after C++11, like LDAR (ARM64)

> An LDAR instruction guarantees that any memory access instructions after the LDAR, are only visible after the load-acquire. A store-release guarantees that all earlier memory accesses are visible before the store-release becomes visible and that the store is visible to all parts of the system capable of storing cached data at the same time.

Formalizing the concept of "acquire" vs "release" really makes these ordering ideas more concrete to work with. The "acquire" barriers are the part needed when entering a critical section, and "release" barriers are the bit needed when leaving a critical section.
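
A toy spinlock makes the placement concrete (a sketch in C++; std::atomic_flag is the portable primitive here):

    #include <atomic>

    std::atomic_flag locked = ATOMIC_FLAG_INIT;

    void lock() {
        // Entering the critical section: the acquire keeps the section's
        // loads/stores from floating up above the lock.
        while (locked.test_and_set(std::memory_order_acquire))
            ;  // spin
    }

    void unlock() {
        // Leaving: the release keeps the section's loads/stores from
        // sinking down below the unlock.
        locked.clear(std::memory_order_release);
    }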

--------

No, it's not identical to Rust ownership or RAII from C++, but it's kinda-sorta similar: similar enough to be more comfortable to work with. More comfortable than the "loads are ordered" (lfence) and "stores are ordered" (sfence) instructions, at least.


I just can't see any connection between RAII and acquire/release. For example, both C and the Linux kernel have acquire/release; neither has RAII.

I think the terms "acquire" and "release" came from their connection to locking. Linux even used to call acquire and release fences "lock" and "unlock"; here's where it got changed: https://lore.kernel.org/linux-arch/20131217092435.GC21999@tw...


I thought "acquire" just meant "a load cannot be reordered before this barrier" and "release" just meant "a store cannot be reordered past this barrier"?


> "acquire" just meant "a load cannot be reordered before this barrier"

I'm pretty sure it's "loads/stores cannot be reordered before this barrier". And technically, it's "load-acquire", because the load-acquire is what the ordering is relative to. The C++ model does not match the assembly language of various architectures (though ARM64 was designed with C++11's memory model in mind).

Ditto with store-release: "loads/stores cannot be reordered past the store-release".

---------

    // Writing to protectedValue1 must occur after the lock.
    lock(); // load-acquire is part of all locks
    protectedValue1 = foobar();
    localVariable = protectedValue2;
    unlock(); // store-release is part of all unlocks
    // Reading protectedValue2 must occur before the store-release. If the protectedValue2 read occurs here, then some other thread may modify it.
------------

Citation: LDAR and STLR from ARM64, which apply to "any memory access".


That, plus the synchronization effects of the two operations on the same address. It’s important to note that another thread which doesn’t access the same atomic with the appropriate ordering can observe an inconsistent state, even if it puts a full fence between each non-atomic operation (on architectures with weak memory models).


I don’t think the intuition you get from Rust’s ownership or general RAII semantics will serve you well for understanding acquire and release semantics. Fundamentally they are about memory accesses other than the atomic operation they are applied to. And the scope of this relationship is not directly specified: it’s “all stores before this point” or “all loads after this point”. I don’t see what the analogue in RAII would be; RAII refers to a single object.



