You could have invented futexes (tavianator.com)
144 points by dmarto on April 26, 2023 | 53 comments


It's more like UNIX and Linux being way behind on threading technology. User space mutexes were in QNX over two decades ago. And on the UNIVAC 1108 half a century ago.[1] Here's an implementation from 1972, by John Walker.

    .         DIJKSTRA P FUNCTION
    .
    .
    .         LA,U      A0,<QUEUE>
    .         LMJ       X11,P
    .         <RETURN>                      X5 DESTROYED
    .
    P*        TS        QHEAD,A0            LOCK THE QUEUE
              LX        X5,QN,A0            LOAD QUEUE COUNT
              ANX,U     X5,1                BACK UP THE COUNT
              SX        X5,QN,A0            REPLACE THE COUNT IN THE QUEUE
              TN        X5                  DO WE NEED TO DEACTIVATE HIM ?
              J         PDONE               NO.  SKIP DEACTIVATION
              ON        TSQ=0
              LX        X5,QHL,A0           LOAD BACK LINK OF QUEUE
              SX        X5,QHL,X4           PUT INTO BACK LINK OF ACTIVITY
              SX        X4,QFL,X5           CHAIN ACTIVITY TO LAST ACTIVITY
              SA        A0,QFL,X4           CHAIN HEAD TO NEW ACTIVITY
              SX        X4,QHL,A0           MAKE THE NEW ACTIVITY LAST ON QUEUE
              CTS       QHEAD,A0            RELEASE PROTECTION ON QUEUE HEAD
    SCHDACT*  DACT$     .                   DEACTIVATE PROCESS
              OFF
              ON        TSQ
              C$TSQ     QHEAD,A0            WAIT FOR C$TSA
              OFF
              J         0,X11               RETURN AFTER ACTIVATION
    .
    PDONE     CTS       QHEAD,A0            UNLOCK THE QUEUE
              J         0,X11               RETURN
And that followed Dijkstra's paper, published in Dutch in the late 1960s.

[1] https://www.fourmilab.ch/documents/univac/fang/


You can fast-path a userspace mutex (or any blocking primitive) on top of pretty much any blocking system call (semaphores, pipes, poll, sigwait, whatever). The nifty thing with futexes, and hashed waitqueues in general, is that they consume kernel resources only when actively being waited on. It is not obvious (though certainly possible) that this is the case with the UNIVAC implementation above.
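
For concreteness, here's a minimal sketch of that fast path in futex form, following the classic three-state scheme (0 = unlocked, 1 = locked, 2 = locked with waiters) from Drepper's "Futexes Are Tricky". Linux-only, error handling omitted:

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static long futex(uint32_t *uaddr, int op, uint32_t val) {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    typedef struct { _Atomic uint32_t state; } futex_mutex;

    void mutex_lock(futex_mutex *m) {
        uint32_t c = 0;
        /* Fast path: an uncontended lock is a single CAS, no syscall. */
        if (atomic_compare_exchange_strong(&m->state, &c, 1))
            return;
        /* Slow path: mark the lock contended and sleep; the kernel
         * re-checks that state is still 2 before actually sleeping. */
        if (c != 2)
            c = atomic_exchange(&m->state, 2);
        while (c != 0) {
            futex((uint32_t *)&m->state, FUTEX_WAIT, 2);
            c = atomic_exchange(&m->state, 2);
        }
    }

    void mutex_unlock(futex_mutex *m) {
        /* Fast path: if nobody was waiting (state was 1), no syscall. */
        if (atomic_exchange(&m->state, 0) == 2)
            futex((uint32_t *)&m->state, FUTEX_WAKE, 1);
    }

The kernel's wait queue for this lock exists only while some thread is inside FUTEX_WAIT, which is exactly the resource-consumption property described above.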


Yup. The fast path trick was well known; BeOS made a big deal of its Benaphore, which is exactly this trick applied to its OS semaphore.

Exactly as you said, the problem with a Benaphore is that if your program wants 100 Benaphores, you're creating 100 OS-wide semaphores, even if your program literally never experiences contended locking. That's 100 units of a scarce OS resource used up just in case. If you ask for one million Benaphores, the program won't run: the entire OS only allows (IIRC) 65536 in total.

Putting together this fast path + the wait queue idea = a futex, very fast yet very cheap.
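
A Benaphore is only a few lines; here's a sketch using a POSIX semaphore standing in for BeOS's sem_id (names are illustrative):

    #include <semaphore.h>
    #include <stdatomic.h>

    typedef struct {
        _Atomic int count;  /* threads that want (or hold) the lock */
        sem_t sem;          /* kernel semaphore, touched only under contention */
    } benaphore;

    void ben_init(benaphore *b) {
        atomic_init(&b->count, 0);
        sem_init(&b->sem, 0, 0);  /* starts at 0: nobody may pass yet */
    }

    void ben_lock(benaphore *b) {
        /* Fast path: incrementing 0 -> 1 takes the lock, no syscall. */
        if (atomic_fetch_add(&b->count, 1) > 0)
            sem_wait(&b->sem);  /* contended: sleep in the kernel */
    }

    void ben_unlock(benaphore *b) {
        /* If someone else incremented while we held the lock, wake one. */
        if (atomic_fetch_sub(&b->count, 1) > 1)
            sem_post(&b->sem);
    }

Note that the sem_t exists for the whole lifetime of the lock, whether or not it is ever contended -- the per-lock kernel cost described above, and exactly what futexes avoid.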


[too late to edit]

From the code comments, it is likely that UNIVAC is using something like Solaris park/unpark and the queue is fully in userspace.


> User space mutexes were in QNX over two decades ago

Futexes were in Linux over two decades ago: https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pd... :)

And before that I believe glibc did something similar to what I described in the post, using signals


Note that futexes aren't especially new to Linux anyway. They first appeared in kernel version 2.6, released in 2003.


Hashing an address to map to a wait queue was done by early Unix kernels.

The address was called a "wait channel" (wchan for short).

There was no object declared at the address; any object or function could be waited on.

The ps utility still has a field called "wchan" you can select, which tells you where the process is waiting.

Hashing is basically a way to add properties to an object (identified by an address) without physically extending the structure located at that address.

We can give any address a wait queue, for instance.

E.g. Javascript object properties are like this. If we have some futex object in Javascript, and give it a futex.waitQueue property, it's basically the same: the futex thing is hashed somehow to access that property.
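
For instance, a hashed wait-queue table along those lines can be sketched in a few lines of C (structure and function names here are illustrative, not the kernel's actual ones):

    #include <stdint.h>

    #define NBUCKETS 256  /* power of two for cheap masking */

    struct waiter {
        const void *addr;     /* the address being waited on */
        struct waiter *next;  /* chain: a bucket may serve many addresses */
        /* ... per-thread wakeup state ... */
    };

    struct wait_bucket {
        /* a lock protecting the chain would go here */
        struct waiter *head;
    };

    static struct wait_bucket table[NBUCKETS];

    static struct wait_bucket *bucket_for(const void *addr) {
        /* Mix the address bits so nearby objects spread across buckets. */
        uintptr_t h = (uintptr_t)addr;
        h ^= h >> 16;
        h *= 0x45d9f3bu;
        return &table[h & (NBUCKETS - 1)];
    }

No structure at `addr` grows by a byte; the wait queue is a property attached purely by hashing the address.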

The main point of futexes is efficiency in the face of user/kernel separation. Futexes are intended to be used in a double-checked pattern, avoiding a call into the kernel. User space can check the state of the futex memory location and avoid calling wait. The wait operation in the kernel will do the same check again (this time atomically), possibly avoiding the wait. If that weren't done, there would be a lost-wakeup problem in the use of mutexes.
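
That double-checked pattern, sketched for a one-shot flag (Linux, error handling omitted):

    #include <limits.h>
    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static _Atomic uint32_t ready = 0;

    void wait_for_ready(void) {
        /* Check #1, in user space: skip the syscall if already set. */
        while (atomic_load(&ready) == 0) {
            /* Check #2, in the kernel: FUTEX_WAIT sleeps only if the
             * word still equals 0, otherwise it returns EAGAIN at once.
             * This atomic re-check is what prevents lost wakeups. */
            syscall(SYS_futex, &ready, FUTEX_WAIT, 0, NULL, NULL, 0);
        }
    }

    void set_ready(void) {
        atomic_store(&ready, 1);  /* publish first, then wake */
        syscall(SYS_futex, &ready, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
    }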

Futexes are fast in the uncontended case; hence the F. Fast means in the context of the expensive user/kernel boundary. You would never need futexes without such a boundary; they are not used in the kernel. Futexes inside a single protection space are indeed a non-invention; the invention is the idea that something outside of the protected space can inspect the value and act on it.


> If we have some futex object in Javascript, and give it a futex.waitQueue property, it's basically the same: the futex thing is hashed somehow to access that property.

I’m not sure that’s correct; it’s “waitQueue” that is hashed. In a super naive implementation, the attributes are just a hash map.


Just a reminder that glibc / pthread has a known futex-related threading bug in the wild that has been open for three years: https://sourceware.org/bugzilla/show_bug.cgi?id=25847

It was discussed here two years ago: https://news.ycombinator.com/item?id=24958504

But only recently do I believe I ran into it in the wild, when testing some high-concurrency multithreaded code for several days. It would run a few hundred million operations successfully and then hang, CPU busy, in pthread_cond_wait().


I have been trying to come up with an efficient but simple and correct condition variable implementation based on futexes for the last couple days and it's surprisingly tricky.
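
For what it's worth, the usual starting point is a sequence-counter sketch like the one below. It shows the shape of a futex condvar but is deliberately naive: no requeueing, no handling of counter wraparound or stolen wakeups, and none of the POSIX details the reply below lists.

    #include <linux/futex.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    typedef struct { _Atomic uint32_t seq; } futex_cond;

    void cond_wait(futex_cond *c, pthread_mutex_t *m) {
        uint32_t seq = atomic_load(&c->seq);  /* snapshot before unlocking */
        pthread_mutex_unlock(m);
        /* Sleeps only if no signal has bumped seq since the snapshot;
         * spurious wakeups are possible, which condvars are allowed. */
        syscall(SYS_futex, &c->seq, FUTEX_WAIT, seq, NULL, NULL, 0);
        pthread_mutex_lock(m);
    }

    void cond_signal(futex_cond *c) {
        atomic_fetch_add(&c->seq, 1);  /* invalidate outstanding snapshots */
        syscall(SYS_futex, &c->seq, FUTEX_WAKE, 1, NULL, NULL, 0);
    }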


Especially if correct means fully POSIX conforming, including realtime scheduling, priority inheritance, cross-process support, robust mutex support, cancellation, etc.


Oh, blast from the past!

The original futex code used physical addresses, and pinned the page, which is simple: if you can share memory somehow, you can use a futex.

Of course, most people use pthreads, and increasingly the idea of a Linux userspace locking primitive gave way to a glibc-pthread accelerator...


> if you can share memory somehow, you can use a futex.

If you can share memory and you have coherent caches you can use a futex. (In particular, many embedded systems have weaker cache coherency than x86, so "it works on my laptop" is not enough!)

In the worst case, without coherent caches you may need to do a cache shootdown IPI, which means entering the kernel anyway.


Um, does Linux run on such systems? There's a broad assumption that atomic ops work efficiently (or at least, that was true when I dropped out 7 years ago...)


I'm not the right person to ask about this, but some ARM systems definitely have weaker coherence than x86 -- in FreeBSD we have a whole bunch of memory barrier primitives which compile away to nothing on x86 because they exist only for weaker platforms.

I have a vague recollection that it's something to do with whether caches snoop the memory bus for reads of lines they "own" but I could be mistaken. Whatever it was, there were cases where buggy code meant that a core could read a stale value from memory for multiple seconds after another core wrote to the same address.


The terminology is that x86 has a Total Store Order memory model, while ARM doesn't; acquire operations on x86 imply that all relaxed store operations on the core that released are also visible to the reader, which is not true on ARM unless the released value has a data dependency, or until you execute a memory barrier.

ARM still has a coherent cache, however. Basically every modern OS and program depends on having a coherent data cache (though ARM doesn't keep i-cache and d-cache coherent with each other, which basically only comes into play with self-modifying code).
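
A small C11 illustration of that difference: on x86 (TSO) both of these orderings compile to plain loads and stores, while on ARM they emit the barriers the hardware needs.

    #include <stdatomic.h>

    int payload;        /* plain, non-atomic data */
    _Atomic int flag;

    void producer(void) {
        payload = 42;
        /* Release: everything written above is visible to any thread
         * that later observes flag == 1 with an acquire load. */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    int consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;  /* spin until the release is observed */
        return payload;  /* guaranteed to read 42 */
    }

The explicit orderings also stop the compiler from reordering the stores, which matters on every architecture, TSO or not.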


Exactly. And futexes of course only require coherent caches, not a strongly ordered memory model. In fact x86 (i.e. TSO) is weaker than some.


The details are hazy for me, but while all relevant CPUs have coherent caches, not all make the same ordering guarantees.

x86 has "total store ordering", meaning stores made by core 1 will always be observed in-order by core 2. ARM doesn't make that guarantee.

In practice it doesn't matter for writing correct programs unless you write assembly: even if the CPU has total store ordering, the compiler is allowed to reorder stores unless you put an appropriate barrier in the high-level language source.


Linux does not work on systems without coherent caches between CPU accesses (some I/O device incoherence is allowed). No cache-shootdown IPI like that is required[1].

ARM and x86 both can delay stores behind later loads. "Timeliness" of when stores might become visible is almost never specified exactly by any ISA or implementation (maybe some real-time CPUs, but Linux does not depend on such), but you will never get into the situation where a memory operation reads "stale" data beyond a fairly strict specification of memory consistency.

ARM does have some weaker ordering than x86, but this is all about how operations within a single CPU/thread behave. An ARM CPU can perform two loads out of order, and perform two stores out of order (with respect to what other CPUs can observe). Use barriers to order those, and you have ~same semantics as x86, and those barriers don't need to "reach out" on the interconnect or to the caches of other CPUs. Once the data leaves your store queues, both x86 and ARM, all other CPUs will see the result, because for it to be accepted into coherent caches, all other copies in other caches need to be invalidated first.

[1] Linux does work on some CPUs where instruction caches and/or the instruction fetch/execution pipeline are not coherent with data operations so there can be some situations on some CPUs where code modification may need to send IPIs to other CPUs to flush their instruction caches or pipelines, so speaking purely about memory data operations and data caches above.


Omg the multi-threaded code examples are amazing. The active line is highlighted. Amazingly illuminating, so so good.


Question: do people find that sort of animation easier to grok than showing the interleaving visually, as I do at https://outerproduct.net/boring/2023-01-27_trans-locks.html#...? I prefer the latter, because it lets me see everything at once, but may switch to the former presentation style if other people prefer it.


Not that it would apply in this particular case, but my all-time favorite visualization of concurrency is this 2016 blog post called "Visualizing Concurrency in Go" [0]. So good.

[0] https://divan.dev/posts/go_concurrency_visualize/


Using time to represent time is a pretty helpful & intuitive strategy, very nicely on display in this example & the submission.

It does require some waiting around to see how things develop (versus the at-a-glance approach) but it really feels so very graspable & clear to me.


I think for anything more complicated than my examples, animations would be worse than your explicit interleavings.

That said, I don't really think interleavings are the best way to show weak memory effects either, since they give the impression that there is some global order that is the same in all threads. I'm not sure how to do a nice visualization of the reality though.

One idea: show the interleaving of atomic operations, and based on that, show what values each of the other reads might see. Maybe even let the user change the interleaving by dragging stuff around, and show how it changes which other values can be observed.


FWIW coming from some random stranger on the internet, I like the animations. They're short enough that even though all the information isn't on the screen at once, I can remember/interpret the whole thing after it's run once. I could also imagine for demonstrating longer-running bugs (of more than just a few instructions) a user-controlled slider might be nice.


The answer is almost certainly yes. For both directions of the question. Few things have a single visualization that is the best.


Apparently Solaris uses park() to the same effect. I can't readily find a history of either. Anyone know which came first, Solaris or Linux? In part I'm reacting to @Animats' statement that UNIX is behind.


Being an old OS hacker of that day, I was around for the release of both:

Linux was 1991, a minimal, monolithic, single-arch, student-level kludge made from studying SunOS API calls and the MINIX project.

Solaris was 1993... a successor to SunOS and a turn away from the SunOS BSD roots towards UNIX System V Release 4.

Both Linux and Solaris had deep prior-art roots in the "Unix-like" sphere; many ideas originally cropped up in one system but only really gained traction somewhere else.

(Other ideas of that time, such as capability-based permissions, never really took off until decades later.)


Sorry, I didn't mean when did Linux and Solaris first see light of day.

I meant, when did the futex (linux) and park() (Solaris) functionality become available in both? Some Linux things have an [uncredited] history as a Solaris feature first.


<doh> my bad, skimming too fast, reading too little.

To the best of my recollection, with some links I see that back this:

Futex appeared in Linux circa 2002 (late kernel 2.5 series); (Net)BSD _lwp_park() appeared circa 2009, but ... I have an inkling it popped up earlier in sources I don't have online access to right now (archived drives | tapes in a shed some distance away).

This post 2000 era is right when I moved away from heavy Sun workstation usage and low level tinkering and into other domains.


> capability based permissions, never really took off until decades later

Didn't all the unices of the day have C2-secure (and even tighter) "orange book" versions? Windows as well, I believe, and definitely some other unices. That's so long ago I don't recall, but didn't starting at C2 require MAC? And that implies some kind of privilege system.

Solaris has had privileges (analogous to Linux capabilities) since 2005. Looks like Linux has had them since 1999, but of more limited scope until 2008, when they were extended.

So I'd say "a decade", not "decades". ;)


I got into both Linux and SunOS around 1996. At the time, I got the impression (perhaps correctly so) that Linux was a "play thing" and mostly good for small businesses, while SunOS was the only thing that would properly scale to the type of volume an Internet business could expect.


As far as I can tell they are not the same abstraction.

park/unpark simply remove the thread from the run queue and require userspace to keep the waitqueue and identify which thread to wake up. The semantics of futex are that the kernel will maintain the waitqueue. The advantage is that futexes can be used across processes, not just threads, and are possibly easier to use; on the other hand, high-performance signaling primitive implementations probably want to maintain the waitqueue in userspace anyway, so park/unpark might be a better, if lower-level, primitive.
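
To make that division of labor concrete, here's a sketch of Solaris-style park/unpark emulated on Linux: each thread blocks on its own per-thread word, and the user-space lock or condvar keeps the queue of parkers and decides whom to wake (names are illustrative):

    #include <linux/futex.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* One of these per thread; a lock or condvar keeps its own
     * user-space queue of them and picks which one to wake. */
    struct parker {
        _Atomic uint32_t permit;  /* 1 = a wakeup is pending */
        struct parker *next;      /* link for the user-space wait queue */
    };

    void park(struct parker *p) {
        /* Consume a pending permit, or sleep until unpark grants one;
         * an unpark that happens first is never lost. */
        while (atomic_exchange(&p->permit, 0) == 0)
            syscall(SYS_futex, &p->permit, FUTEX_WAIT, 0, NULL, NULL, 0);
    }

    void unpark(struct parker *p) {
        atomic_store(&p->permit, 1);
        syscall(SYS_futex, &p->permit, FUTEX_WAKE, 1, NULL, NULL, 0);
    }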


I actually meant setpark but was being a little too concise. Yes park/unpark without setpark is not the same thing. But does setpark not offer the same functionality as futexes? I can find various references, university coursework mostly, that either imply or outright say these are the same thing.

That said, I was completely unaware that futex can work across processes as IPC! Sorry I'm being lazy here but I don't know if park/unpark/setpark offer that -- I doubt it. However on Solaris you have doors which is an ultrafast IPC.


I really loved the little UI element in this showing the concurrent executing paths as highlighted lines in the code (with soothing little transitions no less).


Seems so simple, but I haven't seen that before either. Novel.


There is a typo in the first code example, “memoy” instead of “memory”


Fixed, thanks!


Thanks for writing this up -- I'll be coming back to it when I need to implement mutexes.

As an aside -- it would be interesting to see what the analog on Windows is. I know from experience the platform-provided mutex implementation on win32 is horrendously slow, and it would be nice to know if it's even possible on that platform to implement a 'real' mutex that doesn't suck. Anyone know of a reference implementation of mutexes on Windows?


Windows has an analogue to futexes called WaitOnAddress.[1] Zig's stdlib claims SRW locks are faster than a futex-based implementation.[2] But the futex-based implementation a few lines down should run fine on Windows, since it uses the cross-platform futex abstraction, so it wouldn't be hard to benchmark.

[1]: https://learn.microsoft.com/en-us/windows/win32/api/synchapi...

[2]: https://github.com/ziglang/zig/blob/295b8ca467da36cd1066395e...
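
Usage-wise, WaitOnAddress has the same compare-then-sleep shape as FUTEX_WAIT. A minimal sketch (Windows 8+, link against Synchronization.lib):

    #include <windows.h>

    volatile LONG ready = 0;

    void wait_for_ready(void) {
        LONG undesired = 0;
        /* Sleeps only while ready still equals `undesired`; the
         * kernel re-checks the value, just like FUTEX_WAIT. */
        while (ready == 0)
            WaitOnAddress(&ready, &undesired, sizeof(ready), INFINITE);
    }

    void set_ready(void) {
        InterlockedExchange(&ready, 1);
        WakeByAddressSingle((PVOID)&ready);
    }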


The Win32 mutex is slow because it's also usable across processes.

If you don't need cross-process sync, you should really be using a Critical Section instead. The kernel-level interface, which is largely undocumented, is I believe either an Event object or a Keyed Event on later versions.

There's some interesting related discussion on that here: https://microsoft.public.win32.programmer.kernel.narkive.com...

> Anyone know of a reference implementation of mutexes on Windows?

Look in the WINE source.


It's worth noting that EnterCriticalSection on Windows already doesn't enter the kernel in the non-contended case. So very often these are used instead of the mutex kernel object.


Refer to 'Futexes Are Tricky' (<https://www.akkadia.org/drepper/futex.pdf>), which shows how to build a mutex with a futex. Windows futexes are named WaitOnAddress, WakeByAddressSingle, and WakeByAddressAll.

(That said, I think I heard somewhere that windows mutexes are decent now?)


> I know from experience the platform-provided mutex implementation on win32 is horrendously slow

Sorry, but this sounds like FUD to me. First off, there is no single Win32 mutex implementation. The Win32 'Mutex' object is indeed heavyweight, but it is meant for inter-process synchronization. For thread synchronization you would rather use 'CriticalSection' or 'SRWLOCK'; they both avoid calling into the kernel in the uncontended case. SRWLOCK in particular is very fast.



That blog slightly misrepresents futex. Or perhaps it's outdated. It's written as if futexes can only be used with a 1/0/negative protocol, but that's not true. The values used are up to user space.


Indeed, in another article Chen actually explicitly claims that msft invented (what we understand to be) futexes and that Linux switched their futex syscall from semaphores to futexes by copying windows. It is truly a wildly incorrect and implausible view of history.


I don't know if this is the "best" recommendation, but it's the one I know.

https://crates.io/crates/parking_lot

IIRC the only platform specific part is the thread parking logic, so I'd assume it's pretty fast on Windows in the common case.


The Rust standard library has incorporated a lot of parking_lot. On platforms that have futex-like primitives (like Linux and Windows) it uses those. But I think it uses a parking-lot implementation if the OS doesn't support such an operation.


Why write workarounds instead of simply using arrays of 64-byte atomic structures for your data?

That way you avoid cache misses and cache invalidation at the same time. And your software can run at full speed across multiple threads/cores.


Because it's not always so simple to structure your solution in a way that's amenable to having multiple processors work on the data without cache conflicts. Locks are extremely common, they are not a "workaround" rather just another of many tools to control concurrency.

Also, few or no CPUs can perform atomic operations on 64 bytes of memory without using transactional memory (which has its own drawbacks), and in any case atomic operations and cacheline alignment don't save you from cache misses and invalidations if CPUs operate on the same data.


You don't modify the entire 64 bytes at the same time. That's why you fill it with atomic variables that you can modify _without_ locking.

You need the array to contain structures that fill the cache line to avoid invalidation when two cores write/read structures close to each other.
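
For instance, a cache-line-sized slot might look like this sketch (assuming 64-byte lines, as on most x86 and ARM cores):

    #include <stdalign.h>
    #include <stdatomic.h>

    typedef struct {
        alignas(64) _Atomic float x;  /* e.g. a player position */
        _Atomic float y, z;
        /* the rest of the line is padding */
    } slot;

    _Static_assert(sizeof(slot) == 64, "one slot per cache line");

    slot players[1024];  /* two cores touching different slots never
                            false-share a cache line */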

When two cores write/read the same atomic variable in the same structure, you of course need to know what the consequences will be, but most of the time it will only result in glitches and not crashes... say the position of a player is updated by the physics and incoming network at the same time and read by the render core: the player will glitch to some location that might be wrong for the other writing core, so you have to make sure that is handled so it doesn't behave in an unplayable way. But performance on the CPU is the _only_ bottleneck we will ever have, and we're already there.

You could also look at Java's concurrency package. It's mostly atomic.


Also futexes are not just for locks, but more generally are used for signaling.



