NN-512 uses pthreads. It generates a tiny, standalone, highly scalable C99 work-stealing thread pool in every C file. The work items are simply coordinates, similar to the threadIdx/blockIdx coordinates in Nvidia's CUDA. For example, see all the structs and functions starting with "Example32Threader" in this file:
https://nn-512.com/example/32#3
The implementation ends at Example32ThreaderDo1.
The key is to avoid writing to shared cache lines whenever possible, and to keep a bitmap in the central struct (the "hub") that avoids rediscovering that a thread has no remaining work to be stolen, thereby avoiding a "thundering herd" type problem: https://en.m.wikipedia.org/wiki/Thundering_herd_problem
Another key is to never take any shortcuts in synchronization. Rather than doing something subtle/clever to avoid mutexes (etc.), instead make sure the units of work are big enough so that synchronization costs are negligible with respect to the overall computation.
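Not the generated code itself, but a rough sketch of the hub-bitmap idea (the struct layout and names are invented here for illustration; the real generated code is more careful about cache-line placement):

    #include <pthread.h>
    #include <stdint.h>

    /* Hypothetical hub: bit i of mask means "worker i may still have
       stealable work". Thieves consult the bitmap instead of re-probing
       queues that are already known to be empty. */
    typedef struct {
        pthread_mutex_t mt;
        uint64_t mask;
    } Hub;

    /* A thief that finds victim v's queue empty clears v's bit once,
       so no other thief rediscovers the same fact. */
    void hubMarkDrained(Hub *hub, int v) {
        pthread_mutex_lock(&hub->mt);
        hub->mask &= ~(UINT64_C(1) << v);
        pthread_mutex_unlock(&hub->mt);
    }

    /* Pick a victim that may still have work; -1 means all drained. */
    int hubPickVictim(Hub *hub, int self) {
        pthread_mutex_lock(&hub->mt);
        uint64_t m = hub->mask & ~(UINT64_C(1) << self);
        pthread_mutex_unlock(&hub->mt);
        return m ? __builtin_ctzll(m) : -1; /* lowest set bit (GCC/Clang) */
    }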
> Rather than doing something subtle/clever to avoid mutexes (etc.), instead make sure the units of work are big enough so that synchronization costs are negligible with respect to the overall computation
This seems to be working well in my current project, even though I originally planned to refactor things to be "cleverer". A momentary Big Stupid Mutex lock followed by a good chunk of independent work is a nicely boring design for worker threads that seems to work fine.
Another trick I previously used was to set aside work involving contended resources until after all the worker threads finish, instead of trying to resolve contention with waits or more complex code. If the leftovers from each thread are <1% of the total work, they can simply be cleaned up single-threaded afterwards. Sometimes it's enough to just detect potential contention and leave the trickier work to simpler code.
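A minimal sketch of that pattern, assuming hypothetical process/items/leftovers names: each worker tries the lock and, on failure, defers the item instead of blocking:

    #include <pthread.h>
    #include <stddef.h>

    extern void process(int *item);
    static pthread_mutex_t shared_mt = PTHREAD_MUTEX_INITIALIZER;

    void worker_pass(int *items, size_t n, size_t *leftovers, size_t *nleft) {
        for (size_t k = 0; k < n; k++) {
            if (pthread_mutex_trylock(&shared_mt) == 0) {
                process(&items[k]);         /* fast path: uncontended */
                pthread_mutex_unlock(&shared_mt);
            } else {
                leftovers[(*nleft)++] = k;  /* defer instead of waiting */
            }
        }
    }
    /* After all workers are joined, one thread drains the leftovers
       (typically <1% of the items) with no locking at all. */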
For folks interested in threads in C, I also recommend reading "Threads Cannot be Implemented as a Library" [1]. The summary is that Pthreads mostly works correctly as a library (1) because its functions execute memory fence instructions and (2) because the C compiler treats its functions as opaque, i.e., as functions that might read or write any global variable. However, these properties do not guarantee thread-safe code in a number of situations, such as in the presence of certain compiler optimizations. Thus, the paper argues that the compiler must be aware of threads and that threads cannot be implemented purely as a library.
The language basis for threading is now the C11/C++11 memory model [1], even if you don't directly use it--it's baked into the compiler IR. I would argue that in the 2020s, you should be using C11/C++11 threading instead of direct pthreads if you're writing new code.
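For example, a minimal C11 counterpart to a pthreads worker, using <threads.h> and <stdatomic.h> (note that glibc only shipped <threads.h> starting with version 2.28; link with -pthread):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    static atomic_int counter;

    static int worker(void *arg) {
        (void)arg;
        for (int k = 0; k < 100000; k++)
            atomic_fetch_add(&counter, 1);  /* data-race-free by definition */
        return 0;
    }

    int main(void) {
        thrd_t t[4];
        for (int k = 0; k < 4; k++)
            thrd_create(&t[k], worker, NULL);
        for (int k = 0; k < 4; k++)
            thrd_join(t[k], NULL);
        printf("%d\n", atomic_load(&counter));  /* always 400000 */
    }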
(And yes, Hans Boehm was one of the people instrumental in getting the threaded memory model adopted.)
[1] If you really want to be pedantic, it's the current verbiage that rules, independent of the actual language version you request. This is one of those situations where trying to describe the semantics more formally is challenging, so a lot of the changes in the specification aren't supposed to actually be changing how things are compiled but how things are described in the text.
In some ways the POSIX API is better. For example, the POSIX API permits you to statically define and initialize pthread_mutex_t, pthread_once_t, and similar objects. Without static initialization you're stuck in a catch-22 because to initialize such objects at runtime you have to rely on some other implicitly serialized control flow, such as by invoking initializers from main(), which is particularly cumbersome and error prone for libraries.
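Concretely, a small sketch of the difference (C11's once_flag is the one object that does get a static initializer, so call_once is the usual escape hatch):

    #include <pthread.h>
    #include <threads.h>

    /* POSIX: fully usable from any thread with zero runtime setup. */
    static pthread_mutex_t plock = PTHREAD_MUTEX_INITIALIZER;

    /* C11: mtx_t has no static initializer, so a library must lazily
       initialize it via call_once (once_flag does have ONCE_FLAG_INIT). */
    static mtx_t cmtx;
    static once_flag cmtx_once = ONCE_FLAG_INIT;
    static void cmtx_init(void) { mtx_init(&cmtx, mtx_plain); }

    void lib_enter(void) {
        call_once(&cmtx_once, cmtx_init);
        mtx_lock(&cmtx);
        /* ... critical section ... */
        mtx_unlock(&cmtx);
    }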
C++ has it a little easier because, AFAIU, the language itself supports static constructors--i.e. the ability to run [mostly] ad hoc code to initialize static objects at runtime (specifically load time).
The C11/C++11 threading API lacks static initializers because when the specification was being developed Windows light-weight threading primitives didn't support such a capability. Basically, the C11/C++11 API had to take a lowest common denominator approach, which in most cases meant dropping useful features from the POSIX API so Microsoft didn't have to reimplement their primitives. (The hope was that C11- and C++11-compliant implementations would become widely available immediately if the standard made it easy to wrap existing OS primitives, but unfortunately implementations were--and arguably still are, especially for C--many years late.)
Also, the memory model is mostly distinct from the threading APIs. The memory model was more important for nailing down atomics in C11 and C++11. The threading APIs are much higher level, and because they rely on opaque types and function calls, their implementation and the necessary compiler support (ISA memory barriers, compiler code-motion barriers, etc.) was largely transparent.
Simple enough. Except... what if Thread#2 had the following:
    while (i < 2);  // Busy-wait until i reaches 2
    foo();
Then in Thread#3:
    while (i < 3);  // Busy-wait until i reaches 3
    bar();
Then in Thread#4:
    while (i < 4);  // Busy-wait until i reaches 4
    baz();
Then in Thread#5:
    while (i < 5);  // Busy-wait until i reaches 5
    foobar();
As we can see, "i" is a synchronization variable, and we only know that by knowing how the other threads work. If the compiler combines Thread#1's increments into a single store of 5, i no longer steps from 1 to 2 to 3 to 4 to 5: the threads no longer synchronize, and the code gains a race condition in which all of them may execute at once (since i starts off at 5).
-----------
For better or worse, modern programmers must think about the messages passed between threads. After all, semaphores are often i++ and i-- statements at the lowest level (maybe with a touch of atomic_swap, or a LOCK prefix, depending on your architecture).
Modern code must mark when a variable matters for inter-thread synchronization, to selectively disable the compiler's optimizer (funnily enough, the same marking is also needed to strongly order the L1 cache and the out-of-order core of modern processors).
As such, proper threading requires a top-to-bottom language-level memory model.
The "knowledge" that the i++ cannot be optimized / combined beyond the sleep statements.
---------
This is no longer an issue on modern platforms. Today, we have the C++11 memory model, which strictly defines where and when optimizations can occur; the default "seq_cst" memory ordering is the strongest.
There is also a faster, but slightly harder to understand, pair of orderings: acquire and release. Acquire/release is especially useful on more weakly ordered systems like ARM and POWER9.
Your mutex_lock() and mutex_unlock() calls contain these memory barriers, which tell the compiler, the CPU, and the L1 cache to order the code the way the programmer expects. No optimization is allowed to move code "over" a mutex_lock() or mutex_unlock(), thanks to the memory model.
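The classic acquire/release illustration is message passing: a release store publishes every write made before it, and the acquire load that observes the store is guaranteed to see those writes:

    #include <stdatomic.h>
    #include <stdbool.h>

    static int payload;               /* plain, non-atomic data */
    static atomic_bool ready;

    void producer(void) {
        payload = 42;                 /* 1: write the data */
        atomic_store_explicit(&ready, true,
                              memory_order_release);  /* 2: publish it */
    }

    void consumer(void) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                         /* spin until published */
        /* Guaranteed to read 42: the acquire load synchronizes with
           the release store, making the payload write visible. */
        int v = payload;
        (void)v;
    }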
But back in 2004, before the memory model was formalized, it was impossible to write a truly portable POSIX-threads implementation. (Fortunately, compilers at the time recognized the issue and solved it in their own ways: Windows had InterlockedExchange calls, GCC had its own memory model. But the details were non-standard and non-portable.)
Good succinct tutorial. For people interested in studying concurrency (threading is just a subset of the model), I highly suggest the following book:
Foundations of Multithreaded, Parallel, and Distributed Programming by Gregory Andrews.
On the low-level OS/hardware side (i.e., SMP, caches, etc.), UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers by Curt Schimmel is excellent.
For C#, use async and await. Stephen Cleary's book is the standard: Concurrency in C# Cookbook: Asynchronous, Parallel, and Multithreaded Programming
For Windows, I liked: Concurrent Programming on Windows by Duffy
A wonderful, easy-to-use library built on top of rfork that uses channels and alternates to synchronize the procs. Heap memory can be shared across processes (e.g., malloc'd memory or a segment(3)), so passing data around is just passing pointers to objects in shared memory.
Further reading which demonstrates the library and how it influenced Go can be found here: https://seh.dev/go-legacy/
Ditto, via David Butenhof's excellent book _Programming with POSIX Threads_. I didn't know what a mutex or condition variable was before I read that book.
I was going to comment the exact same thing. I have fond memories of learning about parallelism in university and using pthreads. It was such a mind-blowing and novel thing at the time, when all I had been exposed to was sequential, single-threaded programming.
This wonderful tutorial is the same I used to learn Pthreads around 10 years ago.
The training section of the LLNL website [1] offers a lot of solid tutorials; I used the OpenMP and MPI ones for my Parallel Computing class some years ago, and they helped me to systematically learn (and learn to apply) the material. I love them!
This is a great resource, and it's also well written. I did a lot of C thread programming 22 years ago (plus some CORBA), and this would have been very helpful; it basically condenses a whole semester of university classes into one web page!
Lazy question: are there any details of how pthreads are implemented for each operating system?
I recently got excited about OpenMP[0], which seems to let one take advantage of threads while handling some of the drudgery of setup in common patterns.
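A taste of it: one pragma parallelizes the loop, and the reduction clause combines the per-thread partial sums (compile with -fopenmp on GCC/Clang):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        /* OpenMP spawns the thread pool, splits the iterations among
           the threads, and merges the per-thread sums at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (int k = 1; k <= 1000000; k++)
            sum += 1.0 / k;
        printf("harmonic(1000000) = %f, %d threads available\n",
               sum, omp_get_max_threads());
    }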