Writing Multithreaded Applications in C++ (deathbytape.com)
74 points by StylifyYourBlog on Feb 7, 2015 | 35 comments



Both of these examples exhibit the "busy waiting" anti-pattern, where a thread waits for something in a loop with a sleep or yield in it. This consumes CPU time, power, and resources, and may sleep longer than necessary. The proper fix for this is to wait on a condition variable, which will make sure the thread remains inactive when idle.

This may be fine for a trivial example of std::thread and std::mutex but you should almost never do this in practice (there are some exceptions in kernel space when waiting for an I/O device).

I have a few useful rules of thumb for multithreaded programming which I want to share:

1) Almost every mutex should be coupled with one or (usually) more condition variables.

2) A mutex variable named "mutex" is a code smell. Ditto for condition variables named "cond". They usually have an easy-to-describe function such as "queue_lock", "not_full" or "not_empty".
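
For instance, a minimal sketch of a bounded producer/consumer queue along those lines, where each primitive is named for what it guards or signals (the queue and its capacity are placeholders of mine):

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>

    std::queue<int> q;
    const std::size_t max_size = 16;      // arbitrary capacity
    std::mutex queue_lock;                // guards q
    std::condition_variable not_full;     // signaled when q gains room
    std::condition_variable not_empty;    // signaled when q gains data

    void produce(int item) {
        std::unique_lock<std::mutex> lock(queue_lock);
        not_full.wait(lock, [] { return q.size() < max_size; });
        q.push(item);
        not_empty.notify_one();           // wake one waiting consumer
    }

    int consume() {
        std::unique_lock<std::mutex> lock(queue_lock);
        not_empty.wait(lock, [] { return !q.empty(); });
        int item = q.front();
        q.pop();
        not_full.notify_one();            // wake one waiting producer
        return item;
    }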


> The proper fix for this is to wait on a condition variable, which will make sure the thread remains inactive when idle.

In case that phrase confuses others reading it: a condition variable is a synchronization primitive, like a mutex or semaphore. It does not mean "a regular variable that you test in an if". A very bad name for the concept, in my opinion.

http://en.cppreference.com/w/cpp/thread/condition_variable


I'll start by saying that I agree with almost everything you've said. I can think of one case where a loop with a sleep might be appropriate, though -- if you have something, perhaps an I/O channel like a log, and you want it flushed regularly, but not every time a write is performed. With a long enough sleep (>100ms) the performance overhead ought to be low. Is there a better way to design that, though?


In that case you don't want to sleep. You want to wait until a condition (new data) or a timeout (100ms) occurs.

The reasoning is that you don't know how much data will arrive in the next 100ms; if you just sleep, either a buffer could overrun or something else could block. Therefore you need to flush the buffer either when enough data has accumulated or when a timeout has occurred.
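
A sketch of that pattern with std::condition_variable::wait_for (the 4 KB threshold, the buffer, and write_to_disk are placeholders of mine):

    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <string>

    void write_to_disk(const std::string&);  // hypothetical blocking write

    std::mutex buf_lock;
    std::condition_variable data_ready;      // signaled by writers after appending
    std::string buffer;
    bool done = false;

    void flusher() {
        std::unique_lock<std::mutex> lock(buf_lock);
        while (!done) {
            // Wake early if a writer signals that enough data has accumulated,
            // otherwise wake after 100ms at most; never busy-wait.
            data_ready.wait_for(lock, std::chrono::milliseconds(100),
                                [] { return buffer.size() >= 4096 || done; });
            write_to_disk(buffer);
            buffer.clear();
        }
    }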


I can see where you're coming from. For my particular application that's not a concern (and I have sequence numbers to know when I've dropped data).


Qt/C++ users can now use futures with .then() and .await() constructs through the tasks[1] library.

An example of the .then() construct is here[2].

An example of the .await() construct is here[3]. What happens there is the following:

1. The function is suspended at that line. This suspension does not block the thread, and hence the GUI remains responsive.

2. A thread is created and the lambda is run in the new thread.

3. When the lambda completes, the function resumes and finishes up its work.

The above gives easy-to-read, non-blocking code written in a synchronous style.

[1] https://github.com/mhogomchungu/tasks

[2] https://github.com/mhogomchungu/zuluCrypt/blob/d0439a4e36521...

[3] https://github.com/mhogomchungu/zuluCrypt/blob/eadd2643291f3...


It's nice that C++11 and later have standard wrappers around pthreads. But that doesn't make multithreaded programming any easier. Threads still work the same way. All the usual problems, from race conditions to termination, remain. There's nothing comparable to Go's race detector or Rust's compile-time concurrency checking.

I've done a fair amount of concurrent programming in C++; I wrote most of the code for a DARPA Grand Challenge off-road autonomous vehicle a decade ago. That was fun. Some of the code is hard real time, and some isn't. Thread priorities matter. (This was on QNX, which is a hard real time OS.) There are some unusual techniques; for example, I had a "logprintf" function which was non-blocking. "logprintf" wrote to a queue being written to disk by another thread, and if the queue was full, "..." went into the log. "logprintf" thus could not delay a real-time task and could be used safely in hard real time sections.
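
Not the actual code, but a minimal sketch of that idea: the log call never blocks, enqueueing only if the lock and the queue are immediately available, and leaving a "..." marker when anything was dropped (the queue size, names, and atomic flag are my assumptions):

    #include <atomic>
    #include <cstdarg>
    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <mutex>
    #include <string>

    std::mutex log_lock;
    std::deque<std::string> log_queue;    // drained by a separate writer thread
    const std::size_t max_queue = 1024;
    std::atomic<bool> dropped(false);     // remember that entries were lost

    void logprintf(const char* fmt, ...) {
        char buf[256];
        va_list ap;
        va_start(ap, fmt);
        vsnprintf(buf, sizeof buf, fmt, ap);
        va_end(ap);

        std::unique_lock<std::mutex> lock(log_lock, std::try_to_lock);
        if (!lock.owns_lock()) { dropped = true; return; }        // never wait on the lock
        if (log_queue.size() >= max_queue) { dropped = true; return; }
        if (dropped.exchange(false)) log_queue.push_back("...");  // mark the gap
        log_queue.push_back(buf);
    }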


The Go race detector was pretty much just a port of Google's Thread Sanitizer for C++.


I'm curious what guarantees C++ gives on reading variables like the 'data' in his first example when it can be modified from elsewhere.

Yes, the lock will prevent concurrent access, but will it prevent the compiler from optimizing away the read completely, since main() was the last function that wrote to the variable? I've seen nasty optimization bugs like that happen in C with single threading and strict pointer aliasing, and they are not easy to track down.

One could interpret his example program like this (unrolled and tweaked a bit).

    int data = 0; // ok, data is now guaranteed to be equal to 0
    thread t(produce, &data); // called thread with a pointer to data, so if I want to read data again I must invalidate my registers
    {
        lock_guard<mutex> lock(theLock);
        data = 5;  // I know I set data to 5 here
    }
    this_thread::sleep_for(chrono::milliseconds(500));  // call contains no references to data; sleep is a "pure" call
    {
        lock_guard<mutex> lock(theLock);    // call contains no references to data
        if (data == 5)  // I set data to 5 myself, and I haven't done anything that could touch data since then, so we optimize away this branch
           cout << "data is 5";
    }
    
    
Will the call to lock_guard simply disable all such optimizations? Or are you required to write memory barriers yourself in some way? I guess a simple way to look at it would be that the compiler assumes the thread keeps the pointer to data and that lock_guard has access to that pointer behind the scenes somehow, so that acquiring the mutex "invalidates" your last write. But that would mean acquiring any mutex invalidates all variables, since the mutex doesn't specify which data it is locking. Am I thinking correctly? Where can I find more information regarding this?


> Where can I find more information regarding this?

The C++ standard, of course :) But if you prefer some lighter reading, either http://www.cplusplus.com/reference/mutex/mutex/unlock/ or http://en.cppreference.com/w/cpp/thread/mutex/unlock explain what happens, in slightly different ways. I usually prefer the cppreference version, because it uses terminology from the standard more consistently, but there is no harm in reading multiple explanations about difficult concepts like these :)

I'd repeat the explanation here, but I think you'd best be served by reading one of the two links above.


As far as I understand, locking/unlocking a mutex inserts the necessary memory barriers.
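
Right: in the standard's terms, an unlock of a mutex synchronizes-with every subsequent lock of the same mutex, so writes made before the unlock are visible after the lock, and the compiler cannot cache `data` in a register across the opaque lock/unlock calls. A minimal sketch of the guarantee:

    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex m;
    int data = 0;   // a plain int: the mutex, not the type, provides the ordering

    void producer() {
        std::lock_guard<std::mutex> lock(m);
        data = 5;
    }   // unlock releases: the write is published to the next locker

    void reader() {
        std::lock_guard<std::mutex> lock(m);   // lock acquires
        std::cout << data << '\n';             // prints 0 or 5 depending on order,
    }                                          // but never a stale or torn value

    int main() {
        std::thread t1(producer), t2(reader);
        t1.join();
        t2.join();
    }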


I would recommend using std::async instead of std::thread directly, as it leaves it up to the STL implementation to manage a thread pool.

http://en.cppreference.com/w/cpp/thread/async
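
A minimal usage sketch (expensive_computation is a stand-in):

    #include <future>
    #include <iostream>

    int expensive_computation() { return 42; }   // stand-in for real work

    int main() {
        // std::launch::async forces concurrent execution; with the default
        // policy the implementation may defer the call until get().
        std::future<int> f = std::async(std::launch::async, expensive_computation);
        // ... do other work on this thread ...
        std::cout << f.get() << '\n';            // blocks until the result is ready
    }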


Let me recommend the Threading Building Blocks library from Intel - it's great for writing parallel solutions to data-parallelism problems.


ZMQ is most often associated with network communications, but its in-process sockets are fantastic for multi-threaded programming.
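
A minimal sketch using the plain libzmq C API (the endpoint name is arbitrary; note that with inproc:// the bind must happen before any connect, and both sockets must share one context):

    #include <zmq.h>
    #include <thread>

    int main() {
        void* ctx = zmq_ctx_new();

        void* tx = zmq_socket(ctx, ZMQ_PAIR);
        zmq_bind(tx, "inproc://work");           // bind before the worker connects

        std::thread worker([ctx] {
            void* rx = zmq_socket(ctx, ZMQ_PAIR);
            zmq_connect(rx, "inproc://work");
            char buf[16];
            zmq_recv(rx, buf, sizeof buf, 0);    // message passed without the kernel
            zmq_close(rx);
        });

        zmq_send(tx, "hello", 5, 0);
        worker.join();
        zmq_close(tx);
        zmq_ctx_destroy(ctx);
    }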


I can't describe how important using ZMQ has been to architecting some pretty heavy-lifting solutions. You end up with lots of very well-defined boundaries in your application pipeline, which are language-agnostic.


Kudos! So elegant. Both C++11 and the article.


OpenMP is still the fastest and easiest way to multithread in C and C++. Got a for loop with no loop-carried dependencies?

    #pragma omp parallel for
    for (...)

The Wikipedia page isn't a bad starting point: http://en.m.wikipedia.org/wiki/OpenMP
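
A complete version of that sketch, in case it helps (compile with -fopenmp on GCC/Clang or /openmp on MSVC; the vector and the sqrt are placeholders):

    #include <cmath>
    #include <vector>

    int main() {
        std::vector<double> v(1000000, 2.0);
        // No iteration depends on another, so the range is split across cores.
        #pragma omp parallel for
        for (int i = 0; i < (int)v.size(); ++i)
            v[i] = std::sqrt(v[i]);
    }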


OpenMP is terrible.

While you can argue whether introducing pragmas as control structure is a good idea or not, the performance issues under high load (>80% total CPU) and the caveats (two OpenMP runtimes in the same app, multiple threads running OpenMP-parallelized code, etc.) are just not worth it.

TBB[0] is a much better option (especially with C++11, where you can use lambdas to wrap your to-be-parallelized code instead of having to put it elsewhere), and you get all sorts of goodies with it (e.g. thread-safe containers).

[0] https://www.threadingbuildingblocks.org/
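
For comparison, a minimal tbb::parallel_for with a C++11 lambda, doing the same placeholder work as the OpenMP snippet above:

    #include <cmath>
    #include <cstddef>
    #include <vector>
    #include <tbb/parallel_for.h>

    int main() {
        std::vector<double> v(1000000, 2.0);
        // TBB splits [0, size) into chunks and runs them on its worker pool.
        tbb::parallel_for(std::size_t(0), v.size(), [&](std::size_t i) {
            v[i] = std::sqrt(v[i]);
        });
    }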


I've had no trouble using OpenMP (with Visual C++ and GCC), though I admit I haven't rigorously benchmarked it compared to other solutions.

As far as I know, it's just syntactic sugar over a pretty simple thread pool model. What are the performance issues at high load?


Many years ago I benchmarked OpenMP against other frameworks. It was very easy to use but performed poorly.


Using boost::coroutine, resumable functions are already possible. I've implemented an AWAIT() macro in my scheduler, and its usage looks like this:

https://github.com/flipcoder/kit/blob/master/toys/src/echo.c...

Implementation upon boost::coroutine here: https://github.com/flipcoder/kit/blob/master/include/kit/asy...

This is only a proof of concept, but it works.


What's the real reason for using threads today? Why not use multiple processes instead? Or even OpenCL/CUDA if one needs to crunch numbers?

The only real use case for threads I can see right now is a large game engine that needs to process sound, graphics, networking, and input all at the same time. What other type of application needs threads?


In some application domains, the program conceptually operates on a big shared data structure. The usual example applications are mesh triangulation and mesh refinement [1].

For these kinds of algorithms, it is not possible to divide the work into isolated jobs that run entirely independently of the others, because until we actually start solving something like a refinement problem, we don't know which parts of the graph we will need.

That rules out OpenCL and CUDA, as they want to take a job and run it to completion without having to worry about what anyone else is doing.

It also makes multiple processes less attractive. If you have isolated processes and just pass messages between them, you have the same problem. Do you have one process that handles the shared data structure? Well, then you've effectively serialised your program. Do you divide up the data structure between processes? I wouldn't know how to do that, as operations may span the partitions.

Perhaps you'll use shared memory between the processes? Well then you're effectively writing multi-threaded code and you might as well use threads.

In the end the easiest way to do things that we know of is to use multiple threads, shared memory, synchronisation primitives and optimistic algorithms.

[1] K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T.-H. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, and X. Sui, "The Tao of Parallelism in Algorithms," in Proceedings of the 32nd Conference on Programming Language Design and Implementation (PLDI), 2011.


Thank you for being a voice of sanity on this Chris.

There seems to be a terrible amount of cargo culting around safe concurrent programming and the reinvention of the wheel just because somebody heard the original wheel was bad (cf. processes and shared memory vs. threads), and the hope that CSP and similar schemes will act as a magic bullet.

In the end you need to think carefully about the design of your program, work out what races might occur, and try to encapsulate any shared-state mutation in as small an area as possible, where it can be sensibly reasoned about. Remember that using concurrent data structures will not automatically make your code safe, put in comments saying why any shortcut you take is safe, and do really thorough code review.


> What's the real reason for using threads today?

Hahaha, good one ... oh, wait. Are you serious?

The real reasons are the same as they were, including separation of GUI and non-GUI processing in desktop apps, dispersal of bulk work (both computational and operational) across several cores, wrapping of blocking calls to make them asynchronous/cancellable, etc.

What is it that changed today that made you think that well-established multi-threading patterns are no longer relevant?


> What is it that changed today that made you think that well-established multi-threading patterns are no longer relevant?

Not patterns, but usage. I thought things like OpenCL would have made multi-core CPUs less relevant for intensive calculations, since GPUs are so much more powerful when you need to execute many tasks at the same time.

And since processors were single-core for a very long time, I was wondering what threads were used for before multi-core CPUs, and how they really changed things for GUIs. I mean, threads were not really useful on single cores, were they?

So it just seems to me that threads only became useful with multi-core CPUs, not really before. And unless you really need to take full advantage of the CPU, your app is not always the only one being run; the other cores can be running other applications rather than your threads. A simple piece of software might not need threads at all, unless it's something big like a commercial 3D game.


> I mean, threads were not really useful on single cores, were they?

Every application you run has at least one thread; how else did you think the main function gets run? And you wouldn't be able to run multiple applications at once without multitasking.

> how they really changed things for GUIs.

GUI applications are usually single-threaded. This means that if you run anything blocking in that thread, the entire application will freeze. The GUI thread can still continue if you run your blocking task in a worker thread.

The OS scheduler will take care of your threads, even if you have only a single core, to make sure everything remains responsive.

The only advantage of parallelism compared to concurrency is more performance.


> You won't be able to run multiple applications without multi-tasking.

But that doesn't mean you really need to make use of multithreading in your program, since the OS is already doing it.

> The only advantage of parallelism compared to concurrency is more performance.

Only if your program can be parallelized. And even if you do manage to parallelize it, in the best case you'll only get a 4x or 8x performance increase, which is less than an order of magnitude.


GPUs are not suitable for many types of calculations.


I would be curious to see some examples.

Number crunching is almost always parallelizable. Computing is not always calculation, though. As long as you can express your program as a set of input/output models, OpenCL/CUDA will help a lot. Of course it's not a silver bullet, so one might often have to reorganize the problem to get a better-performing solution.

The more you can tweak an algorithm to fit inside a set of stricter constraints, the better it will perform on a GPU.

Of course, if you have a hard problem, it's always better to be able to build hardware that targets that particular problem for optimal performance. Maybe I'd be more interested in multithreading when CPUs have at least 16 or 32 cores; 4 or 8 cores is not a lot.


In addition to the reasons others have already covered here, most operating systems have relatively expensive process forking--whereas threads can be (and have been) implemented entirely in userspace.

So, if your program settled on a model of one "process" per connection, you could rapidly run into slowdowns from forking/execing, and you could also make the OS really unhappy with the number of processes running around. Instead, using threads, the "process" lives only in application code; you can have as many or as few as you'd like, and the OS won't care.

This, plus all of the usual points about shared memory, easier ownership and communication, simpler programming model, etc. etc.

There are use cases for all the options--OS-level processes, heavy threads ala pthreads, green threads ala JVM, and fibers ala custom engine code and Erlang.


Threads are one of the easiest paths to responsive applications: one thread for the GUI and many for asynchronous background tasks (network, disk, compute-heavy operations). Communicating and sharing between processes is more complex than between threads; generally it is more OS-specific, and processes have much higher overhead.

One easy example of a sane place for threading that would resonate with this crowd is HTTP servers: a thread pool to serve requests and a few asynchronous threads to handle DB operations and message-queue interactions. Most frameworks do this for you (with real threads in languages that support them, like C, C++, Go, and Java, and green threads in languages that don't, like Python, Node/JS, and Ruby).


You can share resources between two threads; this is much harder with processes.

The use case is that you don't block the UI while doing heavy or slow work, e.g. generating previews in an image organizer or fetching HTTP requests in a browser.

Instead of std::thread, I'd rather recommend using std::function and a threadpool like this one:

http://threadpool.sourceforge.net/
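
If you'd rather not take a dependency, a small pool over std::function is not much code either. A minimal sketch of my own (not the linked library's API; no exception handling, shutdown only via the destructor):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class ThreadPool {
        std::vector<std::thread> workers;
        std::queue<std::function<void()>> tasks;
        std::mutex queue_lock;
        std::condition_variable not_empty;
        bool done = false;
    public:
        explicit ThreadPool(size_t n) {
            for (size_t i = 0; i < n; ++i)
                workers.emplace_back([this] {
                    for (;;) {
                        std::function<void()> task;
                        {
                            std::unique_lock<std::mutex> lock(queue_lock);
                            not_empty.wait(lock, [this] { return done || !tasks.empty(); });
                            if (done && tasks.empty()) return;
                            task = std::move(tasks.front());
                            tasks.pop();
                        }
                        task();   // run outside the lock
                    }
                });
        }
        void schedule(std::function<void()> f) {
            { std::lock_guard<std::mutex> lock(queue_lock); tasks.push(std::move(f)); }
            not_empty.notify_one();
        }
        ~ThreadPool() {
            { std::lock_guard<std::mutex> lock(queue_lock); done = true; }
            not_empty.notify_all();
            for (auto& w : workers) w.join();
        }
    };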


Last multithreaded C++ app I did was high speed deep packet analysis. For our purpose, we could do packet parsing in separate threads, but needed to consult and update shared data structures to keep track of stream state. It's awkward to do that in separate processes, and it doesn't lend itself to a GPU.


A program written in a linear style with blocking calls running on multiple threads will be far easier to read and understand than an async program written with callbacks and whatnot. And if you require low latency from any composite operation then you'll want to decompose it and run each part on its own thread.



