Making algorithms lock-free with Read-Copy Update (RCU)

gpderetta · on May 26, 2016

RCU is a very general solution to the concurrent reclamation problem, but it hardly makes any algorithm lock free.

Also I find the following troubling:

"[in the kernel] Even a concurrency scheme that nominally used spinlocks to protect critical sections would be lock-free, because every thread would exit their critical section in bounded time"

This ignores the possibility of deadlocks and livelocks. Also the lock-free definition requires that the system makes progress if other threads are halted; you can't just handwave the requirement away by claiming that your threads never halt; and even in practice you can't guarantee that a buggy NMI handler won't takes over the cpu effectively halting any thread.

edit: are never not halted -> never halt

ww520 · on May 26, 2016

Based on [1], thread running inside a critical section cannot be preempted by other threads on the processor. That means deadlock won't happen since nested critical sections are still run by the same thread until it exits the outermost critical section.

Thread halting is a not problem as long as it resumes to complete the critical section. Interrupt can happen. The interrupt state is saved to be handled later, the thread resumes to finish the critical section, and then the saved interrupt is delivered to the interrupt handler. The scheduler is interrupt driven anyway.

Thread killed while inside a critical section can be a real problem, as the critical section is not released, preventing other threads from running on the processor. I'm not sure how Linux handles it, whether it would clean up the critical section when the thread is killed or the killing waits until the critical section has finished.

[1] https://en.wikipedia.org/wiki/Critical_section#Kernel-level_...

gpderetta · on May 26, 2016

A thread can still deadlock with itself with non-recursive mutexes or spinlocks. Also assuming a single CPU system is no longer relevant today.

"Thread halting is a not problem as long as it resumes"

the whole point of a non blocking algorithm is that it gives guarantees if a thread doesn't resume for whatever reason.

HenryR · on May 26, 2016

The whole point of a non-blocking algorithm is system-wide forward progress. Yes, this is much harder to do if a thread can suspend for an unbounded amount of time, but that's kind of the point of the article: being in the kernel allows you to pull the kind of trick that makes that behaviour a non-issue.

(edit: s/per-thread/system-wide, since we're talking about lock-freedom).

gpderetta · on May 26, 2016

Technically non-blocking only guarantees system wide progress. Only wait-free algorithms guarantee per thread progress.

Anyway, even controlling preemption and disabling interrupts very much does not make a spinlocked critical section a lock free algorithm. That's not just a theoretical issue. We are dealing right now with a couple of machines where periodically the kflush kernel thread livelocks hard preventing the rest of the system from ever writing a page to disk requiring a hard powercycle.

vardump · on May 26, 2016

Indeed, spinlocks most definitely aren't lock-free.

"rcu_read_lock" doesn't ever block (as in wait for a spinlock), right? Sometimes you just need readers to never block and can pay higher price for other operations.

amelius · on May 26, 2016

From the article:

> So as soon as a thread yields its CPU, it’s guaranteed to be out of its critical section.

Therefore, the yielding can be seen as a blocking operation, meaning that readers do in fact block!

gpderetta · on May 26, 2016

yielding per-se is not a blocking operation, otherwise no algorithm would be non-blocking would under preemptive scheduling. It could be considered a potentially blocking operation if scheduling is exclusively cooperative.

amelius · on May 26, 2016

The CPU is a resource too, for which there is contention just as for other resources such as files. Therefore, at every machine instruction, you are implictly performing a potentially blocking operation. (In a pre-emptive environment.)

Unless you are in a critical section (in which your thread has exclusive control of the CPU). So the blocking may start again at the end of the section.

I hope this makes it clear.

mtanski · on May 26, 2016

On Linux you can guarantee (with some effort) that your process / thread will be the only thing running on a core.

There's a boot parameter that will let you black list CPUs on which to not schedule anything (by default). Then you use tasksel to schedule your process there as well. Finally, you can configure IRQ affinity so also doesn't handle interrupts on that core. It's not a 100% non pre-emptive but pretty close.

I believe there was work being to enable it to be even more accurately 100 but I haven't followed along.

gpderetta · on May 26, 2016

Any scheduler worth anything guarantees forward progress which makes preemption very different from actual blocking.

gpderetta · on May 26, 2016

I'm not very familiar with the kernel RCU, but I believe that not only rcu_read_lock does not block, it doesn't even issue any expensive memory barriers.

webaholic · on May 26, 2016

You are right. It does not even guarantee any mutual exclusion in the read critical section i.e., the data you are reading can be changed while you are in the critical section. RCU, as the article claims, does not work in all situations.

gpderetta · on May 26, 2016

One way to think of RCU is that it enables working with (a restricted form of) persistent data structures in languages without GC.

HenryR · on May 26, 2016

That's precisely the point: if threads never block or yield the CPU, you can guarantee system-wide progress.

If you have an algorithm which might deadlock, 'progress' isn't really well defined.

gpderetta · on May 26, 2016

Again, you can't ever guarantee that a thread never block or yield; it might have bugs, it might get a machine exception, it might have to deal with buggy hardware, someone might have attached an external debugger, etc. These might be rare events and not worth worrying about. And a spinlock might actually be the fastest implementation for your algorithm. It still doesn't make it lock free.

edit: ah, and you might be running under a virtualized cpu and the host takes the cpu away from you.

Edit: another one: SMM mode kicks in and takes the CPU away from the OS.

mtanski · on May 26, 2016

For any HN readers in NYC who have to care about stuff like this (system level)... you should come to ACM's Applicative 2016 conference. It's June 1st and 2nd at NYU. http://applicative.acm.org/

Paul McKenney is going to be presenting. He is one of the inventors of RCU, the author of Linux's RCU code. He's going to be presenting about RCU and write rates.

I'm really curious to hear that talk so that's the primary reason I'm going. But there's going to be other great systems / application talks.

abiox · on May 26, 2016

this seems really interesting, but i won't have the chance to attend - any idea if there will be recordings?

chrisseaton · on May 26, 2016

The ACM almost never records any talks. Not sure why. Maybe it's to allow people to debate more freely.

scott_s · on May 26, 2016

I attended Applicative 2015, and they did record the talks. For each talk, there is a video link: http://applicative.acm.org/2015/applicative.acm.org/speakers...

However, it is true that most ACM conferences do not record talks. That, however, is probably just because recording talks is expensive, and the conference organizers don't want to spend their budget on it.

derefr · on May 26, 2016

The generalized form of this concept is that there are advantages in having a non-preemptive execution model. You can implicitly do all sorts of crazy things without synchronization primitives in a non-preemptive system, because nobody will try to mess with any of the resources you're messing with until you yield execution—which you don't have to do until you're done.

The degenerate case of this is cooperative scheduling, which is painful in about the same way that manual memory management is painful. But the hybrid case, reduction scheduling ala Erlang, gets a lot of the same advantages (nothing pre-empting a function in the middle of a loop body) without the ability to forget to add yields (they happen at tail-call sites—which, for a language where tail-recursion is the only loop primitive, means your code will always have an O(1) runtime before a yield.)

javitury · on May 26, 2016

This ugly son of a b____ is optimizing super hot algorithms and you are f______ stupid. How?... Just read this free paper >>>

I wonder how long it will take until academical papers are published with clickbait titles.

jeffwass · on May 26, 2016

>I wonder how long it will take until academical papers are published with clickbait titles.

Seriously. It's the exact opposite of an abstract, which is to give a quick TL;DR summary so you can know if it's worth your time reading more about the paper.

gunnihinn · on May 26, 2016

I work for a scientific publisher. For April fool's this year we sent an email to all staff saying that to adjust to the realities of today's publishing industry, we would now start rewriting article titles into the more fashionable style. If they could also rework the contents into listicle form, that would be great.

It was not well received, so it worked as expected?

Chris2048 · on May 26, 2016

Add more pictures, remove any set-theory notations, rewrite in "explain like i'm 5" basic English, and people might actually start reading academic papers.

Startup idea?

mtanski · on May 26, 2016

I agree that the language of most papers is needlessly obtuse. In fact I doubt that reviewers understand the full concept when evaluating paper submissions.

Having said that, even if most papers were vastly simplified I doubt that many topics could be lowered to ELI5.

dexterdog · on May 26, 2016

It doesn't matter if it is worth your time. It was worth their time to throw it together so you better generate the eyeballs to pay for it.

mac01021 · on May 26, 2016

http://arxiv.org/abs/1404.5997

rndstr · on May 26, 2016

I skimmed through the article and didn't click the links. So there is that.

ww520 · on May 26, 2016

Two things.

1. A misplaced comma adds substantial confusion: "during a critical section, a thread may not block, or be pre-empted by the scheduler." This sounds like the thread may not block OR if it's blocked, the scheduler would preempt it, which doesn't make sense in regarding to the readers using the critical section. It should read: "during a critical section, a thread may not block and it would NOT be preempted by the scheduler." That just means a thread in a critical section is guaranteed to run to the end of the critical section without worrying about other threads preempting it. It owns the processor until the end of the critical section.

Then it makes sense. When the readers exit the critical section, they have done dealing with the old shared data. The writer's scheduled thread won't preempt the reader threads in critical section, and when it runs, it means the readers have finished.

2. Critical section is a form of lock. I don't see how it can be claimed lock-free. May be it should say critical section help avoid inter-processor locking.

simscitizen · on May 26, 2016

You can do a variant of this in userspace as well:

https://github.com/opensource-apple/objc4/blob/master/runtim...

Basically, issue a syscall that acquires the PC of all threads. When none of the PCs are in the critical section, then you can garbage collect old objects.

nkurz · on May 26, 2016

Hmm, if I'm understanding right, rather than passively checking whether any of the other threads are in the critical section, it's actually forcing all the threads to undergo a context switch so that the PC (instruction pointer) can be read. So long as the context switch is prevented until the thread is out of the critical section, the actual value of the PC is irrelevant so long as the call succeeds:

  #define PC_SENTINEL  1
  unsigned int count = x86_THREAD_STATE64_COUNT;
  kern_return_t okay = thread_get_state (thread, x86_THREAD_STATE64, (thread_state_t)&state, &count);
  return (okay == KERN_SUCCESS) ? state.__rip : PC_SENTINEL;

https://github.com/opensource-apple/objc4/blob/master/runtim...

If you have a lot of threads running, this could get really expensive. I presume it works, but this doesn't seem like an efficient approach unless your "writes" are extremely rare relative to the number reads and number of threads.

I haven't been able to find source for thread_get_state(), though. I don't think there is any way to get the current PC from a core without an interrupt? I presume it at least reads the PC for sleeping threads without needing to wake them?

simscitizen · on May 27, 2016

It absolutely can get expensive with lots of threads running. And you are also right that this code ends up making each thread reach a synchronization point. Luckily Objective-C is targeted towards clients which generally don't have a bazillion threads running so it's generally not a problem.

There was talk at some point of adding a single syscall that got all the PCs of all the threads at once to cut down on the overhead, but it looks like that still hasn't happened.

What's going on here is something like this:

1. objc_msgSend is Obj-C's method dispatcher, and it depends on method caches to make method lookup faster.

2. Sometimes the method caches become out of date. For instance, maybe the method cache filled up and the Obj-C runtime needs to allocate a larger method cache.

3. This old method cache has to be GC'd at some point, so it's added to a freelist.

4. This GC function (the "write" you're talking about) for these caches runs once the size of outdated method caches grows beyond a certain threshold (garbage_threshold in the code). So it shouldn't run too often. The GC function works by checking if any thread is currently within objc_msgSend. If no function is in objc_msgSend, then the runtime is sure that none of the method caches on the free list is in use.

5. It is definitely optimized for reads. The "read" in this case is literally a method dispatch so it happens all the time, and it's important for the read to be lock-free.

achivetta · on May 26, 2016

Until signal handlers get involved...

PaulRobinson · on May 26, 2016

Click-bait titles will never, ever get my up vote even if the content is OK.

Most of the web is a mess. Let's not make this corner like that stuff over there.

limaoscarjuliet · on May 26, 2016

The title seems like an obvious joke though.

bigredhdl · on May 26, 2016

I'm sure it is. That phrase is like nails on a chalkboard for me.

Hydraulix989 · on May 26, 2016

It does work though, unfortunately. I do the occasional social media post on my company's FB and LI pages, and the ones that with click-bait titles consistently do 2-3x better in view count -- all of my top performing posts have click-bait titles.

At least for now we can recognize/detect that our psyches are being preyed upon (soon we will not, as the art of clickbait evolves and starts using machines)...

It's a sad world we live in.

ricksplat · on May 26, 2016

>The general pattern is that you can prepare an update to a data structure, and then use a machine primitive to atomically install the update by changing a pointer.

Interesting. Analogous to the "two phase" commit used by Oracle, as opposed to the global lock used by SQL server.

(sorry if tpc is ubiquitous now, or SQLs doesn't use the locking any more - haven't worked with these in a while)

This is "lock free" certainly - but it still requires atomic updates, which is another concurrency primitive. Sorry not an expert but doesn't this still require some kind of locking under the hood? More efficient than explicit locking certainly but locking nonetheless.

EDIT I just re-read - so he's proposing that rather than locking you just prevent a thread from being context-switched by the scheduler. So you're kind of stepping back from definitive preemptive multitasking and introducing a cooperative element.

So you can't really make "any" algorithm lock-free - only where the scheduler lets you put it on hold, and where your code has the execution privileges to do that.

I think he's speaking specifically about Linux system code, but you have to delve into the details of TFA before that becomes apparent.

It seems fairly obvious then that if you have a more cooperative multitasking model then locking isn't really as necessary.

mtanski · on May 26, 2016

> Interesting. Analogous to the "two phase" commit used by Oracle, as opposed to the global lock used by SQL server.

I don't think 2PC is the right analogy here (which is really about distributed commit). But if you want to go use a database analogy MVCC. For example in LMDB there's many reads and one writer, the write finishes by updating swapping the root page and all future readers can see updates via that root page (analogous to the pointer with new version).

In fact the way LMDB is implemented is essentially RCU with a reader table and GCing of old pages once the oldest reader with access to that page goes away.

Yes, the author is talking about (Linux) kernel space where you have a lot more control over the environment so it's "easier" to use / build various concurrency primitives. Simple example: it's possible in the kernel to be in a spinlock loop with interrupts disabled and the holder will not be pre-empted. You can't make guarantees like that in user space.

gpderetta · on May 26, 2016

The problem with RCU is that the name is extremely misleading.

The name Read-Copy-Update describes the general way to concurrently update a shared data structure: make a local copy of the parts subject to change, modify the local copy then atomically substitute the original with the updated copy.

This is absolutely not exclusive of the RCU algorithm and in fact is pretty much how any lock free update of a data structure looks like (and even not lock free, see persistent data structures or MVCC).

A problem that lock free algorithms have is that they need a way to dispose of the now stale original node, which might be concurrently being accessed by other readers. There are many way to handle the disposal: for example hazard pointers, pass-the-buck, and full GC (this includes shared pointers).

RCU is one of these disposal algorithms and it uses epoch detection plus the contract that an accessor thread won't hold a reference across an epoch. Implementations of RCU can be very efficient for read mostly data structures because, on the reader side do not need expensive memory barriers (contrast with the store-load required to for hazard pointer updates even on the read side): a dependent load (the infamous load_consume) is enough.

JoshTriplett · on May 26, 2016

> EDIT I just re-read - so he's proposing that rather than locking you just prevent a thread from being context-switched by the scheduler.

There are multiple implementations of RCU. One of them (the earliest) uses this mechanism. However, Linux also has a fully preemptible implementation of RCU, in which readers do not prevent context switch at all; that implementation tracks grace periods differently, without relying on context switch. (But still without making rcu_read_lock/rcu_read_unlock expensive.)

vardump · on May 26, 2016

In kernel space, anything that helps to achieve (even limited) wait-freeness is rather welcome. You often don't have the luxury of long lasting locks. Short lasting (at most microseconds) spinlocks is usually the most important concurrency primitive you have.

So nice to read about techniques that can be implemented in kernel space. Inability to do a context switches often works against you there [1].

[1]: Many usermode waiting based constructs cannot be used, because they require yielding or pre-emption. You can't yield if you can't schedule (or block).

prewett · on May 26, 2016

Does "lock-free algorithm" just mean "won't deadlock"? It sure sounds like a standard locking algorithm to me, what with rcu_lock() and talk of critical sections. In fact, it seems like this is similar to having a read lock and a write lock. Am I missing something?

firethief · on May 26, 2016

https://en.m.wikipedia.org/wiki/Non-blocking_algorithm#Lock-...

The stronger guarantee you're expecting is probably "wait free"

bnastic · on May 26, 2016

Is RCU still patent encumbered, though?

mtanski · on May 26, 2016

I believe the URCU library is LGPLed and with IBM blessing: https://github.com/urcu/userspace-rcu/blob/master/lgpl-relic... . So you're probably okay if you use that, or a derivative work of that (that you contribute back).

There's other similar methods RCU like hazard pointers. There was a patent application filed, but it was abandoned before it was granted.

noselasd · on May 26, 2016

Yes, see e.g. https://www.kernel.org/doc/Documentation/RCU/rcu.txt and https://www.kernel.org/doc/Documentation/RCU/RTFP.txt

gpderetta · on May 26, 2016

I believe that the basic RCU patent has expired a few years ago. Some extensions might be still under patent. And no, I'm not going to do a google search to figure out what's patented or not (neither should you probably).

graycat · on May 26, 2016

Apparently RCU abbreviates "reed-copy-update".

nkurz · on May 27, 2016

Yes, it's an abbreviation for "read-copy-update", but unfortunately it doesn't really mean that at all. Rather, it's shorthand for something like "Now that I've written the updated version of the data to a new location and atomically switched a pointer so that all new readers will use it, how will I know when all previous readers of the old version have finished so that I can reclaim the space that the old copy was using?" It's a terrible name, but an almost magically efficient approach for certain problems.

nialv7 · on May 26, 2016

I think by 'lock-free' this author means 'deadlock free'

huhtenberg · on May 26, 2016

Crazy trick. I'll tell you what a crazy trick is.

It's when you are sitting, optimizing some bulk throughput rate in a router firmware, generally minding your own business and then a guy pops in and says that he just sped things up by a factor of 10. You get up, follow him to his cubicle and, lo and behold, it is 10x faster when blasted with a SmartBit stream. Ask him how he managed to achieve such a remarkable feat and he says - I just profiled the code, found a bottleneck and worked around it:

    // some_random_mutex.lock();
    
       ...

    // some_random_mutex.unlock();

Now THAT was a crazy trick.

Edit: to clarify - the lock was there for a reason. The whole thing whoopsed in an instant outside that specific test.

MereInterest · on May 26, 2016

Now I'm curious. Was the mutex actually necessary, or was the performance increase purchased at the cost of race conditions?

huhtenberg · on May 26, 2016

Yeah, the mutex was necessary and it was pretty bloody obvious from looking at few lines of context around those calls.

SomeCollegeBro · on May 26, 2016

Unfortunately, most of the time when you come across a mutex you have to assume it's necessary. It's very hard to regression test code after removing a mutex, chances are you are not going to encounter the race condition which the mutex protects against (unless it's well documented, of course).

gambiting · on May 26, 2016

Well to be fair I know of a certain game that was released on PS3, which ran into the hardware limit for locks inside the UI system. Someone thought of just commenting them out, and tested it - the whole UI worked fine, and we went from having thousands of locks in the UI system to having zero - and the game shipped like that.