
Y'all should consider using atomic increments on separate cache lines instead of spinlocks. If you want to minimize latency, atomic increments give you a measurable improvement of up to two orders of magnitude over locks. https://lmax-exchange.github.io/disruptor/files/Disruptor-1....
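
For concreteness, the suggestion amounts to something like this minimal C++ sketch (illustrative names, not code from the linked paper): keep each hot counter on its own cache line and advance it with fetch-and-add instead of taking a lock around it.

    #include <atomic>
    #include <cstdint>

    // Hypothetical sketch: a counter padded out to a full cache line so that
    // threads incrementing it never share a line with anything else.
    struct alignas(64) PaddedCounter {
        std::atomic<uint64_t> value{0};   // alignas(64) pads the struct to 64 bytes
    };

    PaddedCounter sequence;   // e.g. the "next slot" counter of a ring buffer

    uint64_t claim_next() {
        // On x86 this compiles to a single lock-prefixed XADD: no lock
        // acquire/release round trip, and it cannot fail or need a retry.
        return sequence.fetch_add(1, std::memory_order_acq_rel);
    }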



They solve different problems. A lock-free data structure can be much faster than one with locks, but it depends on exactly which operations you can do quickly and atomically (for example, the LMAX queue length is limited to a power of two because there's no way to atomically increment an integer that wraps at arbitrary values, and the modulo operator is too expensive for non-power-of-two values). A lock is a primitive that allows you to make arbitrary operations 'atomic' (though in these benchmarks a simple increment is generally used). A lock-free queue may be a good replacement for a queue that uses locks, but that is only one application of locks.
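
To make the power-of-two point concrete (illustrative names, a sketch rather than LMAX's actual code): when the ring size is a power of two, the wrap-around is a single AND with a mask, whereas an arbitrary size would need an integer modulo.

    #include <cstdint>

    constexpr uint64_t RING_SIZE = 1 << 16;        // must be a power of two
    constexpr uint64_t RING_MASK = RING_SIZE - 1;

    // Map an ever-increasing sequence number onto a ring slot.
    inline uint64_t slot_for(uint64_t sequence) {
        return sequence & RING_MASK;   // same result as sequence % RING_SIZE
    }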

Also, the disruptor design is based on a system where there is a 1:1 correspondence between threads and cores. If that's not the case, doing no blocking at all will likely still interact somewhat poorly with the OS's scheduler, for the same reason spinlocks do (in fact, if you use the library you can customize this behaviour).


As far as I'm aware that's exactly how you would implement a spinlock.

A random Google search seems to validate that.

https://stackoverflow.com/questions/1383363/is-my-spin-lock-...
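
For reference, a minimal test-and-test-and-set spinlock built on CAS looks roughly like this (a sketch along the lines of the linked answers, not production code):

    #include <atomic>

    class SpinLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            for (;;) {
                // Spin on a plain load first so waiters don't hammer the
                // cache line with failing CAS attempts.
                while (locked.load(std::memory_order_relaxed)) { /* spin */ }
                bool expected = false;
                if (locked.compare_exchange_weak(expected, true,
                                                 std::memory_order_acquire))
                    return;
            }
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };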


Compare-and-swap isn't quite the same as atomic increment. An atomic increment can't fail; it always increments. Whereas CAS will fail if a different thread has modified the value.
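
A small C++ sketch of the distinction (hypothetical names, just to show the two primitives side by side):

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> counter{0};

    // Atomic increment: a single operation that always succeeds.
    uint64_t next_with_increment() {
        return counter.fetch_add(1, std::memory_order_acq_rel);
    }

    // The same effect built from CAS: it can fail if another thread got in
    // between, so it has to loop until it wins.
    uint64_t next_with_cas() {
        uint64_t current = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(current, current + 1,
                                              std::memory_order_acq_rel))
            ;   // on failure, 'current' is reloaded and we retry
        return current;
    }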

The highest performance design is to use a ringbuffer. To write to the ringbuffer, you atomic-increment the "claim" counter, giving you an index. Now take index modulo ringbuffer size. That's the slot to write to. After you're done writing to it, set your "publish" counter to the index.

Each writer has a "publish" counter, and the "claim" counter isn't allowed to move more than one ringbuffer length beyond the lowest "publish" counter.

Each reader uses a similar strategy: the reader has a current counter. You find the minimum of all publish counter values, then process each slot up to that value, and set your counter to that value. The "claim" counter isn't allowed to move more than one ringbuffer length past the minimum of all reader counters.

Hence, everyone is reading from / writing to separate cache lines, and there are no locks at all. The sole primitive is atomic increment, which can never fail. (The only failure condition is "one of the slots hasn't been processed yet", i.e. the publish counter is <= the claim counter minus the ringbuffer size, at which point everybody waits for the slot to be processed.)

You can wait using multiple strategies: a spin loop (which isn't the same as a spinlock because you're only reading a value in memory, not calling any atomic primitives), yielding CPU time via sched_yield(), or calling sleep. All strategies have tradeoffs. Calling sleep is the least CPU intensive, but highest latency. Vice-versa for spin loop.

Takeaway: there are no locks. Just increments.
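
Here's a rough sketch of the writer path in that scheme, assuming a fixed number of writers and readers (names, sizes and the wait strategy are illustrative; the actual Disruptor API differs):

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <thread>

    constexpr uint64_t RING_SIZE = 1 << 10;     // power of two
    constexpr int WRITERS = 2, READERS = 2;

    // Every counter sits on its own cache line.
    struct alignas(64) Seq { std::atomic<uint64_t> v{0}; };

    uint64_t ring[RING_SIZE];
    Seq claim_counter;            // shared: next slot to hand out
    Seq publish[WRITERS];         // per writer: its highest published sequence + 1
    Seq read_cursor[READERS];     // per reader: its highest consumed sequence + 1

    uint64_t min_of(const Seq* seqs, int n) {
        uint64_t m = UINT64_MAX;
        for (int i = 0; i < n; ++i)
            m = std::min(m, seqs[i].v.load(std::memory_order_acquire));
        return m;
    }

    void write_entry(int writer_id, uint64_t payload) {
        // Claim a slot: the only atomic read-modify-write in the hot path.
        uint64_t seq = claim_counter.v.fetch_add(1, std::memory_order_acq_rel);

        // Don't lap a slot the slowest reader hasn't consumed yet. This is the
        // "wait" described above; spinning or sleeping works here too.
        while (seq >= min_of(read_cursor, READERS) + RING_SIZE)
            std::this_thread::yield();

        ring[seq & (RING_SIZE - 1)] = payload;

        // Publish: make the slot visible to readers with one release store.
        publish[writer_id].v.store(seq + 1, std::memory_order_release);
    }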


From the architecture PoV it is exactly the same thing if you think of it as implemented by an LL/SC pair. This is how it is presented in most of the literature and university courses. Then there is the purely practical issue of x86 not exposing LL/SC and instead having a fairly strict memory model and various lock-prefixed high-level instructions with wildly varying performance characteristics.


Is it?

I'd be interested to see benchmarks showing that spinlocks with CAS have similar throughput to a ringbuffer with atomic increment.

Note that with the ringbuffer approach, each reader can process many slots at once, since you're taking the minimum of all published slot numbers. If you last processed slot 3, and the minimum publish count is 9, you can process slots 4 through 9 without doing any atomic operations at all. The design guarantees safety with no extra work.

It's not a minor trick; it's one of the main reasons throughput is orders of magnitude higher.
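
A standalone sketch of that batch-consume step (hypothetical names; the ring, mask, and counters would be the ones the writers set up):

    #include <algorithm>
    #include <atomic>
    #include <cstdint>

    // Drain everything published so far: one acquire load per writer counter,
    // then plain reads of each ready slot, then a single release store.
    uint64_t drain(const uint64_t* ring, uint64_t mask,
                   const std::atomic<uint64_t>* publish, int writers,
                   std::atomic<uint64_t>& my_cursor,
                   void (*handle)(uint64_t)) {
        uint64_t limit = publish[0].load(std::memory_order_acquire);
        for (int i = 1; i < writers; ++i)
            limit = std::min(limit, publish[i].load(std::memory_order_acquire));

        uint64_t next = my_cursor.load(std::memory_order_relaxed);
        // E.g. next == 4 and limit == 10: slots 4..9 are handled back to back
        // with no atomic read-modify-write operations at all.
        for (; next < limit; ++next)
            handle(ring[next & mask]);

        my_cursor.store(next, std::memory_order_release);
        return next;
    }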

Benchmarks: https://github.com/LMAX-Exchange/disruptor/wiki/Performance-...

Beyond that, the ringbuffer approach also solves a problem you'll always run into with queues: you have to decide the queue's maximum size. It's very hard to use all available resources without allocating too much and swapping to disk, which murders performance. Real systems built on queues have memory usage that tends to fluctuate by tens of GB, whereas a fixed-size ringbuffer avoids the problem.


This is also the reason DPDK uses a highly efficient ring buffer. I believe (based on old benchmarks) it's faster than LMAX.


I do not think x86 atomics are implemented as LL/SC internally. At a minimum they always guarantee forward progress: as soon as the cache line is acquired in exclusive mode (and the coherency protocol guarantees this happens in finite time), the load-op-write always happens and cannot be interrupted.

Also, as far as I'm aware, at least on Intel all atomic operations take pretty much exactly the same number of clock cycles (except for CAS and DCAS, which are ~10% and ~50% more expensive, IIRC).


That is exactly my point. Any x86 SMP platform since the Pentium is built on the assumption that truly exclusive access is possible. For shared-FSB platforms that is trivially implemented by the global LOCK# signal, and K8/QPI simply has to simulate the same behavior on top of a switched NUMA fabric (this is one of the reasons why x86 NUMA coherency protocols are somewhat wasteful, think global broadcasts, and incredibly complex).

For context: before the Pentium, with its glueless 2x2 SMP/redundancy support, there were various approaches to shared-memory x86 multiprocessors with wildly different memory coherence models. (And some of the "let's design a board with eight 80386s" projects are the reason why Intel designed the i586 to be glueless, and such systems are probably still in use to this day, although unsupported.)


No, all you need to implement x86 atomic semantics is the guarantee that a single cache line can be held in exclusive mode for a minimum length of time.

Since forward progress is a pretty basic requirement, in practice even LL/SC platforms provide it, but instead of having a single instruction with guaranteed forward progress you have to use a few special (but sometimes underspecified) sequences of instructions between the LL/SC pairs.


> in practice

FWIW RISC-V guarantees forward progress for reasonable uses:

> We mandate that LR/SC sequences of bounded length (16 consecutive static instructions) will eventually succeed, provided they contain only base ISA instructions other than loads, stores, and taken branches.
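
For illustration, the canonical constrained sequence is a compare-and-swap retry loop like the one below (a sketch in GCC-style inline assembly, assuming a RISC-V target with the A extension): only a couple of base-ISA instructions sit between the lr.w and the sc.w, so it falls well inside that 16-instruction window.

    #include <cstdint>

    // Compare-and-swap built from LR/SC. The backward branch retries only
    // when the store-conditional loses its reservation.
    bool cas32(volatile int32_t* addr, int32_t expected, int32_t desired) {
        int32_t observed, failed;
        __asm__ volatile(
            "1: lr.w.aq  %[obs], (%[ptr])        \n"   // load-reserved
            "   bne      %[obs], %[exp], 2f      \n"   // value changed: give up
            "   sc.w.rl  %[err], %[des], (%[ptr])\n"   // store-conditional
            "   bnez     %[err], 1b              \n"   // reservation lost: retry
            "2:                                  \n"
            : [obs] "=&r"(observed), [err] "=&r"(failed)
            : [ptr] "r"(addr), [exp] "r"(expected), [des] "r"(desired)
            : "memory");
        return observed == expected;
    }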


[sorry for the late reply]

What happens if those 16 instructions touch 16 different cache lines? I'm not a hardware expert (and even less so on coherency protocols), but I think it would be extremely hard to make sure livelock is avoided in all cases, short of some extremely expensive and heavy-handed global 'bus lock' fallback.


Reading and writing memory are excluded from the guarantee, aside from the LR/SC instructions that bookend a transaction. Inside the transaction you're basically limited to register-register ALU and aborting branches.



