> It's unclear to me why the focus on a 2^n sized buffer just so you can use & for the mask.
The cost of a mask can probably be entirely buried in the instruction pipeline, so that it's hardly any more expensive than whatever it costs just to move from one register to another.
Modulo requires division. Division requires a hardware algorithm that iterates, consuming multiple cycles (pipeline stall).
You do not need modulo or division to implement non-power-of-two ring buffers, because the index only ever advances by one. So instead of "x = x % BufferSize" you can do "if (x >= BufferSize) x -= BufferSize;" or similar.
That's for "normal" ring buffers. I suspect that the design described in the article can be implemented for non-power-of-two sizes without division, but I'll need to think about the details.
You could also do a predicated conditional move instead. Just do the subtraction every time, and use something like cmov to only do the write if you need to.
I don't know if it would end up being faster, though.
Most likely (very very likely) the branch would be faster. It will almost always be predicted correctly (exceptions on the rollover) and cmov can be moderately expensive.
The general rule is to only use cmov if your test condition is mostly random.
I don't think this is true. I tried the sample out, and gcc, clang, and intel's compiler all generate a cmov for the code instead of a branch with -O2. I don't think all these compilers would have used a cmov instead of a branch if the cmov was more expensive than a branch in this case.
It might have something to do with the branch not being part of a loop so the best it can do is assume the branch is random (think something like modding a hash code when it would indeed be random).
As was pointed out in the thread, recent CPUs have reduced the latency of cmov to one cycle, so your result could also depend on your architecture.
It does, but it may run out of non-dependent instructions to execute while waiting for the long pole of the cmov dependency chain to finish.
This used to be a problem on Pentium4 where cmov had high latency (4 cycles or more), but today, IIRC, a register-to-register cmov is only one cycle, so it is safe to use whenever a branch could have a non-trivial misprediction rate.
Then you have to deal with branch mispredictions, which may hurt performance pretty badly if the RB is heavily trafficked (which is often the use case for an RB).
Actually it might not really be mispredicted. Slightly older Intel CPUs had a dedicated loop predictor that would exactly predict quite complicated taken/non-taken sequences. If the RB is of fixed size the edge case would be always perfectly predicted.
More recent CPUs, IIRC, do away with the dedicated loop predictor as they have a much more sophisticated general predictor, which, although won't guarantee perfect prediction on this case, it might still get close enough.
Depends heavily on the size of the buffer. If it's only three elements large, then branch overhead will be measurable. But for larger buffers, it likely will always be predicted as not taken, and you only have a branch miss upon wraparound.
Modulo by a constant doesn't require a division; you can instead use multiplications, shifts, adds, and subtracts. This transform is typical for compilers. For example, gcc targeting x86_64 compiles % 17 on an unsigned value in %edi into a multiply-by-magic-constant sequence rather than a div.
That only works for compile-time constants, though. With a power-of-two-sized buffer you can just store the mask and decide how large you want your buffer to be at runtime.
The real problem with modulo is at the end of the integer range: add one, the counter overflows to zero, and the index suddenly jumps to a totally different slot in the array. (Power-of-two sizes are immune, since they divide the integer range evenly.)
BTW: read and write pointers, power of two, did that in BeOS (1998) and many sound drivers did it earlier than that.
To me, that seemed like the obvious way to do it when I needed it.