
> It's unclear to me why the focus on a 2^n sized buffer just so you can use & for the mask.

The cost of a mask can probably be entirely buried in the instruction pipeline, so that it's hardly any more expensive than whatever it costs just to move from one register to another.

Modulo requires division. Division requires a hardware algorithm that iterates, consuming multiple cycles (pipeline stall).
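For reference, here's the masked form the parent comment has in mind, as a sketch (function and variable names are mine, and it assumes the capacity is a power of two):

  /* Power-of-two capacity: wrapping is a single AND instead of a divide. */
  static unsigned rb_slot(unsigned index, unsigned capacity_pow2)
  {
      return index & (capacity_pow2 - 1);   /* only valid for power-of-two capacities */
  }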



You do not need modulo or division to implement non-power-of-2 ring buffers, because you only ever increment by one. So instead of "x = x % BufferSize" you can do "if (x >= BufferSize) x -= BufferSize;" or similar.

That's for "normal" ring buffers. I suspect that the design described in the article can be implemented for non power-of-two without division but I'll need to think about the details.
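A minimal sketch of that conditional-subtract increment in C (names are mine; it assumes the index is always kept in range and only advances by one):

  /* Advance a ring index by one without modulo or division. */
  static unsigned rb_advance(unsigned x, unsigned buffer_size)
  {
      x += 1;
      if (x >= buffer_size)     /* taken only once per wraparound */
          x -= buffer_size;
      return x;
  }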


You could also do a predicated conditional move instead. Just do the subtraction every time, and use something like cmov to only do the write if you need to.

I don't know if it would end up being faster, though.
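Something like this is what the branchless version could look like, as a sketch (whether the compiler actually lowers the ternary to a cmov depends on the compiler and flags):

  /* Do the subtraction unconditionally, then select the wrapped or unwrapped result. */
  static unsigned rb_advance_branchless(unsigned x, unsigned buffer_size)
  {
      x += 1;
      return (x >= buffer_size) ? x - buffer_size : x;
  }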


Most likely (very very likely) the branch would be faster. It will almost always be predicted correctly (exceptions on the rollover) and cmov can be moderately expensive.

The general rule is to only use cmov if your test condition is mostly random.


I don't think this is true. I tried the sample out, and gcc, clang, and Intel's compiler all generate a cmov for the code instead of a branch with -O2. I don't think all these compilers would have used a cmov instead of a branch if the cmov was more expensive than a branch in this case.

https://godbolt.org/g/nyFLwp


It might have something to do with the branch not being part of a loop, so the best the compiler can do is assume the branch is random (think of something like modding a hash code, where it would indeed be random).

As was pointed out in the thread, recent CPUs have reduced the latency of cmov to a cycle. So your result could also depend on your architecture.


If you use __builtin_expect() the Intel compiler uses a branch. I mean this way:

  if (__builtin_expect(index >= cap, 0)) { ...
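A hypothetical completion of that snippet, just to show the shape (the wraparound body is my assumption, not taken from the comment):

  if (__builtin_expect(index >= cap, 0)) {   /* hint: almost never taken */
      index -= cap;
  }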


Can't the CPU just continue executing out-of-order while waiting on the cmov data dependency to finish, though?

In which case cmov would be relatively cheap since it isn't blocking execution of other instructions.


It does, but it may run out of non-dependent instructions to execute while waiting for the long pole of the cmov dependency chain to finish.

This used to be a problem on Pentium4 where cmov had high latency (4 cycles or more), but today, IIRC, a register-to-register cmov is only one cycle, so it is safe to use whenever a branch could have a non-trivial misprediction rate.


Just looked it up. Yes, from Broadwell forward a reg to reg cmov has latency 1. On Atom processors it is still 6.

If you decrement your array index you don't even need the cmp instruction. The compiler could probably generate code that good.
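A sketch of the counting-down idea (hypothetical; a signed index lets the decrement's own sign flag serve as the wraparound test, with no separate cmp):

  /* Step a ring index backwards; the compare folds into the decrement. */
  static long rb_retreat(long x, long buffer_size)
  {
      if (--x < 0)
          x = buffer_size - 1;
      return x;
  }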


Then you have to deal with branch mispredictions which may hurt performance pretty bad if the RB is heavily trafficked (which often is the use case for an RB).


Actually it might not really be mispredicted. Slightly older Intel CPUs had a dedicated loop predictor that would exactly predict quite complicated taken/non-taken sequences. If the RB is of fixed size the edge case would be always perfectly predicted.

More recent CPUs, IIRC, do away with the dedicated loop predictor as they have a much more sophisticated general predictor, which, although it won't guarantee perfect prediction in this case, might still get close enough.


Depends heavily on the size of the buffer. If it's only three elements large, then branch overhead will be measurable. But for larger buffers, it likely will always be predicted as not taken, and you only have a branch miss upon wraparound.


Modulo by a constant doesn't require a division; you can instead use multiplications, shifts, adds and subtracts. This transform is typical for compilers. For example, this is what gcc targeting x86_64 does to perform % 17 on the unsigned value in %edi:

  movl	$-252645135, %edx   # magic constant 0xF0F0F0F1 = ceil(2^36 / 17)
  movl	%edi, %eax          # x into eax for the widening multiply
  mull	%edx                # edx:eax = x * magic; high half lands in edx
  movl	%edx, %eax
  shrl	$4, %eax            # eax = q = x / 17
  movl	%eax, %edx
  sall	$4, %edx            # edx = q * 16
  addl	%edx, %eax          # eax = q * 17
  subl	%eax, %edi          # edi = x - q * 17 = x % 17
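Written out in C, that sequence is the usual reciprocal trick (a sketch of what the asm computes, not code from the article):

  /* q = floor(x * 0xF0F0F0F1 / 2^36) == x / 17, then r = x - q * 17. */
  static unsigned mod17(unsigned x)
  {
      unsigned q = (unsigned)(((unsigned long long)x * 0xF0F0F0F1u) >> 32) >> 4;
      return x - (q * 16 + q);
  }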


It doesn't require division, but how is that long sequence of instructions going to beat an AND?


That only works for compile-time constants, though. With a power-of-two sized buffer you can just store the mask and decide how large you want your buffer to be at runtime.
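A sketch of that (names mine): round the requested capacity up to a power of two once, then the mask is just data and every access is a single AND:

  /* Round a requested capacity up to the next power of two. */
  static unsigned round_up_pow2(unsigned requested)
  {
      unsigned capacity = 1;
      while (capacity < requested)   /* assumes requested <= 2^31 */
          capacity <<= 1;
      return capacity;
  }
  /* later: mask = capacity - 1; slot = index & mask; */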


libdivide (http://libdivide.com) implements similar logic at runtime


You can also compute the optimization at runtime, same as the compiler does. Libtheora contains code for this, used for quantizing video data.


The real problem with modulo is at the end of the integer range. Add one, overflow, and suddenly you jump to a totally different index in the array!
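A concrete instance of that, assuming a free-running 32-bit counter and a 10-entry buffer:

  /* 4294967295 % 10 == 5, but the next increment wraps the counter to 0, */
  /* so the index jumps from slot 5 to slot 0 instead of slot 6. With a   */
  /* power-of-two size, 2^32 is a multiple of the size, so the wrap is    */
  /* seamless.                                                            */
  unsigned counter = 4294967295u;            /* UINT_MAX */
  unsigned before  = counter % 10u;          /* 5 */
  unsigned after   = (counter + 1u) % 10u;   /* counter wraps to 0 -> 0 */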

BTW: read and write pointers, power of two, did that in BeOS (1998) and many sound drivers did it earlier than that. To me, that seemed like the obvious way to do it when I needed it.



