> It's unclear to me why the focus on a 2^n sized buffer just so you can use & for the mask.
The cost of a mask can probably be entirely buried in the instruction pipeline, so that it's hardly any more expensive than whatever it costs just to move from one register to another.
Modulo requires division. Division requires a hardware algorithm that iterates, consuming multiple cycles (pipeline stall).
You do not need modulo or division to implement non-power-of-two ring buffers, because the index only ever advances by one. So instead of "x = x % BufferSize" you can do "if (x >= BufferSize) x -= BufferSize;" or similar.
That's for "normal" ring buffers. I suspect that the design described in the article can be implemented for non-power-of-two sizes without division, but I'll need to think about the details.
You could also do a predicated conditional move instead. Just do the subtraction every time, and use something like cmov to only do the write if you need to.
I don't know if it would end up being faster, though.
Most likely (very very likely) the branch would be faster. It will almost always be predicted correctly (exceptions on the rollover) and cmov can be moderately expensive.
The general rule is to only use cmov if your test condition is mostly random.
I don't think this is true. I tried the sample out, and gcc, clang, and intel's compiler all generate a cmov for the code instead of a branch with -O2. I don't think all these compilers would have used a cmov instead of a branch if the cmov was more expensive than a branch in this case.
It might have something to do with the branch not being part of a loop so the best it can do is assume the branch is random (think something like modding a hash code when it would indeed be random).
As was pointed out in the thread, recent CPUs have reduced the latency of cmov to one cycle, so your result could also depend on your architecture.
It does, but it may run out of non-dependent instructions to execute while waiting for the long pole of the cmov dependency chain to finish.
This used to be a problem on Pentium4 where cmov had high latency (4 cycles or more), but today, IIRC, a register-to-register cmov is only one cycle, so it is safe to use whenever a branch could have a non-trivial misprediction rate.
Then you have to deal with branch mispredictions, which may hurt performance pretty badly if the RB is heavily trafficked (which is often the use case for an RB).
Actually it might not really be mispredicted. Slightly older Intel CPUs had a dedicated loop predictor that would exactly predict quite complicated taken/non-taken sequences. If the RB is of fixed size the edge case would be always perfectly predicted.
More recent CPUs, IIRC, do away with the dedicated loop predictor as they have a much more sophisticated general predictor, which, although won't guarantee perfect prediction on this case, it might still get close enough.
Depends heavily on the size of the buffer. If it's only three elements large, then branch overhead will be measurable. But for larger buffers, it likely will always be predicted as not taken, and you only have a branch miss upon wraparound.
Modulo by a constant doesn't require a division; you can instead use multiplications, shifts, adds, and subtracts. This transform is typical for compilers. For example, gcc targeting x86_64 compiles % 17 on an unsigned value in %edi into a multiply-by-magic-constant sequence rather than a div.
That only works for compile-time constants, though. With a power-of-two-sized buffer you can just store the mask and decide how large you want your buffer to be at runtime.
The real problem with modulo is at the end of the integer range: add one, the counter overflows to zero, and the index suddenly jumps to a totally different slot in the array. (Power-of-two sizes are immune, since they divide the integer range evenly.)
BTW: read and write pointers, power of two, did that in BeOS (1998) and many sound drivers did it earlier than that.
To me, that seemed like the obvious way to do it when I needed it.