You could also do a predicated conditional move instead. Just do the subtraction...

jnordwick · on Dec 14, 2016

Most likely (very, very likely) the branch would be faster. It will almost always be predicted correctly (the exception being the rollover), and cmov can be moderately expensive.

The general rule is to only use cmov if your test condition is mostly random.
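To make the two options concrete, here is a minimal sketch of both in C. The function names are made up for illustration, and I'm assuming the index is always below cap on entry. Note that the ternary form only invites the compiler to emit a cmov; it is free to emit a branch for either version, so check the generated assembly for your target.

```c
#include <stddef.h>

/* Written as a branch: taken only on rollover, so it is easy to predict. */
size_t next_index_branch(size_t index, size_t cap)
{
    index++;
    if (index >= cap)
        index -= cap;
    return index;
}

/* Written as a select: always do the subtraction, then pick one result.
 * A compiler may lower the ternary to a cmov, but nothing guarantees it. */
size_t next_index_select(size_t index, size_t cap)
{
    index++;
    size_t wrapped = index - cap; /* wraps around (unsigned) when index < cap; unused then */
    return (index >= cap) ? wrapped : index;
}
```

Both functions return the same values; only the shape handed to the compiler differs.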

tbirdz · on Dec 15, 2016

I don't think this is true. I tried the sample out, and gcc, clang, and Intel's compiler all generate a cmov for the code instead of a branch at -O2. I doubt all of these compilers would have chosen a cmov if it were more expensive than a branch in this case.

https://godbolt.org/g/nyFLwp

jnordwick · on Dec 18, 2016

It might have something to do with the branch not being part of a loop, so the best the compiler can do is assume the branch is random (think of something like taking a hash code modulo a table size, where it would indeed be random).

As was pointed out in the thread, recent CPUs have reduced the latency of cmov to a cycle. So your result could also depend on your architecture.

amadvance · on Dec 15, 2016

If you use __builtin_expect() the Intel compiler uses a branch. I mean this way:

if (__builtin_expect(index >= cap,0)) { ...
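Spelled out as a complete function, that hint might look like the sketch below. The function name and the body of the if are my assumptions (the snippet above elides the body), and __builtin_expect is a GCC extension that clang and the Intel compiler also accept, so this is not portable C. The hint only tells the compiler the condition is expected to be false; whether it then emits a branch is still up to the compiler.

```c
#include <stddef.h>

/* Hypothetical helper: advance a ring-buffer index, hinting that the
 * wrap is the rare case. Assumes index < cap on entry. */
size_t next_index_unlikely(size_t index, size_t cap)
{
    index++;
    if (__builtin_expect(index >= cap, 0)) { /* 0 = expected to be false */
        index -= cap;                        /* assumed body: wrap around */
    }
    return index;
}
```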

smallnamespace · on Dec 15, 2016

Can't the CPU just continue executing out of order while waiting on the cmov data dependency to finish, though?

In which case cmov would be relatively cheap since it isn't blocking execution of other instructions.

gpderetta · on Dec 15, 2016

It does, but it may run out of non-dependent instructions to execute while waiting for the long pole of the cmov dependency chain to finish.

This used to be a problem on the Pentium 4, where cmov had high latency (4 cycles or more), but today, IIRC, a register-to-register cmov is only one cycle, so it is safe to use whenever a branch could have a non-trivial misprediction rate.

jnordwick · on Dec 18, 2016

Just looked it up. Yes, from Broadwell forward a reg to reg cmov has latency 1. On Atom processors it is still 6.

If you decrement your array index you don't even need the cmp instruction. The compiler could probably generate some good code.
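A rough sketch of the decrementing version, with a made-up function name and a signed index so the wrap test is a comparison against zero. The idea is that on x86 the decrement already sets the flags, so the compiler can test the sign directly instead of issuing a separate cmp against cap. Whether it actually does that depends on the compiler and flags, so treat this as something to verify in the disassembly.

```c
/* Hypothetical helper: walk the buffer from cap - 1 down to 0, then wrap.
 * Assumes 0 <= index < cap on entry. */
long prev_index(long index, long cap)
{
    index--;             /* the decrement itself updates the CPU flags */
    if (index < 0)       /* so this test can reuse them, no cmp against cap */
        index = cap - 1;
    return index;
}
```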