why is this not considered a compiler optimization and/or language problem? it s...

JonChesterfield · on June 21, 2024

What would you hope a compiler to optimise x % y into?

Higher level change-the-algorithm aspirations haven't really been met by sufficiently smart compilers yet, with the possible exception of scalar evolution turning loops into direct calculation of the result. E.g. I don't know of any that would turn bubble sort into a more reasonable sort routine.

vitus · on June 21, 2024

If y is always a power of 2 (as suggested in the comments), then I'd expect it to turn into an AND of some sort.

And more generally, with older architectures, integer division was much slower than integer multiplication, so compilers would generally transform this into a multiplication plus some shifts [0]. For context in that timeframe, MUL on Sandy Bridge introduces 3-4 cycles worth of latency (depending on the exact variant), compared to DIV introducing 20+ (per Agner Fog's excellent instruction tables [1]). So even computing x - y * (x / y) with the clever math to replace x/y would be much faster than just x%y. (It's somewhat closer today, but integer division is still fairly slow.)

[0] https://news.ycombinator.com/item?id=1131177 (the linked article 404s now, but it's archived: https://web.archive.org/web/20110222015211/https://ridiculou...)

[1] page 220 of https://www.agner.org/optimize/instruction_tables.pdf

wizzwizz4 · on June 21, 2024

> So even computing x - y * (x / y) with the clever math to replace x/y would be much faster than just x%y.

That only works when y is constant. Otherwise, you need to work out what to replace x/y with… which ultimately takes longer than just using the DIV instruction.

vitus · on June 21, 2024

> That only works when y is constant.

Excellent point! That said, that was the case in this particular example.

> This 16-bit modulo was just a final check that the correct number of bytes or bits were being sent (expecting remainder zero), so the denominator was going to be the same every time.

(Libraries like libdivide allow you to memoize the magic numbers for frequently-used denominators, and if on x86 you have floating point operations with more precision than you need for integer division, you can potentially use the FPU instead of the ALU: https://lemire.me/blog/2017/11/16/fast-exact-integer-divisio...)