> Identify a situation where a trick could be applied
> Build analysis that can find that situation (or a generalization of it)
> Apply the trick to all the places you can find
> Staple a hundred of those “optimization passes” together (hopefully reusing step 2 as much as possible) and suddenly your compiler starts doing really big and complicated changes as an emergent property of all the small ones!
Interestingly, it feels like there's a trend in compiler research to go in the opposite direction: a very small number of passes, where each pass applies every single item of a class of optimizations until fixpoint.
So instead of "merge_shifts -> replace_muls_with_shifts -> merge_shifts -> constant_fold -> merge_muls -> etc", you'd maybe have "constant_fold_and_merge_shifts_and_replace_mul_with_shifts".
I'm thinking especially of:

- egraphs: https://www.youtube.com/watch?v=m001XqQKyCQ&t=61s (more on egraphs: https://egraphs-good.github.io/)
- RVSDGs: https://www.sjalander.com/research/pdf/sjalander-tecs2020.pd...

Part of the trick is having a representation where some optimizations are a trivial property of the representation's structure, so more fixpoint passes can be replaced with forward passes (constructing and destructing the advanced representation).
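To make the "one saturating pass instead of staged passes" idea concrete, here's a minimal sketch using the egg crate from the egraphs-good project linked above. The rewrite rules and the input expression are toy examples of mine, not a real pipeline:

```rust
use egg::{rewrite as rw, *};

fn main() {
    // Each rule is one local trick. Equality saturation applies them all
    // together until fixpoint, instead of staging them as ordered passes.
    let rules: &[Rewrite<SymbolLang, ()>] = &[
        rw!("mul-to-shift"; "(* ?x 2)" => "(<< ?x 1)"),
        rw!("merge-shifts"; "(<< (<< ?x 1) 1)" => "(<< ?x 2)"),
        rw!("comm-mul";     "(* ?x ?y)" => "(* ?y ?x)"),
    ];

    // (a * 2) * 2: the staged pipeline above would need several passes
    // in the right order; saturation finds (a << 2) in one run.
    let start: RecExpr<SymbolLang> = "(* (* a 2) 2)".parse().unwrap();
    let runner = Runner::default().with_expr(&start).run(rules);

    // Extract the cheapest equivalent expression from the e-graph.
    let extractor = Extractor::new(&runner.egraph, AstSize);
    let (cost, best) = extractor.find_best(runner.roots[0]);
    println!("{start} -> {best} (cost {cost})");
}
```

The point is that mul-to-shift and shift-merging are registered once and applied jointly to fixpoint, so no pass ordering problem arises.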
Compiler optimization at this point is probably pointless. If you're running an interpreted language, then your JIT is more important than your compiler tricks (although you could argue your JIT is your compiler).
If you're on a multi-core or multi-threaded system, then memory/cache behavior is more important... and out of your control.
Nobody cares about CSE or constant hoisting anymore, because speculative execution obviates them.
For consumer stuff, you're waiting for the user for 99% of your cycles, so who cares.
For processing environments, overall throughput is more important.
IMO, compiler optimization these days should mean emitting simple, straight-ahead code that is speculative-execution friendly and accesses data sequentially. Work with the branch predictor and with the cache behavior of the CPU. Anything clever will founder at the chip level.
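To illustrate the "straight-ahead, sequential" point, a toy contrast (my sketch, not the commenter's code): the flat slice gives the prefetcher and branch predictor a trivially regular pattern, while the linked list forces a dependent, unpredictable load per element.

```rust
// Sequential access: the hardware prefetcher sees the stride and the
// loop branch is perfectly predictable.
fn sum_flat(xs: &[u64]) -> u64 {
    xs.iter().sum()
}

struct Node {
    value: u64,
    next: Option<Box<Node>>, // each hop is a dependent load to an unknown address
}

// Pointer chasing: no prefetchable pattern, each iteration waits on the
// previous one's load.
fn sum_list(mut node: Option<&Node>) -> u64 {
    let mut total = 0;
    while let Some(n) = node {
        total += n.value;
        node = n.next.as_deref();
    }
    total
}
```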
Please refrain from making such misleading statements. One can easily prove you wrong by disabling constant hoisting and CSE and showing that it significantly slows down practical code (JSON parsing, gzip). I actually worked on CSE today and had benchmarks differ by +/- 30% just from that. All on a new Alder Lake. Your post is wrong and misleading.
> If you're on a multi-core or multi-threaded system, then memory/cache behavior is more important... and out of your control.
What do you mean "out of your control"?! How you lay out your data structures and what kinds of locks you use are well within your control.
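As a concrete example of layout being within your control, here's a sketch of the standard fix for false sharing between per-thread counters (the 64-byte cache-line size is an assumption; it holds on most current x86 cores):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Pad each counter out to its own cache line. Without the alignment,
// adjacent counters share a line and "independent" threads still
// contend on it (false sharing).
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

static COUNTERS: [PaddedCounter; 2] = [
    PaddedCounter(AtomicU64::new(0)),
    PaddedCounter(AtomicU64::new(0)),
];

fn main() {
    let handles: Vec<_> = (0..2)
        .map(|i| {
            thread::spawn(move || {
                for _ in 0..1_000_000 {
                    // Each thread hammers its own line; no ping-ponging.
                    COUNTERS[i].0.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```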
> Nobody cares about CSE or constant hoisting anymore, because speculative execution obviates them.
What fantasy land do you live in? Where do you get these magical OoO CPUs that track an unlimited number of things? Because here in our world, speculative execution has limits and can only track so many in-flight instructions. Without optimization, speculation is less effective, since there is more bloated, unoptimized code (and more live values) to track. Adding the ability to track more things costs silicon and power! Without optimization you'll also spill more values to the stack, because you'll be using registers less efficiently. More values on the stack put more pressure on L1d, which pushes out other lines and causes further perf loss.
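To make the register-pressure argument concrete, here is a hand-written before/after sketch of what CSE does (toy code of mine, unrelated to the benchmarks mentioned above):

```rust
// Before: `i * stride` is a common subexpression, computed twice.
// Unoptimized, that's two multiplies in flight and two extra values
// occupying registers or the OoO tracking window.
fn sum_pair_before(data: &[f32], i: usize, stride: usize) -> f32 {
    data[i * stride] + data[i * stride + 1]
}

// After CSE: the shared value is computed once. Fewer instructions to
// track, fewer live values competing for registers, fewer spills.
fn sum_pair_after(data: &[f32], i: usize, stride: usize) -> f32 {
    let idx = i * stride; // hoisted common subexpression
    data[idx] + data[idx + 1]
}
```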
So, no, your statement is approx 0% correct. Compiler optimizations matter a lot!