In OoO-land, all registers are basically always treated exactly the same (though on x86-64, some instruction forms need an extra byte of instruction encoding if any operand is in r8..r15).

But there's still a massive amount of performance tuning to be done. Some fun examples: Haswell can only do one FP add per cycle, but two FMAs, so you can sometimes improve throughput by replacing some 'a+b's with 'a*1+b' (at the cost of higher latency). Or how, on older Intel, a three-addend 'lea' can execute on only one port (as opposed to two ports for a two-addend 'lea', or 3 or 4 for 'add') and has 3-cycle latency (vs. 1 cycle for the two-addend form, so two back-to-back two-addend 'lea's have lower latency than one three-addend one). It is still just one instruction occupying one port slot, though, rather than two, and thus can sometimes be better for throughput.