I mention that to point out that the code density x86 can actually reach is much higher than measurements of compiler output will show. I haven't seen the same for ARM, and I suspect one can't really get much better than compiler output for it or other RISCs.
Yeah, if code size is the only metric you care about. The second link is an excellent example of code you do not want a compiler to generate by default. Besides all the well-known performance pitfalls of microcoded instructions, jecxz is unfusable on I think all relevant CPUs, so it’s both an additional uop and an additional cycle of latency over a test/jz sequence.
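To put a number on the size side of that tradeoff, here's a minimal sketch that only counts bytes for the two sequences in 32-bit mode. The encodings are hand-assembled (E3 cb for jecxz, 85 /r for test, 74 cb for jz). The 0x10 branch displacement is an arbitrary placeholder, and the names are mine, not from either link.

```python
# Hand-assembled 32-bit x86 encodings. The 0x10 rel8 displacement is an
# arbitrary placeholder; only the lengths matter here.
JECXZ = bytes([0xE3, 0x10])      # jecxz +0x10
TEST_JZ = bytes([0x85, 0xC9,     # test ecx, ecx
                 0x74, 0x10])    # jz   +0x10


def size_report():
    """Return (jecxz_size, test_jz_size, bytes_saved_by_jecxz)."""
    return len(JECXZ), len(TEST_JZ), len(TEST_JZ) - len(JECXZ)


if __name__ == "__main__":
    compact, conventional, saved = size_report()
    print(f"jecxz: {compact} bytes, test/jz: {conventional} bytes, saved: {saved}")
```

So the hand-optimised form is half the size at that branch, which is exactly the kind of win that shows up in density numbers and not in performance numbers.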
I've had to patch binaries on multiple occasions by inserting instructions. It is definitely not hard to do so for x86, as one can easily find "slack" that the compiler left behind[1]. But I once had to do it for a MIPS binary, and it was definitely not easy to squeeze in the few extra instructions I needed inline; I ended up having to detour to another area with jumps instead.
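For anyone who hasn't done the detour trick, here's a rough sketch of the general shape on MIPS32, not the actual patch I made. It assumes the two displaced instructions are position-independent and not branches, fills both delay slots with nops, and uses made-up names (`encode_j`, `build_detour`) and addresses.

```python
NOP = 0x00000000  # sll $zero, $zero, 0


def encode_j(pc, target):
    """Encode MIPS32 `j target` for an instruction located at address pc."""
    if target % 4:
        raise ValueError("target must be word-aligned")
    # `j` takes its top 4 address bits from the delay-slot address (pc + 4).
    if ((pc + 4) ^ target) & 0xF0000000:
        raise ValueError("target outside the 256 MiB region reachable by j")
    return (0x02 << 26) | ((target >> 2) & 0x03FFFFFF)


def build_detour(site, cave, displaced, extra):
    """Return (site_words, cave_words) for a two-word patch at `site`.

    displaced: the two original instruction words overwritten at the site
    extra:     the new instruction words to run in the cave
    """
    # Overwrite two words at the site: jump to the cave, nop in the delay slot.
    site_words = [encode_j(site, cave), NOP]
    # The cave replays the displaced words, runs the additions, then jumps back.
    body = list(displaced) + list(extra)
    back_pc = cave + 4 * len(body)
    cave_words = body + [encode_j(back_pc, site + 8), NOP]
    return site_words, cave_words
```

Every inserted instruction costs two jumps and two delay slots on top of itself, which is why this feels so much worse than finding a few spare bytes inline on x86.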
Here's an old paper where the authors tried to optimise for code density manually, and you can consistently see x86 beating ARM and MIPS:
https://web.eece.maine.edu/~vweaver/papers/iccd09/iccd09_den...
[1] See https://news.ycombinator.com/item?id=15720923 for an example.