Your message is more misleading than the GP. Many architectures sold today still...

Your message is more misleading than the GP.

Many architectures sold today still claim unaligned accesses are optional (e.g. all ARM pre-v7, which includes the popular Raspberry Pi Zero). Not to mention that even if they are supported, not all instructions support it (which is the case today on all ARM cores and even on x86).

From the architectures and instructions which may support it, it may have a performance penalty which may range from "somewhat slower" (e.g. Intel still recommends stack alignment, because otherwise many internal store optimizations start giving up) to "ridiculously slower" (e.g. I once had to write a trap handler that software-emulated unaligned accesses on ARM -- on all 32-bit ARMs Linux still does this for all instructions except plain undecorated LDR/STR when the special unaligned ABI is enabled).

And finally, even if the architecture supports it with decent enough performance, it may do it with relaxed atomicity. E.g. even as of today aarch64 makes zero guarantees regarding atomicity of even atomic instructions on unaligned addresses (yes, really). To put it simply because it is a _pain in the ass_ to implement correctly (say programmer does atomic load/store on overlapping addresses with different alignments). This is whether they cross cache lines or not.

i.e. it's as a bad as the GP is saying. You can't just put one example of one processor handling each case correctly to dismiss this claim, because the point is that most processor's don't bother and those who do bother still have severe crippling limitations that make it unfeasible to use in a GP compiler.

And there is still a lot of benefit to packing things up... but it does require way too much care and programmer effort.