
Most improvements like this don't come from the architecture itself, scale aside. They come down to training, which is a hair away from being black magic.

The exceptions are improvements in context length and inference efficiency, as well as modality support. Those are architectural. But behavioral changes almost always come down to: scale, pretraining data, SFT, RLHF, and RLVR.


