Certainly they retain not just information but compute capacity in a way that other expensive transformations don't. I'm hard-pressed to think of another example where compute spent now can be banked and used to reduce compute requirements later. Rainbow tables, maybe? But they're much less general-purpose.
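To make the "banked compute" analogy concrete, here's a toy sketch of the precompute-then-lookup tradeoff. (This is a plain full lookup table over a made-up 4-digit-PIN keyspace, not a real chained rainbow table, which trades memory back for time, but the banking idea is the same.)

```python
import hashlib

# Spend compute up front: hash every 4-digit PIN once and store the mapping.
table = {hashlib.sha256(str(pin).zfill(4).encode()).hexdigest(): str(pin).zfill(4)
         for pin in range(10000)}

def crack(digest):
    # Later queries reuse the banked work: an O(1) lookup instead of
    # re-searching the whole keyspace per query.
    return table.get(digest)

target = hashlib.sha256(b"4217").hexdigest()
print(crack(target))  # prints "4217"
```

The upfront cost amortizes over however many digests you later invert, which is the same shape of argument as training once and running inference many times.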
Not only can we bank computation and speed up physical simulations by 100x, but I also saw some work on being able to design outcomes in GoL (Game of Life).
There was a paper on using a NN to build or predict arbitrary patterns in GoL, but I can't find it right now.
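For anyone unfamiliar, the system those papers learn to predict is just this update rule; a minimal numpy sketch of one GoL step on a toroidal grid (the NN itself isn't shown, and the glider test pattern is just for illustration):

```python
import numpy as np

def gol_step(grid):
    # Count the 8 neighbours by summing shifted copies of the grid
    # (np.roll wraps around, giving a toroidal boundary).
    n = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(grid.dtype)

# A glider on a 5x5 torus: after 4 steps it reappears shifted by (1, 1).
glider = np.zeros((5, 5), dtype=int)
glider[[0, 1, 2, 2, 2], [1, 2, 0, 1, 2]] = 1
state = glider
for _ in range(4):
    state = gol_step(state)
assert (state == np.roll(np.roll(glider, 1, 0), 1, 1)).all()
```

Predicting this map with a NN is harder than it looks, since the rule is discontinuous and long-horizon prediction compounds errors.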
It would be interesting to see an analysis of this. I see your point - otoh, is there a reason to believe that more computation is being "banked" than, say, matrix inversion or other optimizations that aren't gradient-descent based?
The large datasets involved let us usefully (for some value of useful) bank lots of compute, but it's not obvious to me that it's done particularly efficiently compared to other things you might precompute.
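The matrix-inversion comparison is easy to make concrete: pay the O(n^3) cost once, then every later solve against the same matrix is only an O(n^2) matmul. A small numpy sketch (sizes arbitrary; numerically you'd usually bank an LU/Cholesky factorization rather than the explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Diagonally dominant, so the system is well-conditioned for this demo.
A = rng.standard_normal((n, n)) + n * np.eye(n)

# "Bank" the O(n^3) work once...
A_inv = np.linalg.inv(A)

# ...then each new right-hand side costs only an O(n^2) matmul.
for _ in range(3):
    b = rng.standard_normal(n)
    x = A_inv @ b
    assert np.allclose(A @ x, b)
```

The difference from NN training is scope: the banked inverse only answers questions about this one matrix, while trained weights answer queries over a whole data distribution.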
For converged model training, training is often quite inefficient because the weight updates decay toward zero and most epochs have a very small individual effect. I think for e.g. Stable Diffusion they don't train to anywhere near convergence, so weight updates have a bigger average effect. Not sure if that applies to LLMs.