Understanding how a given CPU (plus the rest of the computer hardware) works does not suffice to understand what is going on when a particular program is running. For that, you need to read the program, or an execution trace, or both: something along those lines that is specific to the program being run.
This is the wrong analogy. The transformer block is a bunch of code and weights: a set of instructions laying out which operations to run on which numbers. The optimizer changes the weights to minimize a loss function during training, and then the code implementing a forward pass simply runs during inference. That's what it is doing. It's not doing something else.
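To make the training-versus-inference distinction concrete, here is a minimal PyTorch-style sketch. The toy architecture, shapes, and hyperparameters are illustrative assumptions, not any particular transformer:

```python
import torch
import torch.nn as nn

# Toy stand-in for a model: layers and sizes are placeholders.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Training: the optimizer nudges the weights to reduce the loss.
x, target = torch.randn(8, 16), torch.randn(8, 16)
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)
    loss.backward()   # gradients of the loss w.r.t. the weights
    optimizer.step()  # only the weights change here

# Inference: the same forward-pass code runs, with the weights frozen.
model.eval()
with torch.no_grad():
    y = model(x)
```

The point of the sketch is just that the only thing training changes is the weights; the forward-pass code that runs at inference is fixed.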
If the argument is that a model is a function approximator, then it certainly isn't approximating some function that performs worse at the task at hand, and it certainly isn't approximating a function we can describe in a few hundred words.
There is pretty good reason for that. If the function could be described explicitly in a few hundred words, it would be extremely unlikely that we'd have seen a jump in capability with model size.