Suppose you wanted to train an LLM to do addition.
An LLM has limited parameters. If it had infinite parameters it could just memorize the result of every single addition question in existence and couldn't claim to have understood anything. Because it has finite parameters, if it wants to get a lower loss on all addition questions, it needs to come up with a general algorithm for addition. Indeed, Neel Nanda trained a transformer to do addition mod 113 on relatively few examples, and it eventually learned some cursed Fourier transform mumbo jumbo to get 0 loss: https://twitter.com/robertskmiles/status/1663534255249453056
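To make the mumbo jumbo slightly less mysterious, here's a toy numpy sketch of the kind of trick the network converges on: score every candidate answer c with a sum of cosines at a few key frequencies, and the score peaks exactly at (a+b) mod 113. The frequencies below are made up for illustration; the real network picks its own set and builds the cosines out of its embeddings and trig product identities.

    import numpy as np

    p = 113
    key_freqs = [14, 35, 41, 52, 73]  # illustrative only; the trained network chooses its own

    def add_mod_p(a, b):
        # Logit for each candidate answer c: sum over key frequencies k of
        # cos(2*pi*k*(a+b-c)/p). Every term hits its maximum of 1 exactly
        # when c == (a + b) mod p, so the argmax recovers the answer.
        c = np.arange(p)
        logits = sum(np.cos(2 * np.pi * k * (a + b - c) / p) for k in key_freqs)
        return int(np.argmax(logits))

    # The trig trick really does compute addition mod 113 on every pair.
    assert all(add_mod_p(a, b) == (a + b) % p for a in range(p) for b in range(p))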
And the fact that it has developed this "understanding", i.e. learned a general pattern in the training data, is what enables it to compress. I claim that the general algorithm takes fewer bits to encode than memorizing every single example would; if it didn't, the transformer would simply memorize every example. But because it doesn't have the space to memorize, it is forced to compress by developing a general model.
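A back-of-envelope version of that bit counting for the mod-113 case (very hand-wavy, and mapping bits onto parameters is nowhere near this clean):

    import math

    p = 113
    # Pure memorization: a lookup table with one answer per (a, b) pair.
    table_entries = p * p                      # 12,769 pairs
    bits_per_answer = math.ceil(math.log2(p))  # 7 bits to name a residue mod 113
    print(table_entries * bits_per_answer)     # ~89,000 bits just for the answers

    # The general algorithm, by contrast, is a few lines of code plus a
    # handful of frequencies -- on the order of a few thousand bits.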
And the ability to compress lets you construct a language model. Essentially, the better something compresses, the higher the likelihood you assign it. Given a sequence of tokens, say "the cat sat on the", we should expect "the cat sat on the mat" to compress into fewer bits than "the cat sat on the door", because the former is far more common, and intuitively more common sequences should compress more. You can then look at the number of bits used for every possible choice of token following "the cat sat on the" and turn those bit counts into a probability distribution over the next token. I'm unclear on the exact details, but https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... gives a good summary.
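The bridge between the two views is that an ideal code spends -log2 P(token | context) bits on a token, so bit counts and probabilities are interchangeable. A toy sketch, with invented probabilities standing in for a real model's softmax:

    import math

    # Invented next-token probabilities for "the cat sat on the"; in reality
    # these would come from the model's softmax over its whole vocabulary.
    p_next = {"mat": 0.50, "floor": 0.20, "bed": 0.15, "door": 0.01}

    # Probability -> bits: an ideal (arithmetic) coder spends -log2 p bits.
    bits = {tok: -math.log2(p) for tok, p in p_next.items()}
    # "mat" costs ~1 bit, "door" ~6.6 bits: likelier continuations compress better.

    # Bits -> probability: given a bit count L(tok) for each candidate
    # continuation, set P(tok) proportional to 2**(-L(tok)) and normalize.
    weights = {tok: 2 ** (-L) for tok, L in bits.items()}
    z = sum(weights.values())
    probs = {tok: w / z for tok, w in weights.items()}
    print(probs)

That normalization step is what turns "how well does each continuation compress" into an actual next-token distribution.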
It’s exactly this kind of thinking that underlies lossless text compression (not something a transformer strictly guarantees, but often what happens in practice). For that reason, some people thought it would be fun to combine zip and transformers: https://openreview.net/forum?id=hO0c2tG2xL