> Our results demonstrate that the power of language models can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.
I would have hoped they would attribute LLM success to the structure of language itself. As the authors say, even small linear models can approximate CoT and solve complex tasks. So it's not the model. It's the data.
Analogously, humans have very different brains when you look at the low level, yet each brain still learns the same languages and skills about as well as any other. It's not the brain or the neural net (the models) but the data that shapes them to become smart.
This insight has consequences for how we view training data and where to focus our work to improve AI and human brains - improve language, ideas & chains of thought. This resonates with recent discoveries in fine-tuning and training small models like phi-1 and phi-1.5, which were trained on "textbook quality" data of high diversity.
Super interesting - this almost sounds like evidence of linguistic determinism / relativism, i.e. the idea that our language influences how we perceive and think. Is that what you’re thinking too?
I don't think that's what the GP is going for – more like that language is a way to general intelligence, and it doesn't matter what your "implementation" of language is. Just like there's an incredibly diverse range of models and formalisms proven to be Turing complete and as such, equivalent to each other, even though at first glance many of them look just as non-Turing-complete as an autoregressive next-token predictor looks non-AI-complete.
(As all known human languages are Turing complete, every language should be completely equivalent in expressive power – only nuances differ, and as such language doesn't affect thinking in any meaningful sense; this seems to be corroborated by evidence.)
((On the other hand, it appears that individual people's brains do come up with "strategies" for how to think – some think more verbally, others more visually, and yet others think in abstract, conceptual ways that are difficult to even put into words. For example, not everyone has an "inner voice". Yet these strategies all appear to be approximately equivalent in their "thinking power".))
Does this imply a causal link between language skills and general intelligence? Rats are smart but have weak language skills; same with octopuses, afaik.
Yes, over long time spans language accumulates and transmits experience, so rats would be disadvantaged while humans got a huge boost.
What I wanted to say originally is that we are riding on the shoulders of giants - the whole corpus of knowledge and concepts, and ways to use them, that has been discovered by previous generations at great expense. AI and newborn humans inherit this language heritage.
That is why I would attribute most of our intelligence to language. Not all, because we adapt and contextualise; sometimes we stumble on a new thing, a new concept or piece of knowledge. But that is a rare phenomenon - 99% of the time we are contextually reusing the crystallised intelligence in language.
We're like LLMs, where each token generated "visits" the synthesis of the whole human culture - the model weights - before being fully formed. Our thoughts travel the same path - they visit the model of human culture before being formed.
If we lost our language and knowledge we'd have to redo the path again, over a long time, and pay the same price. In a sense, language is smarter than us - smarter than one human generation can accomplish.
> Our results demonstrate that the power of language models can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.
I think that's obvious, isn't it? Neural networks are universal function approximators; the question is how to make them efficient, either in parameters or computation or whatever, as well as all the usual stuff like encouraging convergence, avoiding exploding gradients, etc. That's why transformers are popular - nobody thinks they can compute some function that other models especially can't.
Yeah, I immediately thought: isn't this congruent to a statement about Turing machines? Sure, there are classes of many things that are computationally equivalent, including computers made of paper tape, tin cans, and string. Just most of them are horrendously inefficient and useful only as thought experiments.
I saw this again in audio synthesis, but with more nuance. Most methods are equivalent in some crazy limit, but all have a "special" area of most useful effectiveness. For example, in theory you can predict a signal of many minutes or hours just using linear prediction (LPC), but only at the cost of a gargantuan parameter space that's less efficient than just sampling the signal.
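As a rough illustration of that trade-off (a toy sketch, not real audio code - the order and test signal here are made up), fitting LPC coefficients by least squares and extrapolating works fine for a short periodic signal, but capturing minutes of audio this way would need an absurd order:

```python
import numpy as np

def fit_lpc(signal, order):
    """Fit linear-prediction coefficients by least squares:
    predict x[n] from the previous `order` samples."""
    # Each row of the design matrix is [x[n-1], x[n-2], ..., x[n-order]].
    X = np.array([signal[n - order:n][::-1] for n in range(order, len(signal))])
    y = signal[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def extrapolate(signal, coeffs, n_steps):
    """Continue the signal autoregressively with the fitted coefficients."""
    order = len(coeffs)
    history = list(signal[-order:])
    out = []
    for _ in range(n_steps):
        recent = history[::-1][:order]        # most recent sample first
        nxt = float(np.dot(coeffs, recent))
        out.append(nxt)
        history.append(nxt)
    return np.array(out)

# A sine wave: a tiny order-4 predictor extrapolates it almost perfectly,
# but an arbitrary hours-long signal would need a gargantuan order.
t = np.arange(200)
x = np.sin(2 * np.pi * t / 25)
coeffs = fit_lpc(x, order=4)
future = extrapolate(x, coeffs, n_steps=50)
```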
Nonetheless it is nice to see that researchers are connecting up these dots, even if the pure maths behind it isn't saying anything obviously useful right away. Who knows what insights this might lead to for discovering other new methods of computation.
I think the case can be overstated, or at least there are problems.
In the prehistory of transformers, I was training character-based LSTMs and GRUs to write fake PubMed abstracts of clinical case studies, as a proxy for clinical notes.
This was before conversational models and prompts, and one big problem was that the system always started out in the same state. Holistically, the system has to decide whether the patient has pancreatic cancer or athlete's foot and come up with a coherent story, but really it just had to pick one of 26 letters to start with and then pick another character. Assuming it spells correctly, it starts out in a constrained state space, on a knife edge between writing the same abstract over and over and writing gibberish, depending on the temperature.
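Concretely, the knife edge is just the sampling temperature on the character distribution - something like this toy sampler (the logits here are made up for illustration, not from our actual models):

```python
import numpy as np

def sample_char(logits, temperature):
    """Sample the next character index from a character LM's output logits.
    Low temperature -> nearly greedy (the same abstract over and over);
    high temperature -> nearly uniform (gibberish)."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# Toy distribution over 26 letters with one strongly preferred starting letter.
logits = np.full(26, -2.0)
logits[0] = 3.0
print(sample_char(logits, temperature=0.2))   # almost always picks letter 0
print(sample_char(logits, temperature=5.0))   # close to uniform noise
```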
At the time we thought putting in an initial state as an embedding would have helped, although for reading clinical reports we’d rather have a final state as an embedding.
Knowing what I know now, we should have made up prompts ("Case study of a 23-year-old man with a fractured tibia who presented at the emergency room:") and stuck them in front of the abstracts, but of course that would have meant actually reading the 80,000 abstracts (40 person-days of work; our team could have done it in 2 weeks).
The thing is, there is a gap between what is possible and what is practical. A good author has an end in mind, writes something, and rewrites it; I have been in so many conversations with people speculating about the inference algorithm used by ChatGPT, because so often it seems implausible that you could really get good results one token at a time.
Humans have two modes of thinking. What is one plus one? Two! No real thinking involved; answering that question got hard-wired into your brain. This is what we are currently teaching large language models: a hard-wired function from inputs to outputs. What is 51,307 minus 17,469? Now you have to actually start thinking: you have memorized a procedure for this, and you know how to follow that procedure to arrive at the answer.
This is somewhat like chain of thought, where intermediate results - which a human would either keep in memory or write down if it gets to be too much - get dumped out with the output in order to be consumable when later tokens are produced. This could also span several levels, where you first derive the procedure to follow from another procedure you have memorized, and then answer the actual question.
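To make that concrete, here is a toy sketch (my own illustration, not anything from the paper) of what dumping the intermediate results with the output looks like for that subtraction - digit by digit with borrowing, each step written out as if it were part of the generated text:

```python
def subtract_with_steps(a, b):
    """Digit-by-digit subtraction with borrowing (assumes a >= b), writing out
    every intermediate step the way a chain-of-thought trace would, instead of
    producing the answer in one shot."""
    da = [int(c) for c in str(a)][::-1]          # least-significant digit first
    db = [int(c) for c in str(b)][::-1]
    db += [0] * (len(da) - len(db))
    steps, digits, borrow = [], [], 0
    for place, (x, y) in enumerate(zip(da, db)):
        diff = x - y - borrow
        new_borrow = 1 if diff < 0 else 0
        diff += 10 * new_borrow
        steps.append(f"place {place}: {x} - {y} - borrow {borrow} "
                     f"= {diff} (carry a borrow of {new_borrow})")
        digits.append(diff)
        borrow = new_borrow
    result = int("".join(str(d) for d in reversed(digits)))
    return steps, result

steps, answer = subtract_with_steps(51307, 17469)
print("\n".join(steps))
print("answer:", answer)   # 33838
```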
And how you solve problems can change over time: when you first learn a thing, you usually learn a procedure to follow. When you first learn to write letters, you consciously draw a sequence of lines and arcs. After enough practice this becomes hard-wired and you just write the letter. When you do not do something for a long time, it may go the other direction: you might still remember the procedure and be able to do it step by step, but you can no longer just do it.
So what is the point of this comment? While you might in principle be able to learn any function from inputs to outputs - that is, create a large language model that can produce the correct answer for every question without really thinking about it - I do not think that this is practicable. For every step-by-step procedure a human follows, you would essentially have to learn the completely unrolled procedure in order to produce the answer in one pass. Feed-forward neural networks have no loops unless you externally feed the output back into the input.
That also means you necessarily have to mix the intermediate internal states with the output of an autoregressive feed-forward system, i.e. it will never be able to behave like a human, as it will constantly have to output its internal thought process. If you want to mimic a human response, you will have to hide part of the autoregression internally.
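A minimal sketch of what hiding part of the autoregression could look like, assuming a generic next-token interface (the `model.next_token` call and the `<scratch>`/`<answer>`/`<end>` markers are hypothetical, not any real API):

```python
def respond(model, prompt, max_tokens=256):
    """Autoregressive loop that lets the model 'think' in a scratchpad,
    but only shows the user the part after the answer marker."""
    text = prompt + "<scratch>"            # internal chain of thought starts here
    for _ in range(max_tokens):
        token = model.next_token(text)     # hypothetical single-token predictor
        text += token
        if text.endswith("<end>"):
            break
    # Everything before <answer> is the hidden thought process.
    visible = text.split("<answer>")[-1].removesuffix("<end>")
    return visible.strip()
```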
Not sure about all of the details, but this is an interesting idea focusing on how auto-regressive models can be thought of as learning how to split a difficult task into a series of simpler tasks.
Makes me wonder if that's the magic in denoising autoencoders, too, since they are trained basically to learn how to build an image auto-regressively from more to less noise.
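If so, a toy version of that "more to less noise" loop might look like this (not how any real denoising autoencoder or diffusion model is implemented; `denoise_step` stands in for a learned denoiser):

```python
import numpy as np

def generate_by_denoising(denoise_step, shape, n_steps=10, seed=0):
    """Start from pure noise and repeatedly apply a denoiser, so every step
    only has to solve the simpler task of removing a little noise instead of
    producing the whole image in one shot."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)                 # maximally noisy starting point
    for step in range(n_steps):
        noise_level = 1.0 - step / n_steps     # anneal from very noisy to clean
        x = denoise_step(x, noise_level)       # stand-in for the trained model
    return x

# Dummy denoiser that just shrinks values toward zero, only to show the loop runs.
dummy = lambda x, noise_level: x * (0.5 + 0.5 * noise_level)
img = generate_by_denoising(dummy, shape=(8, 8))
```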