Now this wouldn't be possible without the high-quality synthetic dataset produced by GPT (1B tokens), but this is more evidence in line with Tiny Stories (https://arxiv.org/abs/2305.07759). That is, LLMs only need to be so big (in both data and parameters) to learn the total sum of human knowledge (and to deal with trash data).
The speed of training is really interesting to me too. Lambda Labs rents out 8 A100s for ~$9-12/hour. Total training cost there would be ~$850-1200. Another ~$2000 for the GPT training data, plus some more given there would be input tokens too.
Very approachable prices.
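Back of the envelope, assuming roughly 4 days of training on one 8xA100 node (which is what I recall the paper reporting) and GPT-3.5-turbo's ~$0.002 per 1K output tokens; both figures are my assumptions:

    # Rough cost math; the 4-day training time and $0.002/1K-token price are
    # assumptions, not numbers taken from the comment above.
    gpu_hours = 4 * 24                             # ~96 hours on one 8xA100 node
    print([gpu_hours * rate for rate in (9, 12)])  # -> [864, 1152] dollars
    print(1_000_000_000 / 1_000 * 0.002)           # -> 2000.0 dollars for 1B generated tokens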
> Now this wouldn't be possible without the high quality synthetic dataset produced by GPT(1B tokens)
Just a note that this is GPT3.5 (I assume turbo?).
This result puts more emphasis on data quality, which might actually be problematic long term though? The vast majority of human knowledge is not represented in high quality textbooks.
It’s not clear from these results but the paper seems to at least imply that the importance of the synthetic data is to “unlock” the pre-training data.
My takeaway as a non-expert is that this is a good result for small and efficient models focused on well-defined domains, but a neutral or maybe even a bad result for a model that displays general intelligence.
I wonder if that is true. Intuitively it seems to me that there are probably many areas of knowledge that cannot be reliably summarized into a textbook. Most subjective experiences, for example. Qualia have been notoriously difficult to describe too, even if they might be absolute.
Even for fields that lend themselves well to being converted to textbook format there is often a tradeoff between accuracy and conciseness. The more you refine, the more nuance you throw away. It seems like a Very Hard Problem (tm) to know which data are superfluous and which are not, especially at scale.
If there's something that cannot be reliably summarized into a textbook, then can we expect to see an agent reliably output it as text in the first place?
Yes. Translation of classical languages is a good example. There are lots of subtle nuances that aren't captured by any pedagogical text, and we accept that humans require apprenticeships/graduate study to become competent (self-study is not enough), but the output is obviously text.
You're describing a good example of a process that requires tacit knowledge. The process of translation results in an output of text, but we can easily represent this output in a textbook. What we can't easily represent in a textbook is the subtle nuances around the process of how to translate. I was referring to that kind of tacit knowledge itself (similar to OP's example of qualia).
Depends on the sophistication of the agent. I have agency myself and can express my current mental state quite readily, or can express personal opinions about a wide variety of subjects. I'm not sure how you would write a useful textbook about it though.
The point is that if something is difficult to express or encode in to words for a textbook, then there's an underlying reason that would apply to not just textbooks but also to others who try to write about the same thing.
Well, used to be a Very Hard Problem. Now you can just test it: train a model on the textbook and ask it to solve many problem challenges that it hasn't seen before.
After you get a score for the model trained on the whole textbook, try removing each sentence in the book in turn. If removing that particular sentence decreased the test scores then keep it in, else throw it away.
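A minimal sketch of that leave-one-out idea (train_model, evaluate and held_out_problems are placeholders, not a real API; in practice you'd ablate whole chunks rather than single sentences, since retraining per sentence would be absurdly expensive):

    def prune_textbook(sentences, train_model, evaluate, held_out_problems):
        # Score a model trained on the full textbook, then drop every sentence
        # whose removal does not hurt that score on the held-out problems.
        baseline = evaluate(train_model(sentences), held_out_problems)
        kept = []
        for i in range(len(sentences)):
            ablated = sentences[:i] + sentences[i + 1:]
            score = evaluate(train_model(ablated), held_out_problems)
            if score < baseline:   # removing this sentence hurt, so keep it
                kept.append(sentences[i])
        return kept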
- Perhaps the model can be improved by adding sentences, but which ones? There is a potentially infinite number of sentences to add and no good way to select them
- Your proposed method handwaves the method of acquiring "many problem challenges that it hasn't seen before", which just moves the problem. Constructing a problem set containing the full range of potential problems to solve is again a Very Hard Problem.
The second problem is not just theoretical either: see https://sitn.hms.harvard.edu/flash/2020/racial-discriminatio... for example, where a facial recognition algorithm performed much worse on non-white women because the training set didn't contain enough pictures of them. AI software is notorious for finding this kind of loophole and for overfitting itself to the training set rather than to the real world.
Very interesting point. I wonder: are subjective experiences knowledge? Not scientific knowledge.
On one side, they are part of who we are. On the other side, just as an airplane does not copy a bird 100%, it makes sense that to make a machine "think" we would feed it rational content. That is, content that follows the scientific method.
I suppose that depends on what type of AI you are trying to build. One that is used to help design airplanes can mostly get by with pure scientific knowledge, although it still needs to know about softer phenomena like claustrophobia and personal space to understand that packing in humans as close as physically possible is not desired.
On the other end of the AI spectrum, an AI being trained to be a children's toy, a companion for the elderly, a nursing robot or even a psychologist would be incredibly deficient if it didn't understand emotions. I don't think using purely content that follows the scientific method would be sufficient for such an AI.
Good points. I especially like the claustrophobia one in planes.
On the other side, there are textbooks about "emotions". I own a Cognitive Behaviour Therapy manual (it seemed interesting) that goes step by step through how to conduct a session as it progresses (it is aimed at therapists, or at readers being their own therapist). As a textbook, similar professional references could be more useful than, say, Reddit or some blogs. Those other sources on the internet will be mostly derived from the reference materials in the field, or from personal anecdotes.
I can imagine textbooks exist on how to deal with people on the spectrum, or for them to recognize social cues, or on what symptoms of depression look like.
I am still skeptical. You did not say anything of the sort, but the scientific method is not the opposite of emotions. It shows us what we know of emotions so far. Psychology has the reproducibility scandal but it is our best try.
Even in the case of planes, I would not be surprised if an engineering manual mentions a minimum space for passengers to be comfortable, as part of regulations, or any other data that might as a side-effect solve the claustrophobia issue.
TL;DR: Scientific knowledge is not antithetical to "human" knowledge. Scientific knowledge is what we actually know. Otherwise we have mysticism, or faith.
I'm still skeptical tbh. The scientific method in psychology seems like it could only ever work in a very probabilistic manner since, unlike physical phenomena, humans react very differently to the same stimulus depending on their own unique circumstances.
For example: People react very differently to being offered a bacon pizza depending on (at least) the time (not for breakfast thanks), location (funerals are right out), their pizza topping preferences, their faith, whether they've already just eaten, how much they trust the person offering the pizza, if they're currently in a group or not (is there enough for everyone?), whether they've ever had any traumatic experiences with pizza or not, if any other foods may be available, whether they have dinner with other people planned, their current dietary restrictions, whether they're ill right now or not, whether they know you are doing pizza-based experiments on them, etc etc etc etc.
All of these might radically alter the response you get. It might be possible to get enough information about someone to make a reasonable guess, but you can never know if your model is complete enough. Even if the same inputs occur twice, the output might still be different based on something you cannot reasonably know. So I think it would be very difficult to construct a predictable stimulus/response model for individuals. Groups might be easier because the differences sometimes average out, but quite often the internal communication leads to feedback loops that disturb any predictions you were trying to make about their behavior.
Btw there is a third option beyond "scientific knowledge" and "faith", which is simply "we do not know and may never know" without ever getting to fill in that gap. Things like "what, if anything, created the universe" fall in that category as we cannot observe such a thing. Given some information about the position of a particle, we can never be sure of its speed per the Heisenberg uncertainty principle. Accurate models of people could be similar: it might simply be that they cannot be meaningfully reduced to simple formulas.
Thanks for your reasoned answer. I agree with what you are saying. A couple more comments, not to correct anything but simply for conversation.
> Btw there is a third option beyond "scientific knowledge" and "faith"
The three I usually see mentioned are: I know because of an argument (reason/science); I believe (faith); and I just know/I had a revelation (mysticism). I agree with you, different ways of not taking a position are positions in themselves: skepticism, nihilism, etc.
> Accurate models of people could be similar: it might simply be that they cannot be meaningfully reduced to simple formulas.
I agree. All models are false but some are useful.
The title seems rather speculative given that they didn't use any textbooks as training data, or even anything especially similar to textbooks. They used mostly StackOverflow data and source code from The Stack dataset, and also explanations, examples, and exercises generated by GPT3.5.
Exactly. Considering the model that generated their exercises was also trained on Stack Overflow data, one may say that what you definitely need for LLM training (if not all you need) is Stack Overflow data. People keep taking it for granted, but I don't think it will continue that way for long.
The internet increased the speed of information propagation, but it also made 'cute' imitations become annoying way faster.
I guess it shouldn't be surprising. With reams of information piling up, it can be hard to get noticed. So you look for gimmicks to get attention (hah!). The trouble is that you are not unique and a thousand other people have the same idea... and your cute imitation is just tired.
Well, of course, which is why I prefixed the imitation with a self-referential remark in parentheses, meant to turn a low-effort attempt at humor into a meta-joke that the audience would consider clever.
It used to work, but judging by your diatribe, maybe it got tired already and I didn't notice.
The next step change for ML will be some kind of broad model (like the LLM equivalent of JEPA) outlining the answer from a "big picture" view and then using an ensemble of small, precise models like this one to fill in the details.
>Is this common, training a LLM from another LLMs generated output?
It's not uncommon. And at the level the SOTA LLMs are at now, it'll only become more common.
It happens because we're at the stage where GPT-3.5/4 can generate much better data (or hit more encompassing or general distributions with the right instructions) than what the majority of LLMs will be typically trained on.
Also, GPT-4 can recheck the output for bugs, mistakes, errors, etc.
It may not be perfect but neither would "natural data" and depending on exactly what kind of data, it might be as close to perfect as you'll get - https://arxiv.org/abs/2305.07759
When humans think, this is what they are doing: retraining themselves on their own learned data. In other words, knowledge is not a conserved quantity. Of course, the extent of the extra knowledge you can produce without access to the environment is limited.
It was intended as philosophy (“just saying stuff”) rather than neuroscience. i.e. a black box rather than white box perspective on the mind.
The point is that you don’t need a recent external environment to grow your knowledge in certain ways.
For example, if you locked a sufficiently gifted and immortal mathematician in prison with no knowledge of the outside world after 2021, he can still prove new theorems forever afterwards (provided he has enough coffee). He doesn’t need access to recent mathematical results to keep proving theorems and growing his knowledge base - that’s simply an accelerant. In fact the longer he stays in prison, the more new theorems he can prove because he has ever more lemmas.
Whereas if you locked a sufficiently gifted and immortal chemist in prison with no knowledge after 2021, she might be able to synthesize some new knowledge at first (meta analysis, etc), but eventually that would run dry and she wouldn’t be able to say anything new about chemistry anymore without any equipment.
Don't look at sleeping. Look at your inner voice, if you have one, and the back-and-forth you're doing with it. This is a form of thinking, and it is also clearly feeding your mind's output back into it as input.
The "inner voice" isn't a sleeping thing, it's an awake thing.
I guess you may be one of the (presumably) minority of people who don't have an inner narrative? I'm having a hard time imagining how this works, but then again, I'm aphantasic, and a lot of people have trouble imagining how that works too.
So in dreams you have conversations with yourself? Who are you, Temporal? A researcher? A neurologist? Just a guy on the internet who reads a lot of comments?
Section 3 states "includes managing intricate algorithmic tasks." I take this as meaning it can write new code (not regurgitation). It is impressive that it can write code as described in the prompt, but I didn't think this was considered "emergent" but rather "generalization." What is the difference?
I think you're just inventing your own definition of the word "emergent", then getting on your high horse when people, quite obviously, don't use the same definition as the one you just made up.
The English explanations you find first for "emergent" are not really fitting here. Interestingly, the German explanation of the same word seems to be much better.
I'm pretty sure they mean emergent as in emergent gameplay.
Here's a good explanation for that:
> Emergent gameplay refers to complex situations in video games, board games, or table top role-playing games that emerge from the interaction of relatively simple game mechanics.
So emergent is not inherently surprising. Games like Dwarf Fortress or Rimworld have a lot of emergent complexity but most of it was not surprising.
So in the context of LLMs it means that complex reasoning is emergent from simple underlying mechanics.
And since we thought humans' complex reasoning skills were unique to us, we are surprised to see them emerge in LLMs.
PS: The explanation for "emergence" is much better:
> In philosophy, systems theory, science, and art, emergence occurs when a complex entity has properties or behaviors that its parts do not have on their own, and emerge only when they interact in a wider whole.
Yes, so this is the meaning of the word I'm rejecting. NN models do not acquire emergent properties.
Human reasoning (game play, etc.) is emergent in the sense that it is irreducible to its parts. "Generalisation in high-parameter training regimes" is fully reducible. It is not emergence in any relevant sense of the term.
Sorry, I meant the way I read people's colloquial use of it toward these LLMs. They aren't actually displaying emergence, as this level of performance was predicted.
This is a misleading title. GPT-3.5 is all you need!
Their whole method relies on getting GPT-3.5 to produce examples and then training a network on those examples. This is a run-of-the-mill method called distillation.
But it's still only 50-something percent on HumanEval, right? So to really prove this out you still need to try to make a much larger model.
I wish someone would do something like that but not use OpenAI's models to create the training data because then supposedly it can't be used for commercial purposes.
Not that it's easy to create a lot of high quality training data.
Looking at the `Educational values deemed by the filter` table on page 5, I can't help wondering whether compressibility might not be a usable proxy for educational value. High compressibility would imply low information content. High marginal compressibility when added to a large dataset would imply the information is already in there. This is basically Normalised Compression Distance all over again - it's only a rough proxy, but I guess the question is whether it could work at all, rather than is it as good as this result.
If that worked as a proxy value, you could sidestep needing GPT-4 at all.
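For what it's worth, here's a minimal sketch of that idea using zlib as a stand-in compressor (ncd and marginal_bytes are hypothetical helpers of mine, nothing from the paper):

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        # Normalised Compression Distance: close to 0 when y adds little to x.
        cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    def marginal_bytes(corpus: bytes, doc: bytes) -> int:
        # How much the compressed corpus grows when the document is appended;
        # a small increase suggests the information is already in the dataset.
        return len(zlib.compress(corpus + doc)) - len(zlib.compress(corpus))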
From page 6, when they describe their synthetic textbook dataset:
"Consider the matrix A = np.array([[1, 2], [2, 4]]). We can check if this matrix is singular or nonsingular using the determinant function. [...]"
No. The determinant is not a suitable way to do that. A proper way to numerically measure singularity would be to compute the condition number of the matrix (the ratio of its largest to smallest singular value).
While you're correct that condition number is a more robust numerical method for arbitrary matrices, the determinant is certainly suitable for many matrices. For small matrices with small integer values such as this one, there is no issue with the determinant.
There is no one bulletproof general method to approximate mathematical calculations with floating point numbers. More context is generally required, including the actual problem that is being approximated, to determine if a method is reliable. Painting this as a black and white situation where the determinant is wrong and the condition number is right gives a misleading picture of how we evaluate numerical methods for fit-to-purpose.
Am I misreading this, or is it really about a 2x2 matrix? (In which case computing the determinant involves just two multiplications and one subtraction.)
They define a generic "is_singular" function and test it with a 2x2 matrix.
The problem with the determinant is not about performance. It is just useless for determining if a matrix is singular. The thing that gives it away is that the determinant is influenced by a rescaling of the matrix:
det(s A) = s^n det(A) where A is a n x n matrix
As an example, would you say that
[[1e-10, 0], [0, 1e-10]]
is singular? It has condition number 1.
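In NumPy terms, just to make that concrete:

    import numpy as np

    A = np.array([[1e-10, 0.0], [0.0, 1e-10]])
    print(np.linalg.det(A))    # 1e-20 -- "practically zero" by the naive check
    print(np.linalg.cond(A))   # 1.0   -- A is a scaled identity, nowhere near singular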
It does work in theory and for integer matrices. A different algorithm might be used in practice when numerical stability is a concern, but the condition number can certainly be defined somewhere in another textbook, and you just need more context to tell the model to use another method.
Edit: Sorry, I completely messed up my original answer here. A better version:
Let's say we are in a setting where we only work with integers. A matrix is invertible iff its determinant is invertible in the underlying ring. The only invertible elements in Z are -1 and 1.
So, the code is also incorrect in the integer setting. Here, we should not check for 0, but for -1 or 1.
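As a sketch (the function name is mine, and for larger integer matrices you would want an exact integer determinant rather than NumPy's floating-point one):

    import numpy as np

    def is_invertible_over_Z(A):
        # An integer matrix is invertible over Z iff det(A) is a unit in Z, i.e. +/-1.
        return round(np.linalg.det(A)) in (-1, 1)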
If I'm reading you correctly, you'd also need to flip the response to the check: where the original test for 0 determines that a matrix is singular if it does find 0, the new test for ±1 should determine that a matrix is singular if it does not find ±1?
Saying that the determinant is useless for determining whether a matrix is singular is incorrect, or misleading at best.
This is purely a matter of numerical stability. Of course [[1e-10, 0], [0, 1e-10]] is nonsingular, and its determinant is 1e-20, which does not equal zero.
Yes, when it comes to floating point issues we might want to use something else, and that's a valid complaint when it comes to NumPy code, but from a theoretical perspective the determinant is an excellent tool to determine singularity.
Of course. Theoretically, the determinant answers the binary question "singular" or "nonsingular".
Numerically, such a binary answer is pretty useless. Here, we need a measure of how singular/nonsingular a matrix is relative to the numerical precision we are working with.
We have so much to learn and optimize. I wonder how far we might get. But it was evident to me from the start that, e.g., harvesting Reddit and other sources and just crossing your fingers, hoping for the best, may not be the optimal method.
Quality is everything, and part of quality is also having the data be structured and geared towards an AI, which may not be identical to being geared for human reading.
Does using the synthetic data generated by GPT have the same effect as being RLHF-aligned by GPT3.5, kind of like aligning the NNs to get performance similar to GPT's?
It's interesting that this paper and the Orca LLM paper from Microsoft are both using GPT3/4 model outputs to train 'powerful' models. The big question is whether they will allow average Joes/businesses to do the same on their own data. Doubt it, considering it breaks OAI's terms of use at present. Will open source lead the way on this? Bring on Llama-v2.
"Just as a comprehensive, well-crafted textbook can provide a student with the necessary knowledge to
master a new subject, our work demonstrates the remarkable impact of high-quality data in honing a
language model’s proficiency in code-generation tasks. By crafting “textbook quality” data we were able
to train a model that surpasses almost all open-source models on coding benchmarks such as HumanEval
and MBPP despite being 10x smaller in model size and 100x smaller in dataset size. We hypothesize
that such high quality data dramatically improves the learning efficiency of language models for code as
they provide clear, self-contained, instructive, and balanced examples of coding concepts and skills."
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li
Actually, it turns out that impact factor is all you need.