Textbooks are all you need (arxiv.org)
256 points by foobarqux on June 21, 2023 | 106 comments



8 passes, so ~50B tokens seen. Still, any other model above 50% on HumanEval is much, much bigger: https://twitter.com/SebastienBubeck/status/16713263696268533...

Now, this wouldn't be possible without the high-quality synthetic dataset produced by GPT (1B tokens), but this is more evidence in line with TinyStories (https://arxiv.org/abs/2305.07759). That is, LLMs only need to be so big (in both data and parameters) in order to learn the total sum of human knowledge (and deal with trash data).


The speed of training is really interesting to me too. Lambda Labs rents out 8 A100s for ~$9-12/hour, so the total training cost there would be ~$850-1200. Add another ~$2000 for the GPT-generated training data, and a bit more on top since there would be input tokens too.

Very approachable prices.
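
As a rough sanity check on those numbers, here is a back-of-envelope sketch. It assumes the ~4 days on 8 A100s that the paper's abstract reports and mid-2023 gpt-3.5-turbo pricing of roughly $0.002 per 1K output tokens (both assumptions, not figures from the paper's cost accounting):

    # Back-of-envelope: training compute plus synthetic-data generation.
    # Assumptions: ~4 days on one 8xA100 node at ~$9-12/hour for the node,
    # and ~1B generated tokens at ~$0.002 per 1K tokens (gpt-3.5-turbo, mid-2023).
    node_hours = 4 * 24
    for rate in (9.0, 12.0):
        print(f"compute at ${rate}/hr: ${node_hours * rate:,.0f}")

    synthetic_tokens = 1_000_000_000
    price_per_1k_tokens = 0.002
    print(f"GPT-3.5 generation: ${synthetic_tokens / 1_000 * price_per_1k_tokens:,.0f}")
    # -> roughly $860-$1,150 of GPU time plus ~$2,000 of API spend,
    #    before counting prompt/input tokens.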

> Now this wouldn't be possible without the high quality synthetic dataset produced by GPT(1B tokens)

Just a note that this is GPT3.5 (I assume turbo?).


This result puts more emphasis on data quality, which might actually be problematic long term though? The vast majority of human knowledge is not represented in high quality textbooks.

It’s not clear from these results but the paper seems to at least imply that the importance of the synthetic data is to “unlock” the pre-training data.

My takeaway as a non-expert is that this is a good result for small and efficient models focused on well-defined domains, but a neutral or maybe even a bad result for a model that displays general intelligence.


The vast majority of human data can be refined into something approximating high quality textbooks, which is what happened here too.


I wonder if that is true. Intuitively it seems to me that there are probably many areas of knowledge that cannot be reliably summarized into a textbook. Most subjective experiences, for example. Qualia have been notoriously difficult to describe too, even if they might be absolute.

Even for fields that lend themselves well to being converted to textbook format there is often a tradeoff between accuracy and conciseness. The more you refine, the more nuance you throw away. It seems like a Very Hard Problem (tm) to know which data are superfluous and which are not, especially at scale.


If there's something that cannot be reliably summarized into a textbook, then can we expect to see an agent reliably output it as text in the first place?


Yes. Translation of classical languages is a good example. There are lots of subtle nuances that aren't captured by any pedagogical text, and we accept that humans require apprenticeships/graduate study to become competent (self-study is not enough), but the output is obviously text.


You're describing a good example of a process that requires tacit knowledge. The process of translation results in an output of text, and we can easily represent that output in a textbook. What we can't easily represent in a textbook are the subtle nuances of how to translate. I was referring to that kind of tacit knowledge itself (similar to OP's example of qualia).


Depends on the sophistication of the agent. I have agency myself and can express my current mental state quite readily, or can express personal opinions about a wide variety of subjects. I'm not sure how you would write a useful textbook about it though.


We already summarize people's mental states and personal opinions in textbooks. No problem there.


If it can't be reliably summarized, how can a human judge the output of the agent then? Compare it against what? It's a bit circular.


The point is that if something is difficult to express or encode into words for a textbook, then there's an underlying reason that would apply not just to textbooks but also to anyone else who tries to write about the same thing.


Well, used to be a Very Hard Problem. Now you can just test it: train a model on the textbook and ask it to solve many problem challenges that it hasn't seen before.

After you get a score for the model trained on the whole textbook, try removing each sentence in the book in turn. If removing that particular sentence decreased the test scores then keep it in, else throw it away.
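
In sketch form (train_model and evaluate are hypothetical stand-ins for whatever training run and held-out benchmark harness you'd actually use):

    def prune_textbook(sentences, train_model, evaluate):
        # Keep a sentence only if removing it hurts the held-out score.
        # Every pruning decision costs a full retrain, so this is N+1 training runs.
        baseline = evaluate(train_model(sentences))
        kept = []
        for i, sentence in enumerate(sentences):
            ablated = sentences[:i] + sentences[i + 1:]
            if evaluate(train_model(ablated)) < baseline:   # removal hurt: keep it
                kept.append(sentence)
        return kept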


I don't think that is sufficient:

- Perhaps the model can be improved by adding sentences, but which ones? There is a potentially infinite number of sentences to add and no good way to select them

- Your proposed method handwaves the method of acquiring "many problem challenges that it hasn't seen before", which just moves the problem. Constructing a problem set containing the full range of potential problems to solve is again a Very Hard Problem.

The second problem is not just theoretical either: see https://sitn.hms.harvard.edu/flash/2020/racial-discriminatio... for example, where a facial recognition algorithm performed much worse on non-white women because the training set didn't contain enough pictures of them. AI software is notorious for finding this kind of loophole and for overfitting to the training set rather than the real world.


Very interesting point. I wonder: are subjective experiences knowledge? Not scientific knowledge.

On one side, they are part of who we are. On the other side, just as an airplane does not copy a bird 100%, it makes sense that to make a machine "think" we would feed it rational content. That is, content that follows the scientific method.


I suppose that depends on what type of AI you are trying to build. One that is used to help design airplanes can mostly get by with pure scientific knowledge, although it still needs to know about softer phenomena like claustrophobia and personal space to understand that packing in humans as close as physically possible is not desired.

On the other end of the AI spectrum, an AI being trained to be a children's toy, a companion for the elderly, a nursing robot, or even a psychologist would be incredibly deficient if it didn't understand emotions. I don't think using purely content that follows the scientific method would be sufficient for such an AI.


Good points. I especially like the claustrophobia one in planes.

On the other side, there are textbooks about "emotions". I own a Cognitive Behaviour Therapy manual (it seemed interesting) that goes step by step through how to conduct a session (it is aimed at therapists, or at readers acting as their own therapist). As a textbook, similar professional references could be more useful than, say, Reddit or some blogs. Those other sources on the internet are mostly derived from the reference materials in the field, or from personal anecdotes.

I can imagine textbooks exist on how to deal with people on the spectrum, or to help them recognize social cues, or on what symptoms of depression look like.

I am still skeptical. You did not say anything of the sort, but the scientific method is not the opposite of emotions. It shows us what we know of emotions so far. Psychology has its reproducibility crisis, but it is our best attempt.

Even in the case of planes, I would not be surprised if an engineering manual mentions a minimum space for passengers to be comfortable, as part of regulations, or any other data that might as a side-effect solve the claustrophobia issue.

TL;DR: Scientific knowledge is not antithetical to "human" knowledge. Scientific knowledge is what we actually know. Otherwise we have mysticism, or faith.


I'm still skeptical, tbh. The scientific method in psychology seems like it could only ever work in a very probabilistic manner since, unlike physical phenomena, humans respond very differently to the same stimulus depending on their own unique circumstances.

For example: People react very differently to being offered a bacon pizza depending on (at least) the time (not for breakfast, thanks), location (funerals are right out), their pizza topping preferences, their faith, whether they've already just eaten, how much they trust the person offering the pizza, whether they're currently in a group or not (is there enough for everyone?), whether they've ever had any traumatic experiences with pizza, whether any other foods are available, whether they have dinner with other people planned, their current dietary restrictions, whether they're ill right now, whether they know you are doing pizza-based experiments on them, etc. etc.

All of these might radically alter the response you get. It might be possible to get enough information about someone to make a reasonable guess, but you can never know if your model is complete enough. Even if the same inputs occur twice, the output might still be different based on something you cannot reasonably know. So I think it would be very difficult to construct a predictable stimulus/response model for individuals. Groups might be easier because the differences sometimes average out, but quite often the internal communication leads to feedback loops that disturb any predictions you were trying to make about their behavior.

Btw there is a third option beyond "scientific knowledge" and "faith", which is simply "we do not know and may never know" without ever getting to fill in that gap. Things like "what, if anything, created the universe" fall in that category as we cannot observe such a thing. And per the Heisenberg uncertainty principle, the more precisely we know a particle's position, the less precisely we can know its momentum. Accurate models of people could be similar: it might simply be that they cannot be meaningfully reduced to simple formulas.


Thanks for your reasoned answer. I agree with what you are saying. A couple more comments, not to correct anything but simply for conversation.

> Btw there is a third option beyond "scientific knowledge" and "faith"

The three I usually see mentioned are: I know because of an "argument" (reason/the scientific method); I believe (faith); and mysticism (I just know / I had a revelation). I agree with you: different ways of not taking a position are positions in themselves. Skepticism, nihilism, etc.

> Accurate models of people could be similar: it might simply be that they cannot be meaningfully reduced to simple formulas.

I agree. All models are false but some are useful.


Only 1/7 of the data they are using is synthetic though. The other 6/7 was just filtered from a larger dataset, not refined in any way.


The title seems rather speculative given that they didn't use any textbooks as training data, or even anything especially similar to textbooks. They used mostly StackOverflow data and source code from The Stack dataset, and also explanations, examples, and exercises generated by GPT3.5.


Exactly. Considering that the model that generated their exercises was also trained on Stack Overflow data, one might say that what you definitely need for LLM training, if not all you need, is Stack Overflow data. People keep taking it for granted, but I don't think it will continue that way for long.


I was hoping this would be about learning on your own, but it's "AI" stuff.


The title is a reference to the famous "Attention is all you need" paper.

https://arxiv.org/abs/1706.03762


Apparently, when coming up with titles for papers 'All you need is all you need'.


(Someone has to say the obvious thing.)

"'All you need' considered harmful"


The internet increased the speed of information propagation, but it also made 'cute' imitations become annoying way faster.

I guess it shouldn't be surprising. With reams of information piling up, it can be hard to get noticed. So you look for gimmicks to get attention (hah!). The trouble is that you are not unique and a thousand other people have the same idea... and your cute imitation is just tired.


Well, of course, which is why I prefixed the imitation with a self-referential remark in parentheses, meant to turn a low-effort attempt at humor into a meta-joke, that the audience would consider clever.

It used to work, but judging by your diatribe, maybe it got tired already and I didn't notice.


Oh, sorry, I was still referencing the proliferation of 'All you need is <blank>', not criticizing your comment.

I can totally see how what I posted could be interpreted that way, though.


My personal fave is “Torch.manual_seed(3407) is all you need”


The next step change for ML will be some kind of broad model (like the LLM equivalent of JEPA) outlining the answer from a "big picture" view and then using an ensemble of small, precise models like this one to fill in the details.


> using a selection of "textbook quality" data from the web ... and synthetically generated textbooks and exercises with GPT-3.5

Is this common, training an LLM on another LLM's generated output? How do you avoid "bad code" from GPT, if not outright hallucinations?


> Is this common, training an LLM on another LLM's generated output?

It's not uncommon. And at the level the SOTA LLMs are at now, it'll only become more common.

It happens because we're at the stage where GPT-3.5/4 can generate much better data (or hit more encompassing or general distributions with the right instructions) than what the majority of LLMs will be typically trained on.

Also, GPT-4 can recheck output for bugs, mistakes, errors, etc.

It may not be perfect but neither would "natural data" and depending on exactly what kind of data, it might be as close to perfect as you'll get - https://arxiv.org/abs/2305.07759


When humans think, this is what they are doing: retraining themselves on their own learned data. In other words, knowledge is not a conserved quantity. Of course, the extent of the extra knowledge you can produce without access to the environment is limited.


Do you have any actual idea if that's the case or are you just saying stuff?

I've read that people reorganize information when they sleep; I wouldn't say that's "thinking".


It was intended as philosophy (“just saying stuff”) rather than neuroscience. i.e. a black box rather than white box perspective on the mind.

The point is that you don’t need a recent external environment to grow your knowledge in certain ways.

For example, if you locked a sufficiently gifted and immortal mathematician in prison with no knowledge of the outside world after 2021, he can still prove new theorems forever afterwards (provided he has enough coffee). He doesn’t need access to recent mathematical results to keep proving theorems and growing his knowledge base - that’s simply an accelerant. In fact the longer he stays in prison, the more new theorems he can prove because he has ever more lemmas.

Whereas if you locked a sufficiently gifted and immortal chemist in prison with no knowledge after 2021, she might be able to synthesize some new knowledge at first (meta analysis, etc), but eventually that would run dry and she wouldn’t be able to say anything new about chemistry anymore without any equipment.


What about dreaming? I think when I dream, sometimes even finding solutions to problems for my waking self.


Don't look at sleeping. Look at your inner voice, if you have one, and the back-and-forth you're doing with it. This is a form of thinking, and it is also clearly feeding your mind's output back into it as input.


I have never heard an "inner voice" when sleeping or dreaming, so I have no idea what you're referring to.

I'd say that inner voices are mostly stupid idiots :)


The "inner voice" isn't a sleeping thing, it's an awake thing.

I guess you may be one of the presumably minority of people who doesn't have an inner narrative? I'm having a hard time imagining how this works, but then again, I'm aphantasic, and a lot of people have trouble imagining how that works too.


So in dreams you have conversations with yourself? And who are you, Temporal? A researcher? A neurologist? Just a guy on the internet who reads a lot of comments?


If you think about it, how do humans learn things?

A child gets a lot of their knowledge by randomly copying what grownups do.


Exactly.

The False Promise of Imitating Proprietary LLMs (UC Berkeley, 25/May/2023)

https://arxiv.org/abs/2305.15717


This is probably why OpenAI sets the knowledge cutoff when it does.


   "displays surprising emergent properties"
Section 3 states this "includes managing intricate algorithmic tasks." I take this as meaning it can write new code (not regurgitation). It is impressive that it can write code as described in the prompt, but I didn't think this was considered "emergent" but rather "generalization." What is the difference?


There is no difference; "emergent" is another hype term used in place of "discontinuous generalisation".

I.e., with 1MB of data you don't get generalisation; with 1TB you do. So they call that emergence.

I'm almost at the point where I can let these things go; but each new bit of mystifying hype, another stone falls in my shoe.


What is the "mirage", then, that folks are claiming in LLMs, as described in the Stanford research?

https://hai.stanford.edu/news/ais-ostensible-emergent-abilit...


Stanford is the centre of the AI hype. They are a profoundly mendacious VC-seeking institution, in this matter.

None involved know what emergence means; they're engineers chasing and creating hype.


In this case they are squashing the hype by saying there is no real emergent behavior.


I think you're just inventing your own definition of the word "emergent", then getting on your high horse when people, quite obviously, don't use the same definition as the one you just made up.


Yeah, the way I read "emergent" is as "surprising" to our intuition, but technically expected. So I read this as surprisingly surprising performance.


The English explanations you find first for "emergent" are not really fitting here. Interestingly, the German explanation for the same word seems to be much better.

I'm pretty sure they mean emergent as in emergent gameplay.

Here's a good explanation for that:

> Emergent gameplay refers to complex situations in video games, board games, or table top role-playing games that emerge from the interaction of relatively simple game mechanics.

So emergent is not inherently surprising. Games like Dwarf Fortress or Rimworld have a lot of emergent complexity but most of it was not surprising.

So in the context of LLM it means that complex reasoning is emergent based on simple underlying mechanics.

And since we thought humans' complex reasoning skills were unique, we are surprised by their emergence in LLMs.

PS: The explanation for "emergence" is much better:

> In philosophy, systems theory, science, and art, emergence occurs when a complex entity has properties or behaviors that its parts do not have on their own, and emerge only when they interact in a wider whole.


Yes, so this is the meaning of the word I'm rejecting. NN models do not acquire emergent properties.

Human reasoning (game play, etc.) is emergent in the sense that it is irreducible to its parts. "Generalisation in high-parameter training regimes" is fully reducible. It is not emergence in any relevant sense of the term.


Sorry, I meant the way I read people's colloquial use of it toward these LLMs. They aren't actually displaying emergence, as this level of performance was predicted.


This is a misleading title. GPT-3.5 is all you need!

Their whole method relies on getting GPT-3.5 to produce examples and then training a network on those examples. This is a run-of-the-mill method called distillation.

There's nothing new or special here.
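
In sketch form, the recipe being described (query_teacher and train_student are hypothetical stand-ins; strictly speaking this is the sequence-level flavour of distillation, imitating sampled outputs rather than matching the teacher's logits):

    def distill(prompts, query_teacher, train_student):
        # query_teacher(prompt) -> a teacher completion (e.g. a GPT-3.5 API call);
        # train_student(pairs) -> a small model fine-tuned on those completions.
        synthetic_dataset = [(p, query_teacher(p)) for p in prompts]
        return train_student(synthetic_dataset)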


'All you need' is going to become an anti-signal very quickly.


All you need considered harmful


The cool kids now use 'everything everywhere all at once'

https://arxiv.org/search/?query=everything+everywhere+all+at...


Is your company still using buzzwords? You're behind the times, friend, buzzphrases are all the rage now!


Ironically, if they are using attention methods, then dictionaries are not the only thing you'll need.


But it's still only 50-something percent on HumanEval, right? So to really prove this out you'd still need to try making a much larger model.

I wish someone would do something like that but not use OpenAI's models to create the training data, because then supposedly it can't be used for commercial purposes.

Not that it's easy to create a lot of high quality training data.


Looking at the `Educational values deemed by the filter` table on page 5, I can't help wondering whether compressibility might not be a usable proxy for educational value. High compressibility would imply low information content. High marginal compressibility when added to a large dataset would imply the information is already in there. This is basically Normalised Compression Distance all over again - it's only a rough proxy, but I guess the question is whether it could work at all, rather than is it as good as this result.

If that worked as a proxy value, you could sidestep needing GPT-4 at all.
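
A crude sketch of the idea with zlib (which is only a stand-in: its ~32KB window can't really "know" a large corpus, so a serious attempt would need a compressor with far more context, e.g. zstd with a trained dictionary; the toy corpus and candidate below are made up for illustration):

    import zlib

    def clen(data: bytes) -> int:
        # Compressed length in bytes; zlib is a crude stand-in for a real compressor.
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # Normalised Compression Distance: near 0 when y adds little beyond x.
        cx, cy, cxy = clen(x), clen(y), clen(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    # Toy stand-ins for "dataset so far" and a candidate document.
    corpus = b"def add(a, b):\n    return a + b\n" * 200
    candidate = b"def sub(a, b):\n    return a - b\n"

    # Marginal cost: extra compressed bytes the candidate adds on top of the corpus.
    marginal = clen(corpus + candidate) - clen(corpus)
    print(marginal / len(candidate), ncd(corpus, candidate))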


From page 6, when they describe their synthetic textbook dataset:

"Consider the matrix A = np.array([[1, 2], [2, 4]]). We can check if this matrix is singular or nonsingular using the determinant function. [...]"

No. The determinant is not a suitable way to do that. A proper way to numerically measure singularity would be to compute the condition number of the matrix (the ratio of its largest to smallest singular value).


While you're correct that condition number is a more robust numerical method for arbitrary matrices, the determinant is certainly suitable for many matrices. For small matrices with small integer values such as this one, there is no issue with the determinant.

There is no one bulletproof general method to approximate mathematical calculations with floating point numbers. More context is generally required, including the actual problem that is being approximated, to determine if a method is reliable. Painting this as a black and white situation where the determinant is wrong and the condition number is right gives a misleading picture of how we evaluate numerical methods for fit-to-purpose.


Am I misreading this, or is it really about a 2x2 matrix? (In which case computing the determinant involves just two multiplications and one subtraction.)


They define a generic "is_singular" function and test it with a 2x2 matrix.

The problem with the determinant is not about performance. It is just useless for determining if a matrix is singular. The thing that gives it away is that the determinant is influenced by a rescaling of the matrix:

det(s A) = s^n det(A) where A is a n x n matrix

As an example, would you say that [[1e-10, 0], [0, 1e-10]] is singular? It has condition number 1.
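
To make that concrete, a NumPy sketch (the second matrix is the [[1, 2], [2, 4]] example from the paper's synthetic textbook excerpt above):

    import numpy as np

    A = np.array([[1e-10, 0.0],
                  [0.0, 1e-10]])   # a rescaled identity
    print(np.linalg.det(A))        # ~1e-20: a naive "det == 0?" test would call this singular
    print(np.linalg.cond(A))       # 1.0: the ratio of singular values says perfectly conditioned

    B = np.array([[1.0, 2.0],
                  [2.0, 4.0]])     # the paper's example, genuinely rank-deficient
    print(np.linalg.det(B))        # 0.0 here, but only because the entries are tiny integers
    print(np.linalg.cond(B))       # inf (or astronomically large): flags the rank deficiency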


It does work in theory and for integer matrices. Such an algorithm might not be what you'd use in practice when numerical stability is a concern, but the condition number can certainly be defined in another textbook, and you just need more context to tell the model to use another method.


Edit: Sorry, I completely messed up my original answer here. A better version:

Let's say we are in a setting where we only work with integers. A matrix is invertible iff its determinant is invertible in the underlying ring. The only invertible elements in Z are -1 and 1.

So, the code is also incorrect in the integer setting. Here, we should not check for 0, but for -1 or 1.


If I'm reading you correctly, you'd also need to flip the response to the check: where the original test for 0 determines that a matrix is singular if it does find 0, the new test for ±1 should determine that a matrix is singular if it does not find ±1?


Yes, exactly.
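
Concretely, a tiny sketch of that corrected check (the helper name is hypothetical, and rounding a floating-point determinant is only safe for small integer matrices; exact integer arithmetic would be better in general):

    import numpy as np

    def is_singular_over_Z(A) -> bool:
        # Over the integers, A is invertible iff det(A) is a unit of Z, i.e. +1 or -1.
        d = round(np.linalg.det(np.asarray(A, dtype=float)))
        return abs(d) != 1

    print(is_singular_over_Z([[1, 2], [2, 4]]))   # True: det 0, no inverse anywhere
    print(is_singular_over_Z([[2, 0], [0, 2]]))   # True: det 4, inverse exists over Q but not Z
    print(is_singular_over_Z([[1, 1], [0, 1]]))   # False: det 1, inverse has integer entries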


Saying that the determinant is useless for determining whether a matrix is singular is incorrect, or misleading at best.

This is purely a matter of numerical stability. Of course [[1e-10, 0], [0, 1e-10]] is nonsingular, and its determinant is 1e-20, which does not equal zero.

Yes, when it comes to floating point issues we might want to use something else, and that's a valid complaint when it comes to NumPy code, but from a theoretical perspective the determinant is an excellent tool to determine singularity.


Of course. Theoretically, the determinant answers the binary question "singular" or "nonsingular".

Numerically, such a binary answer is pretty useless. Here, we need a measure of how singular/nonsingular a matrix is relative to the numerical precision we are working with.


The words "are all you need" are all you need


All you need is weed, weed, weed is all you need.


We have so much to learn and optimize; I wonder how far we might get. But it was evident to me from the start that, e.g., harvesting Reddit and other sources and just crossing your fingers, hoping for the best, may not be the optimal method.

Quality is everything, and part of quality is having the data be structured and geared towards an AI, which may not be identical to being geared towards human reading.


Does using synthetic data generated by GPT have the same effect as being RLHF-aligned by GPT-3.5, kind of like aligning the NN to get performance similar to GPT's?


It's interesting that this paper and the Orca LLM paper from Microsoft both use GPT-3/4 model outputs to train "powerful" models. The big question is: will they allow average Joes and businesses to do the same on their own data? I doubt it, considering it breaks OpenAI's terms of use at present. Will open source lead the way on this? Bring on Llama v2.


Overall a very interesting result.

I’m not sure why they compare to a list of results for the MBPP benchmark but don’t seem to include these results which are much better:

https://paperswithcode.com/sota/code-generation-on-mbpp


why don't they use a real textbook?


Because it’d either be expensive or illegal, and all content owners are paying attention now


Is it possible to test or access this model through a web interface or any other means?


Interesting that two of the most powerful techniques that could essentially kill ChatGPT, Orca and Textbooks, both come from Microsoft.


It’s interesting to me how a lot of these new models use GPT for the synthetic data generation to then train a better or more focused model


Model binary to Huggingface or it didn’t happen.


Can't believe I had to scroll that much to see this comment. Upvoted and Thanks.


tldr;

"Just as a comprehensive, well-crafted textbook can provide a student with the necessary knowledge to master a new subject, our work demonstrates the remarkable impact of high-quality data in honing a language model’s proficiency in code-generation tasks. By crafting “textbook quality” data we were able to train a model that surpasses almost all open-source models on coding benchmarks such as HumanEval and MBPP despite being 10x smaller in model size and 100x smaller in dataset size. We hypothesize that such high quality data dramatically improves the learning efficiency of language models for code as they provide clear, self-contained, instructive, and balanced examples of coding concepts and skills."


> phi-1 attains pass@1 accuracy 50.6% on HumanEval

So, a coin toss effectively, right?


No, scoring 50% on a test is not the same as a coin toss, for very obvious reasons.


Only if the questions are binary (yes/no).

If it's multiple-choice or free-text, random answering will score far below 50%.
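
HumanEval isn't multiple choice at all: pass@1 counts a problem only if a generated program passes all of its unit tests, so random text scores essentially 0%. For reference, a sketch of the standard unbiased pass@k estimator (as I recall it from the original HumanEval paper: n samples drawn per problem, c of which pass):

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Probability that at least one of k samples (out of n drawn, c correct) passes.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. one problem, 200 samples drawn, 101 of them pass the tests:
    print(pass_at_k(200, 101, 1))   # ~0.505; average this over problems to get pass@1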


gg


I bet it still gaslights you with invented facts.


Sam Altman has said it will take 1-2 years to solve the hallucination issue.


So it’ll be solved a little after Tesla achieves full self-driving, and just before we get fusion power. Great.


Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li

Actually, it turns out that impact factor is all you need.


Turns out, all you need is, indeed, all you need.


I vote the title meme “… is all you need” end now.


I like the reasoning here. If you think about it, to end this madness a vote is all you need.


Suggested title: "Is all you need considered harmful"


Myths data scientists believe about "is all you need considered harmful"


The unreasonable effectiveness of considered harmful essays is all you need


All you need hate him!


Harm is all you need


The whole NLP field is full of memes. Just look at the model names: Llama, Gorilla, Jarvis, Falcon, Vicuña, Alpaca, etc.


Not just NLP, the whole AI field; see YOLO and NeRF.


Then there are all the Sesame Street names for NLP models. https://i.redd.it/048rybz0yuv81.png


Over-simplified requirements for complex topics is all you need



