
36 GB of compressed text is a lot though.



That's about 50B tokens. It can be a lot depending on what you want to do, but it's only about 10% of what GPT-3 (the benchmark for a world class GPT model) used.
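For anyone curious how 36 GB of compressed text maps to roughly 50B tokens, here is a minimal back-of-envelope sketch. The compression ratio and bytes-per-token values are assumptions chosen to illustrate the arithmetic, not measurements of books3 itself, and shifting them changes the result considerably:

    # Rough token estimate for books3 (assumed figures, not measurements).
    compressed_gb = 36           # books3 archive, compressed
    compression_ratio = 2.8      # assumed expansion of compressed plain text
    bytes_per_token = 2.0        # assumed average bytes per BPE token

    plain_text_bytes = compressed_gb * 1e9 * compression_ratio
    tokens = plain_text_bytes / bytes_per_token
    print(f"~{tokens / 1e9:.0f}B tokens")              # ~50B with these numbers

    # GPT-3's paper lists ~500B tokens across its full training mix.
    gpt3_dataset_tokens = 500e9
    print(f"~{tokens / gpt3_dataset_tokens:.0%} of GPT-3's dataset")  # ~10%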


For what it's worth, we're serious about replicating GPT-3. books3 is just one piece. You will notice I never claimed equivalency to GPT-3's training data.

books3 may be 10%, but The Pile is building the rest:

https://twitter.com/arankomatsuzaki/status/13204141418954874...

https://github.com/EleutherAI/The-Pile

https://www.eleuther.ai/get-involved

https://media.discordapp.net/attachments/735217892517216366/...

https://media.discordapp.net/attachments/735217892517216366/...


And again, I find books3 extremely cool and important work on your part; I'm looking forward to the rest.

I just have a minor gripe with saying that "now we can train a world class GPT model" thanks to that; as you said, it's just one piece, and as a typical HNist I had to point it out :).


Believe it or not, I appreciate and relate to that sentiment.

But after spending roughly one year acquiring knowledge related to this work, I feel I can say with a fairly high degree of certainty that this dataset alone is enough to train a model that will achieve "world class" status in some area. Writing books, perhaps.

Which part of my logic do you feel is mistaken, and why? I am actually quite interested to hear thoughts from someone who is very pedantic about such things.


I don't think you are mistaken. I guess it's just that there is only so much information you can convey in a tweet; when I read "world class GPT model" I understand a model that will beat (or at least equal) GPT-3 on general NLG, which it seems is not what you meant.





