
36 GB of compressed text is a lot though.



That's about 50B tokens. It can be a lot depending on what you want to do, but it's only about 10% of what GPT-3 (the benchmark for a world class GPT model) used.
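For anyone curious how 36 GB of compressed text maps to roughly 50B tokens, here is a minimal back-of-envelope sketch. The compression ratio and bytes-per-token values are assumptions chosen to illustrate the arithmetic, not measurements of books3 itself, and shifting them changes the result considerably:

    # Rough token estimate for books3 (assumed figures, not measurements).
    compressed_gb = 36           # books3 archive, compressed
    compression_ratio = 2.8      # assumed expansion of compressed plain text
    bytes_per_token = 2.0        # assumed average bytes per BPE token

    plain_text_bytes = compressed_gb * 1e9 * compression_ratio
    tokens = plain_text_bytes / bytes_per_token
    print(f"~{tokens / 1e9:.0f}B tokens")              # ~50B with these numbers

    # GPT-3's paper lists ~500B tokens across its full training mix.
    gpt3_dataset_tokens = 500e9
    print(f"~{tokens / gpt3_dataset_tokens:.0%} of GPT-3's dataset")  # ~10%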


For what it's worth, we're serious about replicating GPT-3. books3 is just one piece. You will notice I never claimed equivalency to GPT-3's training data.

books3 may be 10%, but The Pile is building the rest:

https://twitter.com/arankomatsuzaki/status/13204141418954874...

https://github.com/EleutherAI/The-Pile

https://www.eleuther.ai/get-involved

https://media.discordapp.net/attachments/735217892517216366/...

https://media.discordapp.net/attachments/735217892517216366/...


And again, I find books3 extremely cool and important work on your part; I'm looking forward to the rest.

I just have a minor gripe with saying that "now we can train a world class GPT model" thanks to that; as you said, it's just one piece, and as a typical HNist I had to point it out :).


Believe it or not, I appreciate and relate to that sentiment.

But after spending roughly one year acquiring knowledge related to this work, I feel I can say with a fairly high degree of certainty that this dataset alone is enough to train a model that will achieve "world class" status in some area. Writing books, perhaps.

Which part of my logic do you feel is mistaken, and why? I am actually quite interested to hear thoughts from someone who is very pedantic about such things.


I don't think you are mistaken. I guess it's just that there is only so much information you can convey in a tweet; when I read "world class GPT model" I understand a model that will beat (or at least equal) GPT-3 on general NLG, which it seems is not what you meant.





