
The 6T token dataset is surely a high-quality subset or refined extract of much larger public datasets. It's misleading to compare their sizes directly.



We don't know what the dataset is. "high quality subset from much larger public datasets" is not just inherently speculative, it's flat-out wrong, since no such public datasets existed when GPT-4 was trained.



