The 6T-token dataset is surely a high-quality subset / refined extract from much larger public datasets. It's misleading to compare their sizes directly.
We don't know what the dataset is. "High-quality subset from much larger public datasets" is not just inherently speculation; it's flat-out wrong, as no such public datasets existed when GPT-4 was trained.