The 6T-token dataset is surely a high-quality subset / refined extract from much larger public datasets. It's misleading to compare their sizes directly.
We don't know what the dataset is. "High-quality subset from much larger public datasets" is not just inherently speculation; it's flat-out wrong, as no such public datasets existed when GPT-4 was trained.