
I can understand that datasets from years before ChatGPT wouldn't contain any LLM-generated text, but how strongly does the year actually correlate with how much LLM text is in a dataset? Wouldn't special-purpose datasets with varying ratios of human and LLM text be better for testing the effects of "AI contamination"?
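(Roughly what I have in mind, with stand-in corpora and a made-up mix_corpus helper, just to illustrate the "varying ratios" idea:)

    import random

    # stand-ins for real corpora
    human_docs = ["human doc %d" % i for i in range(100_000)]
    llm_docs = ["llm doc %d" % i for i in range(100_000)]

    def mix_corpus(human, llm, llm_ratio, n, seed=0):
        # sample n docs, llm_ratio of them LLM-generated, then shuffle
        rng = random.Random(seed)
        n_llm = round(n * llm_ratio)
        docs = rng.sample(llm, n_llm) + rng.sample(human, n - n_llm)
        rng.shuffle(docs)
        return docs

    # test sets at fixed contamination levels: 0%, 10%, ..., 50%
    test_sets = {r: mix_corpus(human_docs, llm_docs, r, 10_000)
                 for r in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)}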


Not if the goal is to test the quality of real datasets, and that was the goal.

Finding this weird result about newer datasets generally outperforming older ones was more of a side effect of having a dataset evaluation system.

If you're trying to examine AI contamination specifically? There are many variables, and trying to capture them all in a laboratory dataset is rather involved.

For one, AI data out in the wild is "enriched": it's usually selected by users before being published (a human picking the best of 4 generations, say), it gathers human interaction like likes and comments, and it's far more likely to get spread around if it's novel, amusing, or high quality than if it's generic and bland. How do you replicate that in a lab setup? On a tight budget?
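A crude simulation of that selection effect (the quality scores, the best-of-4 step, and the spread probability are all made up, only meant to show why a uniform lab mix isn't the same as what actually ends up in a scrape):

    import math, random

    rng = random.Random(0)

    def candidate_quality():
        # latent "quality" of a single raw generation
        return rng.gauss(0.0, 1.0)

    def published_quality(k=4):
        # the user generates k times and posts the one they like best
        return max(candidate_quality() for _ in range(k))

    def gets_spread(q):
        # better posts are more likely to be reshared and scraped
        return rng.random() < 1 / (1 + math.exp(-2 * q))

    raw = [candidate_quality() for _ in range(10_000)]             # naive lab sample
    wild = [q for q in (published_quality() for _ in range(10_000))
            if gets_spread(q)]                                     # "enriched" wild sample

    print(sum(raw) / len(raw))    # ~0.0
    print(sum(wild) / len(wild))  # noticeably higher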



