
I can understand that datasets from years before ChatGPT wouldn't contain any LLM-generated text, but how strongly does the year actually correlate with how much LLM text is in a dataset? Wouldn't special-purpose datasets with varying ratios of human and LLM text be better for testing the effects of "AI contamination"?
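(Roughly what I have in mind, with stand-in corpora and a made-up mix_corpus helper, just to illustrate the "varying ratios" idea:)

    import random

    # stand-ins for real corpora
    human_docs = ["human doc %d" % i for i in range(100_000)]
    llm_docs = ["llm doc %d" % i for i in range(100_000)]

    def mix_corpus(human, llm, llm_ratio, n, seed=0):
        # sample n docs, llm_ratio of them LLM-generated, then shuffle
        rng = random.Random(seed)
        n_llm = round(n * llm_ratio)
        docs = rng.sample(llm, n_llm) + rng.sample(human, n - n_llm)
        rng.shuffle(docs)
        return docs

    # test sets at fixed contamination levels: 0%, 10%, ..., 50%
    test_sets = {r: mix_corpus(human_docs, llm_docs, r, 10_000)
                 for r in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)}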


Not if the goal is to test the quality of real datasets, and that was the goal.

Finding this weird result about newer datasets generally outperforming older ones was more of a side effect of having a dataset evaluation system.

If you're trying to examine AI contamination specifically? There are many variables, and trying to capture them all in a laboratory dataset is rather involved.

For one, AI data out in the wild is "enriched": it's usually selected by users before being published (a human picking the best of 4 generations, say), it gathers human interaction like likes and comments, and it's far more likely to get spread around if it's novel, amusing, or high quality than if it's generic and bland. How do you replicate that in a lab setup? On a tight budget?
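A crude simulation of that selection effect (the quality scores, the best-of-4 step, and the spread probability are all made up, only meant to show why a uniform lab mix isn't the same as what actually ends up in a scrape):

    import math, random

    rng = random.Random(0)

    def candidate_quality():
        # latent "quality" of a single raw generation
        return rng.gauss(0.0, 1.0)

    def published_quality(k=4):
        # the user generates k times and posts the one they like best
        return max(candidate_quality() for _ in range(k))

    def gets_spread(q):
        # better posts are more likely to be reshared and scraped
        return rng.random() < 1 / (1 + math.exp(-2 * q))

    raw = [candidate_quality() for _ in range(10_000)]             # naive lab sample
    wild = [q for q in (published_quality() for _ in range(10_000))
            if gets_spread(q)]                                     # "enriched" wild sample

    print(sum(raw) / len(raw))    # ~0.0
    print(sum(wild) / len(wild))  # noticeably higher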



