There is the web of trust. If you really trust a person to say that their stuff isn't AI, then that's probably the most reliable way of knowing. For example, I have a few friends and I know their stuff isn't AI edited because they hate it too. Of course, there is no 100% certainty but it's as certain as knowing that they're your friend at least.
But the question is about whether or not AI can continue to be trained on these datasets. How are scrapers going to quantify trust?
E: Never mind, I didn’t read the OP. I had assumed it was to do with identifying sources of uncontaminated content for the purposes of training models.