Yes, but GP's idea of segregating AI-generated content is worth considering.
If you're training an AI, do you want it to get trained on other AIs' output? That might actually be interesting, but I think you might then want to have both: an AI trained on everything, and another trained on everything except other AIs' output. So perhaps an HTML tag for indicating "this is AI-generated" might be a good idea.
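For concreteness, here is a rough sketch of what such a marker could look like. The names are purely hypothetical; as far as I know no such standard exists:

    <!-- Page-level: hypothetical meta tag declaring the whole document AI-generated -->
    <meta name="content-origin" content="ai-generated">

    <!-- Element-level: hypothetical data attribute marking a single AI-written passage -->
    <blockquote data-ai-generated="true">
      Text produced by a language model.
    </blockquote>

A crawler building a training corpus could then keep two sets: one that ingests everything, and one that drops any document or element carrying the marker.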
My 2c is that it is worthwhile to train on AI generated content that has obtained some level of human approval or interest, as a form of extended RLHF loop.
Ok, but how do you denote that approval? What if you partially approve of that content? ("Overall this is correct, but this little nugget is hallucinated.")
It apparently doesn't matter unless you somehow consider the entire Internet to be correct. They didn't only feed LLMs correct info. It all just got shoveled in and here we are.
I can see the value of labeling all AI-generated content so that an AI can be trained on purely non-AI-generated content.
But I don’t think that’s a reasonable goal. Pragmatic example: there are almost no optional HTML tags or optional HTTP headers that are used anywhere close to 100% of the time they apply.
Also, I think the field is already muddy, even before the game starts. Spell checkers, Grammarly, and machine translation all have AI components and likely affect most of the human-generated text on the internet. The heuristic of “one drop of AI” is not useful, and any heuristic more complicated than “one drop” introduces too much subjective complexity for a Boolean data type.
Yes, it's impossible. We'd have had to start years ago, and even then people wouldn't have the discipline to label content correctly, or at all. It can't be done.
Just ask anyone who works in teaching, or try any of the numerous AI detectors (they're all faulty).
Any current technology that can be used to accurately detect pre-AI content would necessarily imply that the same technology could be used to train an AI to generate content that skirts past the detector. Sure, there is going to be a lag time, but eventually we will run out of non-AI content.
No, that's the problem. Pre-AI era content a) is often not dated, so not identifiable as such, and b) also gets out of date. What was thought to be true 20 years ago might not be thought to be true today. Search for the "half-life of facts".