If the purpose is to identify text that can be used as training data, in some ways it makes sense to me to mark anything and everything that isn't hand-typed as AI-generated.

Take your last example: to me, the concept of "proper scientific tone" exists because humans hand-typed or wrote in a certain way. If we use AI-edited or AI-transformed text as a source for what "proper scientific tone" looks like, we could still end up with an echo chamber where AI biases toward certain words and phrases feed into the training data for the next round.

Being strict about how we mark text could mean a world where 99% of text is marked as AI-touched and less than 1% is marked as human-originated. That's still plenty of text to train on, though such a split could also arguably introduce its own (measurable) biases...