After skimming the paper, I have a lot of doubts about this research, both methodologically and in principle.
1. All of this is tested only on GPT-3.5.
2. "Accuracy" is an ambiguous term here. What we really want is sensitivity(how likely the test is to identify true positives) and specificity(same for true negatives). Maybe this is buried in the text somewhere but I couldn't find it.
3. They only covered a very narrow niche, namely academic papers. Doing machine learning on a highly selective dataset is definitely easier, and the results may not generalise well.
4. I have a strong feeling human language will evolve to be harder and harder to distinguish from GPT output. That means these results could be highly sensitive to the time period the non-GPT writing was taken from.
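To make point 2 concrete, here is a minimal sketch of the two numbers I'd want reported instead of a single "accuracy" figure. The labels and predictions are toy values, not anything from the paper:

```python
# y_true: 1 = GPT-generated, 0 = human-written; y_pred: the detector's calls.
def sensitivity_specificity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # true-positive rate: GPT text flagged as GPT
    specificity = tn / (tn + fp)  # true-negative rate: human text left alone
    return sensitivity, specificity

# Toy example: 3 GPT paragraphs, 5 human ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(sensitivity_specificity(y_true, y_pred))  # (0.667, 0.8)
```

On an imbalanced test set a detector can post high "accuracy" while still flagging plenty of human writing as machine-generated; reporting the two rates separately makes that failure mode visible.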
I've been thinking about the strange loop that emerges when enough humans hire human-emulating machines to ghostwrite enough of their correspondence. Book reports, inane business emails, wedding vows. All of it. The full spectrum of the interhuman social experience, as expressed in and through language.
I tend to think that, as LLMs become ubiquitous and our interfaces to them get more and more "natural," the lines between human-authored text and LLM outputs will converge and get really, really blurry – more so, even, than they already are. The word "singularity" doesn't quite capture what I'm getting at, but "Hamiltonian" might. Speech acts, multinomial logits. Tomato, tomaahto. Aren't they ultimately just two symplectic representations of the very same thing – the richness and expressivity of human language? One can't evolve independently of the other. Or can it?
They also tested the wrong thing. They should have fine-tuned a model on the "real perspectives", had that model generate paragraphs, and then compared detector performance on "generated real perspectives" vs. "real perspectives"; a sketch of that harness is below.
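A minimal sketch of that comparison, assuming you supply your own fine-tuning stack and detector. fine_tune, generate_paragraphs, and detector are hypothetical placeholders for whatever you use (say, a Hugging Face training run and any public GPT detector), not anything from the paper:

```python
def fine_tune(base_model, real_texts):
    """Placeholder: return base_model fine-tuned on the human-written corpus."""
    raise NotImplementedError

def generate_paragraphs(model, n):
    """Placeholder: sample n paragraphs from the fine-tuned model."""
    raise NotImplementedError

def detector(text):
    """Placeholder: True if the detector flags text as machine-generated."""
    raise NotImplementedError

def flag_rate(texts):
    return sum(detector(t) for t in texts) / len(texts)

def run_experiment(base_model, real_texts):
    tuned = fine_tune(base_model, real_texts)
    fakes = generate_paragraphs(tuned, n=len(real_texts))
    # If the detector flags the generated "real perspectives" no more often
    # than the genuine ones, it isn't actually detecting generation.
    return flag_rate(fakes), flag_rate(real_texts)
```

The point of the design is that the two text pools differ only in authorship, not in topic or register, so any gap in flag rates is attributable to generation itself.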