The methodology is terrible. The prompting was as simple as: "Can you produce a 300- to 400-word summary on this topic: INSERT TOPIC HERE", where some example topics are:
A surprising fossil vertebrate
Stem cells remember insults
I can't see how that prompt is going to come up with anything comparable to the human text which is based on perspectives articles in Science.
And they don't report these numbers, but I can derive them from the tables:
Document false positive rate (human assigned as AI): 0%
Document false negative rate (AI assigned as human): 0%
Paragraph false positive rate (human assigned as AI): 14%
Paragraph false negative rate (AI assigned as human): 3%
In summary, though, this is a garbage-tier study, for entertainment only.
Also, their classifier just uses manual features rather than doing any sort of meaningful analysis. As in, the input to their model is a tuple of ~20 features consisting of things like whether the article contains the word "but" or the character "?".
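For concreteness, a hand-crafted-feature classifier of that kind is roughly this simple. The specific features and the logistic-regression choice below are assumptions for illustration, not the paper's actual ~20 features or model.

```python
# Rough sketch of a hand-crafted-feature classifier like the one described
# above. The feature list and the logistic-regression choice are assumptions
# for illustration, not the paper's actual ~20 features or model.
import numpy as np
from sklearn.linear_model import LogisticRegression

MARKER_WORDS = ["but", "however", "although", "because", "this"]

def features(text: str) -> list[float]:
    words = [w.lower().strip(".,;:!?\"'()") for w in text.split()]
    return [
        float("?" in text),                        # contains a question mark
        float(";" in text),                        # contains a semicolon
        float("(" in text),                        # contains a bracket
        float(len(words)),                         # length in words
        len(set(words)) / max(len(words), 1),      # lexical diversity
    ] + [float(w in words) for w in MARKER_WORDS]  # presence of marker words

def train(texts: list[str], labels: list[int]) -> LogisticRegression:
    # texts/labels would be human (0) and ChatGPT (1) paragraphs
    X = np.array([features(t) for t in texts])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```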
All of this is easily fixable by using better prompts. It's 99% effective for a small set of articles from one journal (sampling not specified), assuming that the synthetic samples are created with the minimum level of effort using GPT-3.5.
It's not that their methods are wrong, it's just that they're absolutely useless for anything beyond an extremely narrow range of data. In the ML field this is called overfitting.
It may be out of their expertise as chemists, but it may be well within their domain as educators. If their students are turning in GPT-written papers, then it's their problem.
I don't know what kinds of papers their students write. I don't recall writing a lot of papers in chemistry class, but they may be teaching a different kind of class. I could imagine them asking for papers on "careers in chemistry" or "history of some famous chemist" that students could cheat with GPT.
Even if that's the case, their paper should probably be more specific to the chemistry domain. Though perhaps they're called on to teach General Science classes as well, which could be even more likely to generate essay assignments prone to cheating.
Because it would be more directly inspired by real examples, and make their testing set more believable. Otherwise, they're more likely to be making elementary mistakes about what constitutes a good test of machine-generated text in general.
> The authors of the study are all from the chemistry department at the University of Kansas. Is this really the sort of paper they should be authoring?
Are chemists typically good at writing AI tools? Is the fact that they are professors interested in catching students cheating instead of adapting their methods relevant?
> Scientists are more likely to have a richer vocabulary and write longer paragraphs containing more diverse words than machines. They also use punctuation like question marks, brackets, semicolons more frequently than ChatGPT, except for speech marks used for quotations.
> Real science papers also use more equivocal language – like "however", "but", "although" as well as "this" and "because".
You can prompt ChatGPT to write in that style. For example, I have a semi-standard prompt I often use:
“Don’t just give a list of information; write in engaging and well-written paragraph form with transitions. (It’s ok to use some bullet points or lists, but the entire piece should not be a list.) Don’t use emojis and don’t overuse any of the following words (or their conjugates or other tenses): ensure, enable, seamless, solution, crucial, critical, vital, invaluable, essential, welcome, game-changer, unleash, or streamline. You can use each once or twice at most. Vary sentence length to improve readability.”
"Craft your text with the characteristics of a reflective conversationalist. Be sure to weave in personal insights, express opinions, hesitate, reconsider, use colloquial language, and make occasional minor errors to reflect human thought patterns."
Merged one I've been using with yours; it seems to output articles of higher quality.
"Write as a reflective human conversationalist, using transitional words and varied sentence types for conciseness and readability. Combine personal insights, opinions, and colloquial language with occasional very minor errors. Group similar info, follow the "one-idea rule," and avoid overusing certain words, emojis, and excessive lists. Vary sentence length to improve readability and engage the reader."
I see you are further along the path to prompting wisdom than op.
Positive directions are far superior to negative ones. LLMs don't understand the concept of negation. They work, in a way, by assigning tokens semantic representations that capture their relationships to other tokens. How could you encode the idea of a negated token?
I have wondered if you could use a similar mechanism to the repetition penalty to penalise chosen tokens that you want to downrank.
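Mechanically, that would look like biasing the logits of an "avoid" list at sampling time, much as a repetition penalty does. The sketch below is generic and assumes direct access to the model's next-token logits, which hosted APIs may not expose.

```python
# Generic sketch of downranking chosen tokens at sampling time, analogous to a
# repetition penalty applied to a fixed "avoid" list. `banned_ids` is
# hypothetical and would come from tokenising the words you want to suppress.
import numpy as np

def downrank(logits: np.ndarray, banned_ids: list[int], penalty: float = 5.0) -> np.ndarray:
    biased = logits.copy()
    biased[banned_ids] -= penalty            # push unwanted tokens down, not to -inf
    return biased

def sample(logits: np.ndarray, banned_ids: list[int]) -> int:
    z = downrank(logits, banned_ids)
    z -= z.max()                             # numerical stability before softmax
    probs = np.exp(z) / np.exp(z).sum()
    return int(np.random.choice(len(probs), p=probs))
```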
I think this idea of LLMs not understanding the concept of negation is false.
Whether you ask ChatGPT (GPT-4) to explain the concept of negation or ask it to do a task that leverages negation, you get sound results.
Example of a conversation leveraging negation:
User:
Hello GPT. Could you write me a sentence longer than 6 words ?
ChatGPT:
Of course, I'd be happy to help! Here's a sentence for you: "The sun set behind the mountains, casting a stunning array of colors across the evening sky."
User:
Now, could you write me a sentence not longer than 6 words ?
ChatGPT:
Absolutely, here it is: "Cats quietly nap in the sun."
It counts approximately in my experience. If you tell it to use 5 word sentences it might occasionally use 6 or 7 words but it will never use 30 words.
This is with GPT-4, it's possible GPT-3.5 is much worse.
I have a lot of doubts about this research both methodologically and in principle after skimming the paper.
1. All this is only on GPT-3.5.
2. "Accuracy" is an ambiguous term here. What we really want is sensitivity(how likely the test is to identify true positives) and specificity(same for true negatives). Maybe this is buried in the text somewhere but I couldn't find it.
3. They only did a very narrow niche, namely academic papers. Doing machine learning on a highly selective dataset is definitely easier, and the results may not generalise well.
4. I have a strong feeling human language will evolve towards being harder and harder to distinguish from GPT. That means these results could be highly sensitive to the time period from which the non-GPT writing was taken.
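To make the accuracy point concrete, here's a toy example with made-up numbers (not from the paper) showing how a high accuracy can coexist with useless sensitivity when the classes are imbalanced:

```python
# Toy illustration (made-up counts, not the paper's data): a "detector" that
# labels everything "human" on a dataset that is 90% human-written.
tp, fn = 0, 10     # AI-written samples: none caught, all missed
tn, fp = 90, 0     # human-written samples: all correctly passed

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 0.90 -- looks great
sensitivity = tp / (tp + fn)                    # 0.00 -- catches no AI text
specificity = tn / (tn + fp)                    # 1.00 -- never flags a human
print(accuracy, sensitivity, specificity)
```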
I've been thinking about the strange loop that emerges when enough humans hire human-emulating machines to ghostwrite enough of their correspondence. Book reports, inane business emails, wedding vows. All of it. The full spectrum of the interhuman social experience, as expressed in and through language.
I tend to think that, as LLMs become ubiquitous and our interfaces to them get more and more "natural," the lines between human-authored text and LLM outputs will converge and get really, really blurry – more so, even, than they already are. The word "singularity" doesn't quite capture what I'm getting at, but "Hamiltonian" might. Speech acts, multinomial logits. Tomato, tomaahto. Aren't they ultimately just two symplectic representations of the very same thing – the richness and expressivity of human language? One can't evolve independently of the other. Or can it?
They also tested the wrong thing. They should have fine-tuned a model on the "real perspectives", then had that model generate paragraphs, and then compared detector performance on "generated real perspectives" vs. "real perspectives".
The title was originally 100%, but the editor felt that was too unbelievable and ratcheted it back to just "greater than 99%". That way it still covers 100 but looks fancier; there's a math expression in the title, so it's gotta be legit!
There is no chance whatsoever that any tool will ever be able to reliably tell the difference between LLM and human content, and I can’t understand how anyone thinks such a thing is possible.
There's no mechanism of action for such a thing. The information would have to be encoded in the text and it isn't.
Honestly, the best way to think about it is to invoke the infinite-monkeys scenario, since believing this requires you to disprove the infinite monkey theorem.
Consider this thought experiment.
1) We will start with a piece of text that your detector is 100% certain was created by a GPT tool.
2) Now, prove that there is no way whatsoever for at least one human being to independently create this text.
If you can’t prove that, then your tool is bullshit.
There's a whole population out there that is fundamentally convinced that humans have some innate superiority over LLMs in terms of reasoning that will buy into this because they believe there will always be a difference for any text of meaningful length.
My main counter to that is that any such detector can only keep working if people can't access it, directly or indirectly, without huge consequences, because if it's accessible, it'll be probed and used to produce test cases to improve LLMs with and/or to train models to "spin"/postprocess LLM output. The latter can be done much "cheaper"; e.g., consider the article:
> Real science papers also use more equivocal language – like "however", "but", "although" as well as "this" and "because".
Even if you can't circumvent this just with careful prompting (and you probably can in this specific case), this kind of information alone is enough to know you need to adjust the frequency of those words to better match scientific papers, whether by fine-tuning on some science papers or training a small model to know when it's ok to interject them in a text.
At a minimum you need to host and rate limit access to such a detector, or you can just automate working around it.
AI to detect AI. I’m skeptical that AI will lead to the end of the world, but who knows. At this rate, the AIs will go to war on each other and we’ll just be casualties of the crossfire.
That's what I think about too: AI competition. Which would suck, because we'll probably die, and also the AI will spend its time preparing for wars instead of being awesome. It would probably look like biological evolution, AI spawning new AI to beat other AIs, and on and on.
I honestly hope they don’t crack this problem, since it’s created a lovely existential crisis for term papers. It’s forcing long-overdue innovation in how we assess knowledge.
Consider that any such detector or even method that is made available becomes a tool to finetune models to evade.
Alternatively "spinning" tools to create slightly modified content (for black hat SEO, to avoid duplicate detection) has been a thing for years, and you can bet tools that'll take model output and apply small models to them to mutate their wording without corrupting meaning will be a big market and will see evading AI detectors as part of their feature set.
It's an unwinnable battle that just produces competitive pressure towards making AIs write more like humans.
The trouble with percentages like this is that the last <1% are difficult to achieve but provide the most value. If you're a teacher looking to prevent cheating by students, you can't take the chance of falsely accusing one student out of 100.
It's not even that: this entire idea just misses the point entirely.
It doesn't matter if a very particular model $A$ under particular conditions can be identified by another particular model $C$. That can very easily be circumvented by using some other model (Vicuna, GPT-4, Claude, etc.) under different conditions to fool model $C$.
It's a constant race that will never be won and shouldn't even really be attempted.
One idea for determining whether content is generated by OpenAI that requires far less work is for OpenAI to just keep a database table of (document_id_hash, text_shingle_hash, shingle_weight_by_frequency) and build a system like old plagiarism detection. It's incredibly fast and simple, so it uses far less compute than a big ML model, at the cost of being an approximate and somewhat worse method.
Yes, yes, the API, other models, etc. But that was already a problem (see above).
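A minimal sketch of that shingle-table idea, using word 5-shingles and a plain in-memory dict standing in for the database. All names are illustrative, and the per-shingle frequency weighting is omitted for brevity.

```python
# Minimal sketch of the shingle-lookup idea above. A plain dict stands in for
# the (document_id_hash, text_shingle_hash, ...) table; names are illustrative
# and the frequency weighting is omitted for brevity.
import hashlib
from collections import defaultdict

def shingles(text: str, k: int = 5):
    words = text.lower().split()
    for i in range(len(words) - k + 1):
        yield hashlib.sha1(" ".join(words[i:i + k]).encode()).hexdigest()

index = defaultdict(set)                     # shingle_hash -> {document_id_hash}

def store(doc_id_hash: str, text: str):
    for h in shingles(text):
        index[h].add(doc_id_hash)            # record every generation served

def looks_generated(text: str, threshold: float = 0.5) -> bool:
    hs = list(shingles(text))
    if not hs:
        return False
    hits = sum(1 for h in hs if h in index)  # shingles seen in past generations
    return hits / len(hs) >= threshold
```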
I'll take your point further. If you have a college class of 1000 students and they submit 5 papers a year, that is 5000 papers. A 1% miss would be 50 student assignments accused of plagiarism. This isn't detecting cats; this is about lives being seriously affected.
Further, I'm curious how universal the accuracy is... do certain subsets incur more of the burden?
Very much education level and field dependent. It's certainly the case that high schools will grade on coursework more. And, further, many fields primarily grade based on papers, because these are the currency of knowledge, and research skills are difficult to test in an exam setting.
I think what will happen is that there will effectively be an arms race between the bot generation and detection sides. As anomalies are discovered in the distribution of content generated by bots vs. humans, bots will update to remove the anomaly, and researchers will try to find new anomalies. In the end, as bots get better, the two distributions will become indistinguishable in practice.
For any accessible detector you can automate the evasion. E.g., the crudest variant (sketched in code below) would be:
1. generate text
2. run it through the detector
3. finetune the model with the detector's result
4. repeat until the detector fails to detect
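A minimal sketch of that loop, with the generator, the detector, and the update step as hypothetical stand-in functions:

```python
# Crude sketch of the loop above. generate(), detector_score() and finetune()
# are hypothetical placeholders for an LLM sampling call, the detector being
# probed, and some update step (fine-tuning, RL, prompt tweaking) respectively.
def evade(generate, detector_score, finetune, threshold=0.5, max_rounds=1000):
    for _ in range(max_rounds):
        text = generate()
        score = detector_score(text)   # detector's probability of "AI-written"
        if score < threshold:          # detector no longer flags the output
            return text
        finetune(text, score)          # push the generator away from flagged text
    return None                        # gave up within the budget
```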
But every time the failure rate of the detector starts getting high, the other side has a lot of development work to do finding new patterns that won't also lead to lots of false positives.
So while getting to a point where it's totally impossible to detect may well take a long time, it'll go in cycles: the detector suddenly becomes hard to bypass for a while, then gradually gets easier, and repeat. It's highly unlikely it'll ever be impossible to bypass for any length of time.
The only way of keeping such a detector working for any length of time will be to make it expensive or very hard for adversaries to get access to, which means effectively relying on only giving access to trusted users.
This assumes the ability to perfectly model this function, which is in a sense what we're setting out to demonstrate. It can certainly not be taken for granted.
Firstly, it doesn't need to model it perfectly. It needs to model it closely enough that claiming to have detected it would cause more false positives than is acceptable, and given the range of quality of human output, that's a pretty wide span.
But secondly, LLMs wired up with a feedback loop and appropriate IO can model Markov models (models, not chains), which are Turing complete, so an LLM wired up this way is clearly Turing complete as well. As such it's a question of training, not of whether the model can model it.
Because eventually LLM based AI will be advanced enough to author its text with as much variety and diversity as any human, making it indistinguishable from anything authored by a human.
That feels like a _huge_ assumption. It would arguably be quite surprising just based on the prior history of ML-y stuff; there's a tendency for things to get 95% good enough quickly, and then never quite make it to 100% (because 100% is really a completely different sort of problem, arguably requiring AGI).
My favourite example of this: speech recognition. In the 90s this went quite quickly from a joke to a tech demo that required enormously expensive hardware to a tech demo that you could run on your laptop to actually marginally useful (you may remember Apple and particularly MS claiming that it was the future, and would soon be the primary way that people interacted with computers).
And then, well... like, voice recognition today is certainly better than Dragon Dictate was 25 years ago, but you probably still wouldn't want to trust it with anything important, and it's certainly far worse at it than a human is.
I'm struggling to think of _any_ ML-ish thing in the human-imitating space which has gotten all the way to virtually perfect/as good as human. Maybe OCR? Though even then, you still sometimes get odd results.
Oh, yeah, actually, fair, a few games. Though, can an expert human player reliably determine that they're playing against a machine? I suspect they can, but I don't know.
In general the goal of these programs is to win, not to pretend to be human.
In chess, there is a project called Maia which aims at predicting the human move rather than the best move. Even then it blunders less than humans of a similar rating, so it can still be detected.
All these claims of ChatGPT content detection assume that the person using ChatGPT is just taking the raw output and not doing any editing or verification. We are treading onto Ship of Theseus grounds here, but how much has to change before it's not important?
For example:
> "One of the biggest problems is that it assembles text from many sources and there isn't any kind of accuracy check – it's kind of like the game Two Truths and a Lie."
That's not a ChatGPT problem. It's an accuracy problem. If the output is edited to fix the accuracy, then it's no longer an issue.
Yes, blindly using output can be an issue, but the issue is accuracy, not the method used to generate it. If ChatGPT output were 100% accurate, what would be the next problem?
Surely what matters more than detecting generated content is verifying that the information is true and consistent and limiting the quantity and length of submissions per verified user of a system.
That takes care of the quality of the submission. The other part is determining attribution. Why not just ask the human submitter to defend their work in a controlled environment? If they can, it's their paper now.
I don't see a difference between unassisted people turning in crap vs assisted people still turning in crap.
What if I take chatgpt output and partially rewrite it in my own words, or add or mix in some original content? What result does this identification tool return?
The chance that LLMs won't be able to produce text in a certain style is extremely low.
What is true is that by default ChatGPT doesn't produce the same style as academic papers, which seems unsurprising. You could also compare it to a 15-year-old writing text messages and conclude that ChatGPT content is identifiable by that metric, which, again, seems unsurprising.
I had to try asking it to rewrite this as 15-year-old text speak, and while I 100% agree with you, it clearly needs some more detailed prompting to get a result that doesn't sound like an adult trying and failing to pass for a teenager. Funny, though:
> So like, for real, ChatGPT doesn't spit out stuff that sounds like those super boring academic papers by default, which is like totally no shocker, right? I mean, you could even compare it to us teens textin' and all, and you'd find out that ChatGPT vibes are different, and, honestly, who's surprised by that? Lol.
This is another reason to use OSS LLMs: the bar to entry is much higher, so they’re not targeted for countermeasures and they circumvent human pattern-matching.
You’re less likely to be “found out”, especially with there being LOTS of fine-tunes and even a couple fundamental models.
Though I don't think this study is accurate, people are trying, and they're only targeting "Open"AI's products.
> Here is some text. Please rewrite it as follows: 1. DO NOT change the content more than the bare minimum needed to comply with the instructions. 2. Rewrite to add at least one question mark, and one semicolon. Add a "however", "but", "although" and "because", and rewrite the sentences so it flows naturally and without changing the meaning. Here is the paragraph you should rewrite:
>
> They believe their classifier is effective, because it homes in on a range of stylistic differences between human and AI writing. Scientists are more likely to have a richer vocabulary and write longer paragraphs containing more diverse words than machines. They also use punctuation like question marks, brackets, semicolons more frequently than ChatGPT, except for speech marks used for quotations.
ChatGPT output (GPT4):
> Although they have confidence in their classifier's effectiveness due to its ability to identify a variety of stylistic differences between human and AI writing, a question emerges; why is this believed to be the case? It's because scientists are generally more likely to utilize a richer vocabulary and construct longer paragraphs filled with a diverse set of words, but can this solely account for the difference?
>
> They also employ punctuation, such as question marks, brackets, and semicolons, with a higher frequency compared to ChatGPT; however, an exception is the use of speech marks for quotations.
Of course their detector might well handle that too, but GPT4 is perfectly capable of writing text with very varying level of styles and vocabulary, and if their detection relies on that, odds are people will prompt their way around that quite easily before even looking at more advanced approaches like e.g. finetuning on scientific papers within the right field etc.
Do people really not understand that they're just building a GAN (generative adversarial network), and that the endgame of the arms race is that not even detector machines will be able to tell the difference within a margin of statistical error? Unless that is the goal, this just helps the machines in the long term.
The real danger of this kind of work is that you'll have a shoddy product that misclassifies student/academic writings and people will get tagged as cheaters when they aren't.
It only works for a highly specific type of article that's found in scientific journals ('perspectives' articles), and at the paragraph level it was closer to 92% correct.
It sounds like they didn't do any follow-up prompts. The first response from ChatGPT is usually inadequate, but with some follow-ups you can get a much better one.
As usual, journalists trying to explain what scientists do and misrepresenting the facts.
The paper mentions accuracy i.e. (true positives + true negatives) / total examples. And it's actually 100% accurate i.e. there are no false positives _or_ false negatives.
But the big caveats are:
1. this was tested only on 180 examples, which is a very very small dataset to draw conclusions on, and
2. this is obviously an adversarial space so any classifier will be obsolete with the next training run
I'm bearish on any attempt to distinguish real content vs. AI generated content (on any medium, text, image or anything else). This is an adversarial game and the AIs can incorporate your fancy algorithm to fool you better. In the end these projects only end up improving the AI models in terms of realism.
> 1. this was tested only on 180 examples, which is a very very small dataset to draw conclusions on, and
If you have 180 samples and >99% accuracy (meaning at most a single misprediction), that is a statistically significant result, at roughly a 99.994% confidence level.
Furthermore, from the article, it was only for whole scientific papers written by ChatGPT that they attained >99%; at the "paragraph level" they claim 92%, which means the claim of detecting "AI-generated content" in the headline is overly broad IMHO.