I can talk a lot about this, since this is a space I've spent a lot of time experimenting in. All I will say is that all these detectors (a) create a ton of false positives, and (b) are incredibly easy to bypass if you know what you are doing.
As an example, one method that I found works extremely well is to simply rewrite the article section by section, with instructions that require the model to mimic the writing style of an arbitrary block of human-written text.
This works a lot better than (as an example) asking it to write in a specific style. If I just say something along the lines of "write in a casual style that conveys lightheartedness towards the topic", it does not work as well as simply saying "rewrite this, mimicking the style in which the following text block is written: X" (where X is a block of human-written text).
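Concretely, the rewrite step looks something like this (a minimal sketch of the idea, not my actual pipeline; the model name and prompt wording are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

def rewrite_mimicking_style(section: str, human_sample: str) -> str:
    # The important part: give the model a concrete block of human-written
    # text to imitate, rather than an abstract style description.
    prompt = (
        "Rewrite the following section, mimicking the style in which the "
        "example text block is written.\n\n"
        f"Example text block:\n{human_sample}\n\n"
        f"Section to rewrite:\n{section}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any instruct-tuned model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```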
There are some silly things that will (a) cause human-written text to be detected as AI and (b) allow AI text to avoid detection. For example, using a broad vocabulary tends to make detectors flag the text as AI-written. So if you are using Grammarly to "improve your writing", don't be surprised if it gets flagged. The inverse is true too: if you use statistical analysis to replace less common expressions with more common ones, AI-generated text is less likely to be detected as AI.
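That "replace rare expressions with common ones" trick can be done mechanically. A rough sketch of the idea, using word frequencies from the wordfreq package and WordNet synonyms via nltk (illustrative only, not the exact statistics I use):

```python
from wordfreq import zipf_frequency
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def commonify(words, lang="en", rare_threshold=3.5):
    """Replace rare words with their most common WordNet synonym."""
    out = []
    for word in words:
        if zipf_frequency(word, lang) >= rare_threshold:
            out.append(word)
            continue
        # Gather synonym candidates and keep whichever is most frequent.
        candidates = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(word)
            for lemma in syn.lemmas()
        }
        best = max(candidates | {word}, key=lambda w: zipf_frequency(w, lang))
        out.append(best)
    return out

print(commonify(["the", "report", "elucidates", "salient", "caveats"]))
```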
If someone is interested, I can talk a lot more about the hundreds of experiments I've run by now.
> I can talk a lot about this, since this is a space I've spent a lot of time experimenting in.
So I'm a researcher in vision generation and haven't read too much about LLM detection but am aware of the error rates you mention. I have questions...
What I'm absolutely surprised by is the use of perplexity for detection. Why would you target perplexity? LMs are minimizing NLL/entropy, and instruct-tuned models push even further in that direction, so that you're minimizing the cross-entropy relative to human output (or at least human-desired output). Which makes it obvious that such a metric would flag generic or common patterns as AI-generated. But I'm just absolutely baffled that this is the main metric being used, and in the case of this paper, the only metric. It also gives a very easy way to fool these detectors: just throwing in a random word or a few spelling mistakes should throw off detection, since such changes clearly increase perplexity. To me this sounds like using a GAN's discriminator to identify outputs of GANs (the whole training method is about trying to fool the discriminator!). (Obviously I'm also not buying the zero-shot claim.)
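For context, the quantity these detectors score is essentially the average negative log-likelihood an observer LM assigns to the text (Binoculars, as I understand it, then normalizes this by a cross-perplexity term). A minimal sketch with GPT-2 via transformers, which also shows why a stray rare word or typo inflates the score:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids returns the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Low perplexity reads as "AI-generated" to these detectors; a single rare
# word or spelling mistake pushes the number up.
print(perplexity("The quick brown fox jumps over the lazy dog."))
```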
I will also add that, at least for now, if you are doing it for SEO, it _really_ doesn't matter. I was planning to make a case study benchmarking my algo against a bunch of other content generators. I was hoping for a statistically significant difference, but there was none. So, the thing that matters in the long run is whether end-users find your content valuable, because that's ultimately how Google will decide whether to send more traffic to your content, rather than by trying to detect whether it was "AI generated".
I think the value of this is the extremely low false-positive rate, so it can act as a first-pass sieve when there is a large number of inputs to test. What other Binoculars-style detectors have you experimented with where you're seeing a "ton of false positives"?
I use https://originality.ai/ as the benchmark. I've tested all commercially available services, and Originality (at the time; it's been a few months) provided the lowest false-positive rate. As a testing sample, I've built a database of articles written by various text generators and compared them against articles that I scraped from the web from before 2017 (basically any text from before LLMs saw daylight).
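The benchmark itself is conceptually trivial; something like the sketch below, where `detect_ai` is a placeholder for whichever detector or API is under test (not a real endpoint):

```python
def detect_ai(text: str) -> float:
    """Placeholder: return the detector's probability that text is AI-generated."""
    raise NotImplementedError  # plug in the detector under test

def error_rates(human_articles, generated_articles, threshold=0.5):
    # Known-human (pre-2017) articles flagged as AI -> false positives.
    false_positives = sum(detect_ai(t) >= threshold for t in human_articles)
    # Known-generated articles passed as human -> false negatives.
    false_negatives = sum(detect_ai(t) < threshold for t in generated_articles)
    return (false_positives / len(human_articles),
            false_negatives / len(generated_articles))
```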
I am sure that these algorithms have evolved, but given my past experiments, I sincerely doubt that we are at a point where they (a) cannot be easily bypassed if you are targeting them, and (b) do not create a lot of false positives.
As stated in another comment, I personally "gave up" on trying to bypass AI detection [it often negatively impacts output quality], at least for my use case, and focus on creating the highest-value content possible.
I know that services like Surfer SEO are continuing to actively invest in bypassing all detectors. But... as a human, I do not enjoy their content, and that is what matters the most.
Just for fun, I tested a few recently generated articles with https://huggingface.co/spaces/tomg-group-umd/Binoculars (someone linked it in this thread) and it ranked them as "Human-Generated" (which I assume means human-written). And... I am not even trying to evade AI detection in my generated content. I was wholeheartedly expecting to fail. Meanwhile, Originality flags the same content as AI-generated with 85% confidence, which is... fair enough.
If I'm reading this correctly, it's not making any particular claim with respect to text labeled human generated. What it's saying is that if it claims the text is machine generated, it's highly likely that it actually is.
The article you're commenting on actually states in its abstract:
> Over a wide range of document types, Binoculars detects over 90% of generated samples from ChatGPT (and other LLMs) at a false positive rate of 0.01%, despite not being trained on any ChatGPT data.
Since you say you're knowledgeable on this, here's a question: If you have access to the model, wouldn't it be possible to inspect the sequence of token probabilities for a piece of text and derive from this a probability that the text was produced by that model at a given temperature? It would seem intuitive that the exact token probabilities are model specific and can be used to identify a model from its output given enough data.
I suppose an issue with this might be that an unknown prompt would add a lot of "hidden" information, but you could probably start from a guess or multiple guesses at the prompt.
That's pretty much how most of these methods work. It just doesn't work very well, because good models assign reasonable probability to lots of different texts, so you don't get very different numbers for AI-generated and human-generated texts. After all, the models are trained to learn the probability distribution of exactly human text.
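To make that concrete: with access to the model you can read off the per-token probabilities the question asks about, at a chosen temperature (sketch below, using GPT-2 as a stand-in). The catch is that genuinely human text also gets unremarkable scores from a good model, so the two score distributions overlap heavily.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def token_logprobs(text: str, temperature: float = 1.0):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Scale logits by the assumed sampling temperature before normalizing.
        logits = model(ids).logits[0, :-1] / temperature
    logprobs = torch.log_softmax(logits, dim=-1)
    # Log-probability the model assigns to each actually observed next token.
    return logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

scores = token_logprobs("Some text whose provenance you want to test.")
print(scores.mean().item())  # higher (less negative) = more "model-like"
```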
It can be useful for small-scale verification in academics - TAs and schoolteachers can use it to check that the assignments and homework submitted were actually worked on. Yes, a determined student can spend more time and effort making the output look authentic despite using LLMs, but at that point you've already gone past a typical lazy student's usage pattern - if she is too lazy to do her homework, she can safely be assumed to be too lazy to spend time refining her prompts and weights as well.
How do you do this manual review? How can a human spot LLM-generated text? The internet is full of horror stories of good students getting failing grades due to false positive LLM detectors where the manual review was cursory at best.
I am curious actually! In general about your experiments, but also about integrating this detection algorithm to wider systems. Did you run any autogpt-like experiments with the AI generated text as a critique? My use case is a bit different (decision-making), so I play with relative plausibility instead of writing style. But I haven't found convincing ways of "converging" quite yet, i.e. benchmarks that don't rely solely on LLMs themselves to give their output.
To clarify, the style experiment I've referenced earlier was just that – an experiment. I did not implement those methods into my software. Instead, I focused on how to eliminate things like 'talking with authority without evidence', 'contradictions', 'talking in extremely abstract concepts', 'conclusions without insights', etc.
If you need a dataset to benchmark against, download articles from before 2017. There are a few ready-made datasets floating around the Internet.
Grammarly is used a lot by non-native English speakers translating their papers to English. I wonder how difficult publishing papers would be if AI checks become commonplace in the future.