I can talk a lot about this, since this is a space I've spent a lot of time experimenting in. All I will say is that all these detectors (a) create a ton of false positives, and (b) are incredibly easy to bypass if you know what you are doing.
As an example, one method that I found works extremely well is to simply rewrite the article section by section, with instructions that require the model to mimic the writing style of an arbitrary block of human-written text.
This works a lot better than (as an example) asking it to write in a specific style. If I just say something along the lines of "write in a casual style that conveys lightheartedness towards the topic", it does not work as well as simply saying "rewrite this, mimicking the style in which the following text block is written: X" (where X is a block of human-written text).
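Concretely, the rewrite step looks something like this (a minimal sketch of the idea, not my actual pipeline; the model name and prompt wording are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

def rewrite_mimicking_style(section: str, human_sample: str) -> str:
    # The important part: give the model a concrete block of human-written
    # text to imitate, rather than an abstract style description.
    prompt = (
        "Rewrite the following section, mimicking the style in which the "
        "example text block is written.\n\n"
        f"Example text block:\n{human_sample}\n\n"
        f"Section to rewrite:\n{section}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any instruct-tuned model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```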
There are some silly things that will (a) cause human-written text to be detected as AI and (b) allow AI text to avoid detection. For example, using a broad vocabulary tends to make detectors flag the text as AI-written. So if you are using Grammarly to "improve your writing", don't be surprised if it gets flagged. The inverse is true too: if you use statistical analysis to replace less common expressions with more common ones, AI-generated text is less likely to be detected as AI.
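That "replace rare expressions with common ones" trick can be done mechanically. A rough sketch of the idea, using word frequencies from the wordfreq package and WordNet synonyms via nltk (illustrative only, not the exact statistics I use):

```python
from wordfreq import zipf_frequency
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def commonify(words, lang="en", rare_threshold=3.5):
    """Replace rare words with their most common WordNet synonym."""
    out = []
    for word in words:
        if zipf_frequency(word, lang) >= rare_threshold:
            out.append(word)
            continue
        # Gather synonym candidates and keep whichever is most frequent.
        candidates = {
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(word)
            for lemma in syn.lemmas()
        }
        best = max(candidates | {word}, key=lambda w: zipf_frequency(w, lang))
        out.append(best)
    return out

print(commonify(["the", "report", "elucidates", "salient", "caveats"]))
```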
If someone is interested, I can talk a lot more about the hundreds of experiments I've run by now.
> I can talk a lot about this, since this is a space I've spent a lot of time experimenting in.
So I'm a researcher in vision generation and haven't read too much about LLM detection but am aware of the error rates you mention. I have questions...
What I'm absolutely surprised by is the use of perplexity for detection. Why would you target perplexity? LMs are minimizing NLL/entropy, and instruct-tuned models push even further in that direction, so that you're minimizing the cross-entropy relative to human output (or at least human-desired output). Which makes it obvious that such a metric would flag generic or common patterns as AI-generated. But I'm just absolutely baffled that this is the main metric being used, and in the case of this paper, the only metric. It also gives a very easy way to fool these detectors: just throwing in a random word or a few spelling mistakes should throw off detection, since such changes clearly increase perplexity. To me this sounds like using a GAN's discriminator to identify outputs of GANs (the whole training method is about trying to fool the discriminator!). (Obviously I'm also not buying the zero-shot claim.)
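For context, the quantity these detectors score is essentially the average negative log-likelihood an observer LM assigns to the text (Binoculars, as I understand it, then normalizes this by a cross-perplexity term). A minimal sketch with GPT-2 via transformers, which also shows why a stray rare word or typo inflates the score:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids returns the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Low perplexity reads as "AI-generated" to these detectors; a single rare
# word or spelling mistake pushes the number up.
print(perplexity("The quick brown fox jumps over the lazy dog."))
```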
I will also add that, at least for now, if you are doing it for SEO, it _really_ doesn't matter. I was planning to make a case study benchmarking my algo against a bunch of other content generators. I was hoping for a statistically significant difference, but there was none. So, the thing that matters in the long run is whether end-users find your content valuable, because that's ultimately how Google will decide whether to send more traffic to your content, rather than by trying to detect whether it was "AI generated".
I think the value of this is the extremely low false-positive rate, so it can act as a first-pass sieve when there is a large number of inputs to test. What other Binoculars-style detectors have you experimented with where you're seeing a "ton of false positives"?
I use https://originality.ai/ as the benchmark. I've tested all commercially available services, and Originality (at the time; it's been a few months) provided the lowest false-positive rate. As a testing sample, I've built a database of articles written by various text generators and compared them against articles that I scraped from the web from before 2017 (basically any text from before LLMs saw daylight).
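The benchmark itself is conceptually trivial; something like the sketch below, where `detect_ai` is a placeholder for whichever detector or API is under test (not a real endpoint):

```python
def detect_ai(text: str) -> float:
    """Placeholder: return the detector's probability that text is AI-generated."""
    raise NotImplementedError  # plug in the detector under test

def error_rates(human_articles, generated_articles, threshold=0.5):
    # Known-human (pre-2017) articles flagged as AI -> false positives.
    false_positives = sum(detect_ai(t) >= threshold for t in human_articles)
    # Known-generated articles passed as human -> false negatives.
    false_negatives = sum(detect_ai(t) < threshold for t in generated_articles)
    return (false_positives / len(human_articles),
            false_negatives / len(generated_articles))
```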
I am sure that these algorithms have evolved, but given my past experiments, I sincerely doubt that we are at a point where they (a) cannot be easily bypassed if you are targeting them, and (b) do not create a lot of false positives.
As stated in another comment, I personally "gave up" on trying to bypass AI detection [it often negatively impacts output quality], at least for my use case, and focus on creating the highest-value content possible.
I know that services like Surfer SEO are continuing to actively invest in bypassing all detectors. But... as a human, I do not enjoy their content, and that is what matters the most.
Just for fun, I tested a few recently generated articles with https://huggingface.co/spaces/tomg-group-umd/Binoculars (someone linked it in this thread) and it ranked them as "Human-Generated" (which I assume means human-written). And... I am not even trying to evade AI detection in my generated content. I was wholeheartedly expecting to fail. Meanwhile, Originality flags the same content as AI-generated with 85% confidence, which is... fair enough.
If I'm reading this correctly, it's not making any particular claim with respect to text labeled human generated. What it's saying is that if it claims the text is machine generated, it's highly likely that it actually is.
The article you're commenting on actually states in its abstract:
> Over a wide range of document types, Binoculars detects over 90% of generated samples from ChatGPT (and other LLMs) at a false positive rate of 0.01%, despite not being trained on any ChatGPT data.
Since you say you're knowledgeable on this, here's a question: If you have access to the model, wouldn't it be possible to inspect the sequence of token probabilities for a piece of text and derive from this a probability that the text was produced by that model at a given temperature? It would seem intuitive that the exact token probabilities are model specific and can be used to identify a model from its output given enough data.
I suppose an issue with this might be that an unknown prompt would add a lot of "hidden" information, but you could probably start from a guess or multiple guesses at the prompt.
That's pretty much how most of these methods work. It just doesn't work very well, because good models assign reasonable probability to lots of different texts, so you don't get very different numbers for AI-generated and human-generated texts. After all, the models are trained to learn the probability distribution of exactly human text.
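To make that concrete: with access to the model you can read off the per-token probabilities the question asks about, at a chosen temperature (sketch below, using GPT-2 as a stand-in). The catch is that genuinely human text also gets unremarkable scores from a good model, so the two score distributions overlap heavily.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def token_logprobs(text: str, temperature: float = 1.0):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Scale logits by the assumed sampling temperature before normalizing.
        logits = model(ids).logits[0, :-1] / temperature
    logprobs = torch.log_softmax(logits, dim=-1)
    # Log-probability the model assigns to each actually observed next token.
    return logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)

scores = token_logprobs("Some text whose provenance you want to test.")
print(scores.mean().item())  # higher (less negative) = more "model-like"
```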
It can be useful for small-scale verification in academics - TAs and schoolteachers can use it to check that the assignments and homework submitted were actually worked on. Yes, a determined student can spend more time and effort making the output look authentic despite using LLMs, but at that point you've already gone past a typical lazy student's usage pattern - if she is too lazy to do her homework, she can safely be assumed to be too lazy to spend time refining her prompts and weights as well.
How do you do this manual review? How can a human spot LLM-generated text? The internet is full of horror stories of good students getting failing grades due to false positive LLM detectors where the manual review was cursory at best.
I am curious actually! In general about your experiments, but also about integrating this detection algorithm to wider systems. Did you run any autogpt-like experiments with the AI generated text as a critique? My use case is a bit different (decision-making), so I play with relative plausibility instead of writing style. But I haven't found convincing ways of "converging" quite yet, i.e. benchmarks that don't rely solely on LLMs themselves to give their output.
To clarify, the style experiment I've referenced earlier was just that – an experiment. I did not implement those methods into my software. Instead, I focused on how to eliminate things like 'talking with authority without evidence', 'contradictions', 'talking in extremely abstract concepts', 'conclusions without insights', etc.
If you need a dataset to benchmark against, download articles from before 2017. There are a few ready-made datasets floating around the Internet.
Grammarly is used a lot by non-native English speakers translating their papers to English. I wonder how difficult publishing papers would be if AI checks become commonplace in the future.