
From the original paper (linked in the article):

> ... When given a mixture of original and general abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, but that the generated abstracts were vaguer and had a formulaic feel to the writing.

That last part is interesting because "vague" and "formulaic" would be words I'd use to describe ChatGPT's writing style now. This is a big leap forward from the outright gibberish of just a couple of years ago. But using the "smart BSer" heuristic will probably get a lot harder in no time.

Also, it's worth noting that just four human reviewers were used in the study (and they're listed as authors). That's a very small sample to draw conclusions from. The article doesn't mention the level of expertise of these reviewers, but I suspect that could also play a role.

Some are focusing on the paper-mill angle. But I think the more interesting angle is ideation.

If researchers can't reliably tell the difference between machine-generated and human-written abstracts, what kinds of novel experiments or even research programs could ChatGPT suggest that might never have been considered?



Some, probably, but I feel you could also ask your 5-year-old kid and get the same efficacy. The work isn't in generating ideas; it's in the verification.



