I mean, the original article doesn't say anything about video models (where, frankly, spotting fakes is currently much easier), so I think you're shifting what "they" are.
But still:
- input doesn't distinguish real motion from constructed, nonphysical motion (e.g. animations, moving title cards)
- input doesn't distinguish motion of the camera from motion of the portrayed objects
- input doesn't distinguish unnatural filmic techniques (e.g. shot changes, fade-ins/outs) from changes actually in the footage
Some years ago, I saw a series of results on GANs for image completion, and they had an accidental tendency to add points of interest. If you showed one the left half of a photo of just ocean, horizon, and sky, and asked for the right half, it would try to put in a boat or an island, because people generally don't take and publish photos of empty ocean -- even though most stretches of the horizon probably are quite empty. The distribution over images is not like reality.