Tests suggest clues of whose content was used to train OpenAI’s Sora

DroneBetter · 2025-10-01T20:34:44 1759350884

https://archive.is/ozjEb (note some of the gifs become static images here)

codedokode · 2025-10-01T22:34:08 1759358048

This vague copyright situation works against open-source AI models, which have to disclose the sources of their training data, while closed-source companies can freely use pirated material and gain an advantage over open-source models.

smegma2 · 2025-10-01T22:39:48 1759358388

I’m normally skeptical of claims like this, but looking at the examples it seems that Sora is reproducing some of its training data verbatim. I guess it’s a case of overfitting? In particular the Civ example seems like it must have been copied almost verbatim.

viewtransform · 2025-10-02T00:06:08 1759363568

Title goes on to say:

Tests by The Post suggest the training data for OpenAI’s video generator Sora included versions of movies, TikTok clips and Netflix shows.