This vague copyright situation works against open-source AI models, which have to disclose the sources of their training data, while closed-source companies can freely use pirated material and gain an advantage over them.
I’m normally skeptical of claims like this, but looking at the examples, it seems that Sora is reproducing some of its training data verbatim. I guess it’s a case of overfitting? The Civ example in particular looks like it must have been copied almost exactly.