The level of copying here is the copying into the training set, not the copying through use of the model.

It's true that OpenAI will defend the wholesale copying into the training set by arguing that the transformative purpose of the subsequent use reaches back and renders that copying fair use. While that's clearly the dominant position in the AI industry, and it does seem compatible with the Constitutional purpose of fair use (fair use is currently statutory, but the statutory provision is a codification of Constitutional case law), it is a novel fair use argument.



> The level of copying here is the copying into the training set, not the copying through use of the model.

NY Times is suing over both the model outputs and the existence of the training set. But infringement in the training set doesn't necessarily mean that the model infringes. Why not? Because of the substantial similarity requirement. But first, I'll address the training set.

For articles that a person obtains through legal methods (like buying subscriptions) but doesn't then republish, storing copies of those articles is analogous to recording a legally accessed television show (time-shifting), which generally is fair use. No court has yet ruled that "analogous to time-shifting" is enough for the time-shifting precedent to apply, but I don't think the difference is significant. The same reasoning applies to companies: a company is not literally a person, but there's no reason the time-shifting precedent shouldn't extend to one.

What about the articles that OpenAI obtained through illegal methods? There, the very act of obtaining the articles was unlawful. The training set contains those copies, so NY Times can sue to make OpenAI delete them and pay damages. But it's not trivially obvious that a GPT model is a copy of any work in the training set, or that it contains copied expression from any of those works; the weights that make up the model represent millions of works, so it's not trivially obvious that the model contains anything substantially similar to the expression in any one work. Therefore, infringement with respect to the training set doesn't trivially amount to infringement with respect to the model made from it. If OpenAI obtained NY Times articles through illegal means, then making OpenAI delete the training set would be reasonable, but the model is a separate matter.

As long as the model doesn't contain copied expression, and the weights can't be reversed into something substantially similar to the expression in the existing works, what matters is the output of the model.

If a user gives a prompt that contains no reference to an existing NY Times author, work, or strongly associated characteristic or style, do OpenAI's models produce outputs substantially similar to expression in the existing works? If not, then OpenAI shouldn't be liable for infringing outputs, because those outputs result from the user's prompts. If my premise is false, my conclusion falls apart. But if it's true, then at most I would admit that OpenAI has a limited burden to prevent users from giving such prompts.



