The reality was that nobody could have predicted the A.I breakthroughs when Open...

burnerthrow008 · on March 1, 2024

I have a slightly more cynical take:

Training LLMs requires a lot of text, and, as a practical matter, essentially all LLMs have committed copyright infringement on an industrial scale to collect training data.

The US has a fair-use exception with a four-part test:

The second and third parts (the nature of the work, which is creative, and how much of the work is used, which is all of it) strongly favor copyright owners. The fourth part (which SCOTUS previously said is the most important part, but has since walked back) is neutral to slightly favoring the copiers: Most LLMs are trained to not simply regurgitate the input, so a colorable argument exists that an LLM has no impact on the market for, say, NY Times articles.

Taken together, parts 2 through 4 are leaning towards impermissible use. That leaves us with the first part: Could it make the difference? The first part really has two subparts: How and what are you using it for?

"How" they are using it is clearly transformational (it defeats the purpose of an LLM if it just regurgitates the input), so that argues in favor of copiers like OpenAI.

But where I think Altman had a brilliant/evil flash of genius is the "what" test: OpenAI is officially a non-profit, dedicated to helping humanity, which means the usage is non-commercial. Being non-commercial doesn't automatically make the use fair use, but it might make the difference when considering parts 2 through 4, plus the transformativity of the usage.