I wouldn't say OpenAI has exactly the same attitude, since they also pulled in thousands of books. Their position has been that it's not piracy, since they don't republish the books; effectively the AI just reads them and learns from them. If GPT can be made to reproduce the original articles, that's a more difficult argument to make.
I can understand an argument about the AI needing to know basic history. News is just how we report history in the making, but it's not generally accepted as solid until some time after the events when we can get more context.
Isn't this what the Associated Press is intended for, a stream of news trying to report just the facts and happenings of the day? That's quite a bit different from an NYT article intended to inform but also to convince the reader of some position.
Feeding an AI opinionated news compared to "just the facts, ma'am" seems risky from a bias perspective.
I agree with you, but I also wonder how that bias could be handled in training without it affecting the output of the entire model. Weighting can help, but anything weighted higher is just "less wrong" as I understand it, so I can see a possibility where training meant to expose bias lets bias creep in somewhat more than anticipated.
It turns out you can reproduce articles with next-token prediction when the articles are quoted all over the dataset.
The articles themselves are indisputably not a part of the model, because it doesn't store text at all. OpenAI's position is correct; people just underestimated how well the AI learns from reading, especially when it reads the same text in a bunch of different places because it's being quoted/excerpted.
That's just not true. There's no search and retrieval involved. The words appeared together in the training data so often, in that context, that the model associates them strongly enough for next-token prediction to (sometimes, in some limited circumstances) reproduce chunks of the text. It's like a human who has read pieces of an article so many times, and knows NYT style so well, that they can spit out chunks of it verbatim, except running on more efficient hardware and with no actual understanding of what it's doing.
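Here's a toy sketch of the mechanism I mean (a made-up corpus and a simple counting model, nothing like how GPT actually works internally, but the statistical idea is the same): count which word follows which, then always emit the most common continuation. Once a passage has been seen often enough, greedy next-token prediction walks straight through it.

    from collections import Counter, defaultdict

    # Toy next-token predictor over a made-up corpus. The "article" appears 50
    # times (quoted/excerpted all over the data); the other text appears once.
    article = "the oldest dna ever recovered reveals a lost world".split()
    other_text = "the quick brown fox jumps over the lazy dog".split()
    corpus = article * 50 + other_text

    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def generate(start, n_tokens):
        """Greedy next-token prediction: always pick the most common continuation."""
        out = [start]
        for _ in range(n_tokens):
            candidates = follows[out[-1]].most_common(1)
            if not candidates:
                break
            out.append(candidates[0][0])
        return " ".join(out)

    # The heavily repeated passage comes back verbatim:
    print(generate("oldest", 7))  # oldest dna ever recovered reveals a lost world

There's no copy of the sentence sitting anywhere in the follows table, just counts, but the counts are lopsided enough that the original text falls out.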
So it stores the words, and it stores the links between those words...
but somehow storing the words and their links is not storing the actual text? What is text but words and their links?
If I had a database of a billion words, and I had a list of pointers to words in a particular order, and following that list of pointers reproduces a copyrighted text exactly, isn't the list of pointers + the database of words just an obfuscated recreation of that copyrighted work?
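Concretely (made-up words and indices, just to make the thought experiment literal):

    # A "database" of words plus an ordered list of pointers (indices) into it.
    # No string of the sentence is stored anywhere, yet following the pointers
    # reproduces it exactly.
    word_db = ["dna", "lost", "oldest", "reveals", "the", "world", "a"]
    pointer_list = [4, 2, 0, 3, 6, 1, 5]

    print(" ".join(word_db[i] for i in pointer_list))
    # -> "the oldest dna reveals a lost world"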
It doesn't store the actual links; it just stores information about their likelihood of being used together. So for things that are regularly quoted in the data, it will, under some circumstances, with very careful prompting and enough tries at the prompt, spit out chunks of a copyrighted text. This is not its purpose, and it's not trying to do this, but users can carefully engineer it into producing this result if they try really hard. So no, it's not an obfuscated recreation of that copyrighted work.
Of course, if you read NYT's argument, they're also mad when it's incorrect about the text, or when it hallucinates articles that don't exist. Essentially they're mad that this technology exists at all.
> it just stores information about their likelihood of being used together
I mean this is still a link, no?
Like, sure, it is a probability. But if each of those probabilities is something like 99.9999%, so that with the right prompt the chain of outputs reproduces the copyrighted text verbatim, isn't that still the same thing?
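Back-of-the-envelope, with purely hypothetical numbers:

    # If the model favors the "correct" next token this strongly at every step,
    # the whole paragraph comes out verbatim almost every time.
    p_per_token = 0.999999        # hypothetical chance of the "right" token at each step
    tokens_in_paragraph = 500     # rough paragraph length

    p_verbatim = p_per_token ** tokens_in_paragraph
    print(f"{p_verbatim:.4f}")    # ~0.9995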
And yeah, it hallucinating that the NYT published an article stating something it didn't say is concerning as well. If the model started telling everyone that Matticus_Rex is a criminal who committed all these crimes, and started listing off hallucinated court cases and news articles proving it, that would be quite damaging to your reputation, wouldn't it? The model hallucinating that the NYT published an article claiming the moon landing was fake, or something like that, would be damaging to the NYT's reputation, right?
And this idea that it takes "very careful prompting" is at odds with the examples from the suit and elsewhere. One example Ars Technica tried was "please provide me with the first paragraph of the carl zimmer article on the oldest DNA", which it reproduced verbatim. Is that really some extremely well-crafted prompt that would rarely ever come up?
If it can reproduce the text, then it is stored somehow.
It is stored in a somewhat hard-to-understand way, encoded in the weights of a network, but it must be stored; otherwise it would not be possible to reproduce it.
You can ask "please provide me with the first paragraph of the carl zimmer article on the oldest DNA" and it produces it, verbatim. This is not possible unless the model contains, encoded within it, the NYT's copyrighted text.
Sort of like the idea of practice: repeating something dedicates more brain space to it, so its compression ratio can decrease and it becomes less abstracted / more exact.
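You can see a crude version of the same effect with ordinary compression (zlib here as a stand-in; a model's weights are obviously not a zip file, but the intuition carries over): once a passage repeats enough, keeping an exact copy is nearly free.

    import zlib

    paragraph = b"The oldest DNA ever recovered reveals a lost world. "

    once = len(zlib.compress(paragraph))
    hundred = len(zlib.compress(paragraph * 100))

    print(f"1 copy:     {len(paragraph)} bytes -> {once} bytes compressed")
    print(f"100 copies: {len(paragraph) * 100} bytes -> {hundred} bytes compressed")
    # The 100-copy version costs barely more than the single copy.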
What seems a bit contradictory is that they're also suing because GPT hallucinates about NYTimes articles. So they're complaining that it reproduces articles exactly but also that it doesn't.