
In the screenshot for the article you can see that the LLM says it is "Searching for: carl zimmer article on the oldest DNA". That, together with what I know about how LLMs work, suggests to me that rather than the article being stored inside the trained LLM, it was downloaded in response to the question. So the fact that the system provides the full text of the article doesn't really bear on whether training the LLM is a transformative use or not.


Yes, the screenshot in the article is clearly showing an Internet search. The exhibit in the lawsuit shows something different: you can complete an article by prompting GPT with its first sentence, using a low temperature to aid reproducibility, and recover the original text except for a single word. That suggests the LLM has effectively recorded the original text into its weights in compressed form: https://pbs.twimg.com/media/GCY4WC6XYAAq-JS?format=jpg&name=...
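A rough sketch of that kind of completion test, assuming the openai-python v1 client; the model name, the `article.txt` stand-in, and the helper names are illustrative, not the lawsuit's actual methodology:

```python
import difflib


def first_sentence(text: str) -> str:
    """Naive first-sentence split on the first period followed by a space."""
    end = text.find(". ")
    return text if end == -1 else text[: end + 1]


def verbatim_ratio(original: str, generated: str) -> float:
    """Fraction of the original reproduced, via difflib's matching blocks."""
    matcher = difflib.SequenceMatcher(None, original, generated, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(original), 1)


def run_completion_test(article: str) -> float:
    """Prompt the API with the first sentence at temperature 0 and measure
    how much of the original comes back. Requires OPENAI_API_KEY; not run here."""
    from openai import OpenAI  # openai-python v1 client

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",  # hypothetical choice of model
        temperature=0,  # near-greedy decoding, as in the exhibit
        messages=[{"role": "user", "content": first_sentence(article)}],
    )
    continuation = resp.choices[0].message.content or ""
    return verbatim_ratio(article, first_sentence(article) + continuation)
```

A ratio near 1.0 would indicate near-verbatim reproduction; a low ratio would point to paraphrase or hallucination instead.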


It would be interesting to test this on a larger sample than just a few articles. It is hard to believe that a majority of NYT articles are stored verbatim in the weights of a web-wide LLM, but if that is the case it would be a pretty remarkable revelation about these models' ability to compress an entire web's worth of data. More likely, I assume it is a case of overfitting, or simply of finding a prompt that happened to work well.

FWIW, I can’t replicate on either GPT 3.5 or 4, but it may be that OpenAI has added new measures to prevent this.


I have attempted this sort of thing with GPT 3.5 many times and never been successful, although I've still never been taken off the GPT-4 waiting list I signed up for months ago, and I'm not going to subscribe without trying it first. I [and presumably many thousands of others] have tried things like this with many LLMs and image-generation models, but to my knowledge we've come up rather short. I've never managed to recreate anything verbatim, and I have struggled to get anything resembling a copyright infringement out of Stable Diffusion, with the sole exception of a meme image of Willy Wonka.

That said, that meme image of Willy Wonka comes out of Stable Diffusion 1.5 almost perfectly, with surprising frequency. Then again, this is probably because it appeared hundreds or thousands of times in the training set, in all sorts of contexts, because it's such a popular meme. There is a tension between its status as an integral part of Internet language and its nature as a copyrighted screen grab.


You can't reproduce this on the web interface, because its temperature setting is higher than what verbatim reproduction requires. You need to use the API, where you can set the temperature yourself.
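The effect of temperature can be seen in a toy next-token distribution (an illustration of standard temperature-scaled softmax sampling, not OpenAI's actual decoder; the logit values are made up):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by temperature.
    As temperature -> 0 this approaches greedy (argmax) decoding."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy next-token logits: the memorized continuation is only slightly preferred.
logits = [2.0, 1.5, 1.0]

high = softmax_with_temperature(logits, 1.0)   # web-UI-like sampling
low = softmax_with_temperature(logits, 0.05)   # near-greedy, API setting

print([round(p, 3) for p in high])  # memorized token wins only ~half the time
print([round(p, 3) for p in low])   # memorized token chosen almost always
```

At temperature 1 the memorized token is sampled only about half the time, so a long passage quickly diverges from the original; near temperature 0 the top token dominates at every step, which is why the API with low temperature can walk the memorized sequence verbatim.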

However, I had good luck reproducing poems on GPT 3.5, both copyrighted and not copyrighted, because the choice of words is a lot more "specific" so to speak, and therefore higher temperature isn't enough to prevent complete reproduction of the originals. See https://chat.openai.com/share/f6dbfb78-7c55-4d89-a92e-f4da23... (Italian; the second example is entirely hallucinated even though a poem with that title exists, while the first and third are recalled perfectly).


It doesn’t seem that surprising; compared to entire NYT articles, poems are short, structured and more likely to be shared in multiple places across the web.

I’m more surprised that it can repeat 100 articles; if that behaviour is consistent at larger sample sizes and beyond just the NYT dataset (which might be repeated on the web more than other sources, causing overfitting), that would be impressive.

You could imagine that at some point a large enough GPT-5, 6, or 7 will be able to memorize every corner of the web verbatim.



