
We developers like to pretend that LLMs are akin to humans and that they've been using things like the NYTimes as educational material, the way humans do.

But they are not. It's much simpler: proprietary writing is now integrated into the source code of OpenAI. It would be as if I copied parts of someone else's proprietary code and pasted it into my own codebase, then claimed that copy-paste is the natural product of millions of years of evolution.

The fact that LLMs are so complicated that we can't point to where the copied text lives doesn't make it any less so.



> It would be as if I copied parts of someone else's proprietary code and pasted it into my own codebase.

It's not copy-pasted; it's compressed in a lossy manner. Even GPT-4 has nowhere near enough memory to store the entirety of its training data in a lossless format. Just like how humans compress the information we read.
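Rough back-of-envelope, as a sketch (both numbers below are widely circulated estimates, not figures OpenAI has confirmed):

    # Weights vs. raw training text: can the model hold it all verbatim?
    params = 1.8e12              # rumored GPT-4 parameter count
    bytes_per_param = 2          # fp16 weights
    training_tokens = 13e12      # rumored training token count
    bytes_per_token = 4          # a token is roughly 4 bytes of text

    model_bytes = params * bytes_per_param            # ~3.6 TB of weights
    data_bytes = training_tokens * bytes_per_token    # ~52 TB of raw text
    print(f"data is ~{data_bytes / model_bytes:.0f}x the model size")  # ~14x

That ratio is beyond what lossless text compression typically manages, so on average the model's retention has to be lossy, even if some passages survive verbatim.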


> Just like how humans compress the information we read.

Humans don't have the scale machines have, and moreover humans aren't services. That argument doesn't fly.

I really think the NYT's data isn't that important or crucial; the LLMs could have just elided it. However, this is more about training on copyrighted data in general, which is crucial for OpenAI: they trained their LLMs indiscriminately on copyrighted content without any plan to share the profits.


You're kind of proving my comment by pretending they are akin to a human brain instead of an evolved form of statistics mixed with code, a.k.a. a transformer model.

Not to mention that it's a centralised model being distributed for a fee.


If you have a copyrighted photo that I simply put through JPEG compression, am I legally allowed to use that?

Software programs are not humans and need to be treated differently. Anthropomorphization is one of the slipperiest ways to argue anything.
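To make the JPEG analogy concrete, a minimal sketch with Pillow (the filenames and quality setting are arbitrary placeholders):

    # Lossy re-encode: most of the pixel data is discarded, yet the result
    # is still recognizably the same copyrighted photo.
    from PIL import Image

    img = Image.open("photo.png").convert("RGB")  # hypothetical original
    img.save("compressed.jpg", quality=10)        # aggressive lossy save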


It depends on how much is reproducible and what the use is.

If only small patches of the original image can be reproduced, then it becomes much murkier.


If it's lossily compressed, how come there's verbatim NYT content in there that's easy to recall? That's what the lawsuit is about.
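This is easy to probe, as a sketch (the strings here are placeholders; a real test would take `generated` from a completion API given the article's opening):

    # Memorization probe: how much of the true continuation does the
    # model reproduce verbatim, character by character?
    def verbatim_prefix_len(generated: str, reference: str) -> int:
        n = 0
        for g, r in zip(generated, reference):
            if g != r:
                break
            n += 1
        return n

    reference = "the remainder of the original article text..."
    generated = "the remainder of the original article text, roughly..."
    print(verbatim_prefix_len(generated, reference))  # 42 chars verbatim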


Some humans have photographic memories. It's not common, but it's not unheard of for people to be able to memorize long portions of text verbatim.

For example, the Wikipedia article

https://en.wikipedia.org/wiki/List_of_people_claimed_to_poss...

contains several examples of people who were able to look at pages and recite them back. That is actually a much stronger ability than GPT's, since GPT has presumably looked at those pages 100 times.


Yes, and a car is a fast horse. Your argument doesn't tell us anything about whether or not GPT should be legal.

Laws are created by people (not by computers reasoning that all analogies must be true). And fairness is an important part of that process.


So if I compress NYTimes articles into a vector database and query it with a vector, then that's okay, in line with your reasoning?
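Concretely, I mean something like this (a minimal sketch, numpy only; the random vectors stand in for a real embedding model):

    # Toy "vector database": store article embeddings, retrieve by cosine
    # similarity. A real system would embed text with a learned model.
    import numpy as np

    rng = np.random.default_rng(0)
    articles = ["article one ...", "article two ...", "article three ..."]
    embeddings = rng.normal(size=(len(articles), 64))   # stand-in vectors
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

    query = embeddings[1] + 0.1 * rng.normal(size=64)   # noisy query vector
    query /= np.linalg.norm(query)

    scores = embeddings @ query                 # cosine similarities
    print(articles[int(np.argmax(scores))])     # -> "article two ..."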


> We developers like to pretend that LLMs are akin to humans and that they've been using things like the NYTimes as educational material.

Developers who think LLMs are akin to humans aren't the brightest crop, and are usually a subject of ridicule.


> It's much simpler: proprietary writing is now integrated into the source code of OpenAI

The source code of an LLM is likely just a few hundred lines describing the shape of the neural networks involved in the model.

None of the NYTimes content will be in the source code. The NYTimes doesn't publish Python source code; it publishes human-language news.

LLMs are conceptually simple: mostly matrix multiplications, with some non-linear operations connecting each layer, arranged into attention blocks in a loop. It's the staggering amount of training data and compute that makes them complex.
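To illustrate, a toy single-head attention layer in plain numpy (a sketch only: real models add multiple heads, layer norm, residual connections, and MLPs, but the core is matmuls plus a softmax):

    # One self-attention layer: a few matrix multiplies and a softmax.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(x, Wq, Wk, Wv):
        q, k, v = x @ Wq, x @ Wk, x @ Wv           # project tokens to q/k/v
        scores = q @ k.T / np.sqrt(k.shape[-1])    # scaled dot-product
        return softmax(scores) @ v                 # weighted mix of values

    rng = np.random.default_rng(0)
    d = 16
    x = rng.normal(size=(10, d))    # 10 tokens, d-dimensional embeddings
    out = attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
    print(out.shape)                # (10, 16)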



