We developers like to pretend that LLMs are akin to humans, and that they've been reading things like the NYTimes as educational material the way humans do.
But they are not. It's much simpler: proprietary writing is now integrated into the source code of OpenAI. It would be as if I copied parts of other proprietary code and pasted it into my own codebase, then claimed copy-paste is the natural result of millions of years of evolution.
The fact that LLMs are so complicated that we can't point to where the copied material lives doesn't make it any less so.
> It would be as if I copied parts of other proprietary code and pasted it into my own codebase.
It's not copy-pasted; it's compressed in a lossy manner. Even GPT-4 has nowhere near enough memory to store the entirety of its training data losslessly. Just like how humans compress the information we read.
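A back-of-envelope sketch of that claim. GPT-4's parameter count and training-set size are not public, so these are rough public figures for GPT-3 (175B parameters, on the order of 300B training tokens per the GPT-3 paper); the byte estimates are assumptions for illustration, not measurements.

```python
# Rough check that the weights are smaller than the training text,
# so lossless storage of the whole corpus is impossible.
params = 175e9          # GPT-3 parameter count (public figure)
bytes_per_param = 2     # assuming fp16 weights
model_bytes = params * bytes_per_param

tokens = 300e9          # approx. GPT-3 training tokens (public figure)
bytes_per_token = 4     # a token is very roughly 3-4 characters of text
data_bytes = tokens * bytes_per_token

print(f"weights: {model_bytes / 1e12:.2f} TB")
print(f"text:    {data_bytes / 1e12:.2f} TB")
print(f"ratio:   {data_bytes / model_bytes:.1f}x more text than weights")
```

Even with these generous assumptions the weights are a few times smaller than the text they were trained on, before accounting for the fact that most capacity goes to modelling language in general rather than memorising any one document.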
> Just like how humans compress the information we read.
Humans don't have the scale machines have, and moreover humans aren't services; that argument doesn't fly.
I really think the NYT's data isn't that important or crucial; the LLMs could have done without it. However, it's more about training on copyrighted data in general, which is crucial for OpenAI: they trained their LLMs indiscriminately on copyrighted content without any plan to share the profits.
You're kind of proving the point of my comment, pretending they are akin to a human brain instead of an evolved form of statistics mixed with code, aka a transformer model.
And that's before considering that it's a centralised model being distributed for a fee.
contains several examples of people who were able to look at pages and recite them back. That is actually a much stronger ability than GPT's, since GPT has presumably looked at them 100 times.
>It's much simpler, proprietary writing is now integrated into the source code of OpenAI
The source code of the LLM is likely a few hundred lines of code describing the shape of the neural networks involved in the model.
None of the NYTimes content will be in the source code. The NYTimes doesn't publish Python source code; it publishes human-language news.
LLMs are conceptually simple: mostly matrix multiplications and some non-linear operations connecting each layer, wired together with attention, etc. It's the staggering amount of training data and compute that makes them complex.