We developers like to pretend that LLMs are akin to humans, and that they've been reading things like the NYTimes as educational material the way humans do.
But they are not. It's much simpler: proprietary writing is now integrated into the source code of OpenAI. It would be as if I copied parts of other proprietary code and pasted it into my own codebase, then claimed copy-paste is the natural result of millions of years of evolution.
The fact that LLMs are so complicated that we can't point to where the copied material lives doesn't make it any less so.
> It would be as if I copied parts of other proprietary code and pasted it into my own codebase.
It's not copy-pasted; it's compressed in a lossy manner. Even GPT-4 has nowhere near enough memory to store the entirety of its training data losslessly. Just like how humans compress the information we read.
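A back-of-envelope sketch of that claim. GPT-4's parameter count and training-set size are not public, so these are rough public figures for GPT-3 (175B parameters, on the order of 300B training tokens per the GPT-3 paper); the byte estimates are assumptions for illustration, not measurements.

```python
# Rough check that the weights are smaller than the training text,
# so lossless storage of the whole corpus is impossible.
params = 175e9          # GPT-3 parameter count (public figure)
bytes_per_param = 2     # assuming fp16 weights
model_bytes = params * bytes_per_param

tokens = 300e9          # approx. GPT-3 training tokens (public figure)
bytes_per_token = 4     # a token is very roughly 3-4 characters of text
data_bytes = tokens * bytes_per_token

print(f"weights: {model_bytes / 1e12:.2f} TB")
print(f"text:    {data_bytes / 1e12:.2f} TB")
print(f"ratio:   {data_bytes / model_bytes:.1f}x more text than weights")
```

Even with these generous assumptions the weights are a few times smaller than the text they were trained on, before accounting for the fact that most capacity goes to modelling language in general rather than memorising any one document.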
> Just like how humans compress the information we read.
Humans don't have the scale machines have, and moreover humans aren't services; that argument doesn't fly.
I really think the NYT's data isn't that important or crucial; the LLMs could have done without it. However, it's more about training on copyrighted data in general, which is crucial for OpenAI: they trained their LLMs indiscriminately on copyrighted content without any plan to share the profits.
You're kind of proving the point of my comment, pretending they are akin to a human brain instead of an evolved form of statistics mixed with code, aka a transformer model.
And that's before considering that it's a centralised model being distributed for a fee.
contains several examples of people who were able to look at pages and recite them back. That is actually a much stronger ability than GPT's, since GPT has presumably looked at them 100 times.
>It's much simpler, proprietary writing is now integrated into the source code of OpenAI
The source code of the LLM is likely a few hundred lines of code describing the shape of the neural networks involved in the model.
None of the NYTimes content will be in the source code. The NYTimes doesn't publish Python source code; it publishes human-language news.
LLMs are conceptually simple: mostly matrix multiplications and some non-linear operations connecting each layer, wired together with attention, etc. It's the staggering amount of training data and compute that makes them complex.