No disrespect but I am baffled by your statement that it learns, even to go so f...

spuz · on May 8, 2023

I think the comment was trying to draw the distinction between a database and a language model. The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller. This should tell us that a language model cannot reproduce copyrighted code byte for byte because the original data simply doesn't exist. Similarly, when you and I read a block of code, it leaves our memory pretty quickly and we wouldn't be able to reproduce it byte for byte if we wanted. We say the model learns like a human because it is able to extract generalised patterns from viewing many examples. That doesn't mean it learns exactly like a human but it's also definitely not a database.

The problem is that in reality, even though the original data is gone, a language model like Copilot _can_ reproduce some code byte for byte somehow drawing the information from the weights in its network and the result is a reproduction of copyrighted work.

asimpletune · on May 8, 2023

I see what you're going for, and I respect your point of view, but also respectfully I think the logic is a little circular.

To say "it's not a database, it's a language model, and that means it extracts generalized patterns from viewing examples, just like humans" to me that just means that occasionally humans behave like language models. That doesn't mean though that therefore it thinks like a human, but rather sometimes humans think like a language model (a fundamental algorithm), which is circular. It hardly makes sense to justify that a language model learns like a human, just because people also occasionally copy patterns and search/replace values and variable names.

To really make the comparison honest, we have to be more clear about the hypothetical humans in question. For a human who has truly learned from looking at many examples, we could have a conversation with them and they would demonstrate a deeper sense of understanding behind the meaning of what they copied. This is something a LLM could not do. On the other hand, if a person really had no idea, like someone who copied answers from someone else in a test, we'd just say well you don't really understand this and you're just x degrees away from having copied their answers verbatim. I believe LLMs are emulating this behavior and not the former.

I mean, how many times in your life have you talked to a human being who clearly had no idea what they were doing because they copied something and didn't understand it all? If that's the analogy that's being made then I'd say it's a bad one, because it is actually choosing the one time where humans don't understand what they've done as a false equivalence to language models thinking like a human.

Basically, sometimes humans meaninglessly parrot things too.

xigoi · on May 9, 2023

> The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller.

This just means it's a really efficient lossy compression algorithm, not that it learns like a human.

CapsAdmin · on May 8, 2023

> why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?

I've never studied computer science formally but I doubt students learn only from the CS curriculum? I don't even know how much knowledge CS curriculum entails but I don't for example see anything wrong including example code written by humans.

Surely students will collectively also learn from millions of code examples online alongside the study. I'm sure teachers also do the same.

A language model can also only learn from text, so what about all the implicit knowledge and verbal communication?

ChatGTP · on May 8, 2023

What they are saying is that if you’ve studied computer science , you should be able to write a computer program without storing millions or billions of lines of code from GitHub in your brain.

A CS graduate could workout how to write software without doing that.

So they’re just pointing out the difference in “learning”.

CapsAdmin · on May 8, 2023

LLM's are not storing millions or billions lines of code, and neither do we. Both store something more general and abstract.

But I'm saying there's a big difference between a CS graduate and some current LLM that learns from "the CS curriculum". A CS graduate can ask questions, use google to learn about things outside of school, work on hobby projects, study existing code outside of what's shown in university, get compiler feedback when things go wrong, etc.

All a language model can do is read text and try to predict what comes next.

mirekrusin · on May 8, 2023

We do but we also simulate it doing homework very well.