
Copilot acts like a search engine: you search, you find, then you judge. It was never the case with search engines that you could just copy some code you found without verifying it. Also, it has the same copyright problems as if you used Google to find the code.



Nice theory. Won’t work out in practice, because this produces code that will run, and it’s AI, so it must be good, right?

When you found code on the internet, it was presented in a context that let you make better judgement (e.g. on Stack Overflow this regular expression would have had a score of roughly −∞ and multiple highly-voted comments saying “do not use this, it’s catastrophically bad”), and where you have to put in more effort to plug it in and shuffle things around a bit as well. With Copilot, you get given ready-to-go code without any sanity checking at all.

Note how, a few seconds later in the video, the author does test it out, but not thoroughly enough.


This decently sums up my feelings towards this whole thing.

Aside from the dead-horse concerns like licensing... I worry about the training we're giving ourselves and future generations.

The upfront presentation of 'suggestions' skews the perception; a fair bit of the 'no warranty' skepticism comes from having to go dig the code up yourself.


I think Copilot should report the matching source URL to allow the user to visit the page and see the context and license. This would also defuse some copyright questions, because it would be like searching Stack Overflow or GitHub for inspiration.

The problem of content attribution (exact and fuzzy match) has been studied before under the task of plagiarism detection for student essays. The funny thing is that a plagiarism-detecting Copilot would also disclose past cases of copyright violation and cause attribution disputes, because code sitting unchecked in various repos would suddenly become visible.
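
To make that concrete: exact-match attribution is basically corpus fingerprinting, the same trick essay plagiarism checkers use. A minimal Python sketch (the tokenizer and the corpus-as-URL-to-code mapping are assumptions for illustration, not anything Copilot actually does):

    import hashlib
    import re
    from collections import defaultdict

    def fingerprints(code, n=8):
        """Hash sliding n-grams of tokens, as plagiarism checkers do."""
        tokens = re.findall(r"\w+|\S", code)
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            yield hashlib.sha1(gram.encode()).hexdigest()

    def build_index(corpus):
        """corpus: {source_url: file_contents}. Map fingerprint -> source URLs."""
        index = defaultdict(set)
        for url, code in corpus.items():
            for fp in fingerprints(code):
                index[fp].add(url)
        return index

    def attribute(snippet, index):
        """Return every source URL sharing at least one fingerprint with the snippet."""
        hits = set()
        for fp in fingerprints(snippet):
            hits |= index.get(fp, set())
        return hits

A hit means "go look at this page and its license yourself", which is exactly the search-engine behaviour being asked for.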


> I think Copilot should report the matching source URL

That's the problem. The output of a generative model like Copilot usually can't be traced directly back to a single input.


If you can't trace the source then it's transformative use. If it matches training data then it needs to report the source like a search engine and place all responsibility on the user.

And fuzzy code matching could easily be implemented by using a model similar to CLIP (contrastive) to embed code snippets.
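
Roughly, that fuzzy matching would look like this (the embed() function below is a stand-in for a contrastively trained code encoder; here it's just a toy bag-of-bytes vector so the sketch runs):

    import numpy as np

    def embed(snippet):
        """Stand-in for a CLIP-style contrastive code encoder."""
        vec = np.zeros(256)
        for b in snippet.encode():
            vec[b] += 1.0
        return vec / (np.linalg.norm(vec) + 1e-9)

    def nearest_sources(suggestion, corpus, k=5):
        """Rank corpus snippets by cosine similarity to a suggestion.
        corpus: {source_url: code_snippet}."""
        q = embed(suggestion)
        scored = [(float(q @ embed(code)), url) for url, code in corpus.items()]
        return sorted(scored, reverse=True)[:k]

Anything above some similarity threshold gets surfaced as a "this looks a lot like X" link; anything below it is plausibly original.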


> If you can't trace the source then it's transformative use.

That's not how "transformative use" works.


A list sorted by probability would be better than nothing. Would a model like Copilot be able to provide that?


Not easily. It's a rather opaque process.


And besides, let's get real, nobody would look at the links; it defeats the purpose of Copilot.


> Copilot acts like a search engine, you search, you find, then you judge.

OK, but that’s not how GitHub position it:

“Your AI pair programmer” “Skip the docs and searching for examples”

They literally say it’s not a search engine!


Does GitHub Copilot tell me which license the code it suggested has? If not, that's a huge difference from code search engines.


Copilot is not a search engine. It synthesises code so it is unlicensed.


The jury is still out on the licensing issue. And given it can, at least sometimes, output verbatim copies, comments included, of well-known GPL code[0], well, let's just say the issue is not clear-cut.

[0]: https://news.ycombinator.com/item?id=27710287


Copilot's API and suggestions could easily be (and maybe were?) implemented as an SBQA-style model: using a search engine to find promising examples/context, followed by a transformer model to synthesize the final output.

Attribution would clearly be required in such a search-derived model.
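
A hedged sketch of what such a retrieve-then-synthesize pipeline could look like (search_examples and generate below are toy placeholders, not Copilot's actual internals):

    def search_examples(prompt, k=3):
        """Placeholder for a code search index (e.g. GitHub code search).
        Returns (source_url, snippet) pairs; here, a toy in-memory corpus."""
        corpus = [
            ("https://example.com/utils.py",
             "def clamp(x, lo, hi): return max(lo, min(x, hi))"),
        ]
        hits = [(url, code) for url, code in corpus
                if any(word in code for word in prompt.split())]
        return hits[:k]

    def generate(prompt, context):
        """Placeholder for a transformer conditioned on the prompt plus the
        retrieved context; here it just echoes the best retrieved snippet."""
        return context[0] if context else "# no suggestion"

    def suggest(prompt):
        examples = search_examples(prompt)
        code = generate(prompt, [snippet for _, snippet in examples])
        # The retrieval step is explicit, so attribution is simply the list
        # of URLs that were fed to the model as context.
        return code, [url for url, _ in examples]

The point being: if retrieval is a separate, visible step, attribution falls out of the architecture almost for free.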


> Also, it has the same copyright problems as if you used Google to find the code.

No, it has the same copyright problems as if you Googled and, instead of getting links to sites that host the code and its license, you got just the code.


It's even more gray than that.

Regardless of the licence, does the produced code even qualify for copyright protection, or does it fall under fair use?

What licence, if any, applies to unique code generated by Copilot, etc.?

It's a great big ball of who knows. However, I expect that, given you're only getting snippets, you would be highly unlikely to get code that doesn't fall under the fair use provisions. That said, IANAL.


So if we classify AI as search.. and then claim fair use.. we can launder dirty viral code.. how could this go wrong?

But really, if thispersondoesnotexist is just a really good per-pixel search against a corpus of human faces where each “page” of result pixels is organized in a grid presented as a new image with its own metadata..

I mean I guess Google really was an AI company all along.


> Copilot acts like a search engine

No, it doesn't. If my understanding of it is correct, it's an autoencoder, then a few more bits of AI. The MNIST dataset is a collection of handwritten digits used in many early machine learning classes. Usually they are used to train a classifier, which returns the correct digit given an image. They can also be used to train an autoencoder, which will take an image in, compress it down to far fewer channels, and put out an image that quite closely matches the original.

Once you have an autoencoder, it is easier to input data and train a neural network to do something with the compressed output. There is no way the autoencoder knows which samples were used to generate the resulting output; it's just optimized for compression.

Thus, Copilot isn't search. You could take the entire corpus it was trained on and log all the compressed outputs. You could then take a given output, before the autoencoder expands it back out, and tell which few source-code fragments were closest, but there are no guarantees.
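
A minimal sketch of that latent-space lookup, assuming PyTorch/torchvision and a toy architecture (nothing like Copilot's actual model):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Toy autoencoder: 28x28 image -> 32-d code -> 28x28 reconstruction.
    class AutoEncoder(nn.Module):
        def __init__(self, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128),
                                         nn.ReLU(), nn.Linear(128, latent_dim))
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                         nn.Linear(128, 784), nn.Sigmoid())

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), z

    mnist = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
    model = AutoEncoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # One pass over the data is enough for the illustration.
    for images, _ in DataLoader(mnist, batch_size=256, shuffle=True):
        recon, _ = model(images)
        loss = nn.functional.mse_loss(recon, images.view(-1, 784))
        opt.zero_grad()
        loss.backward()
        opt.step()

    # "Log all the compressed outputs", then ask which training samples sit
    # closest to a given latent code. Closest is not the same as "was the
    # source" -- no guarantees.
    with torch.no_grad():
        corpus = torch.stack([mnist[i][0] for i in range(1000)])
        corpus_z = model.encoder(corpus)
        query_z = model.encoder(mnist[0][0].unsqueeze(0))
        print("nearest training sample:", int(torch.cdist(query_z, corpus_z).argmin()))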

TL;DR: A far closer analogy: Copilot acts like a comedian who has stolen a lot of jokes and can't even remember where they came from.


Just like a real copilot in a car or an airplane shouldn't be trusted? Perhaps they should choose a different name then.


Search engines give you a link where you can (usually) see the code in context, who wrote it, when, license, etc. And often more, like who is using it where, how often it's updated, contact info, test suites, and so on.


> same copyright problems

Is this true? From what I remember reading, the code was uniquely created (but I could be wrong). If that's the case, then does it tell you what license the generated code is under?



