Copilot acts like a search engine: you search, you find, then you judge. It was never the case with search engines that you could just copy some code you found without verifying it. Also, it has the same copyright problems as if you used Google to find the code.
Nice theory. Won’t work out in practice, because this produces code that will run, and it’s AI, so it must be good, right?
When you found code on the internet, it was presented in a context that let you make a better judgement (e.g. on Stack Overflow this regular expression would have had a score of roughly −∞ and multiple highly-voted comments saying “do not use this, it’s catastrophically bad”), and where you had to put in more effort to plug it in and shuffle things around a bit as well. With Copilot, you get given ready-to-go code without any sanity checking at all.
See even how, a few seconds later in the video, the author does test it out—but not thoroughly enough.
I think Copilot should report the matching source URL to allow the user to visit the page and see the context and license. This move would also placate some copyright questions because it would be like searching Stack Overflow or GitHub for inspiration.
The problem of content attribution (exact and fuzzy match) has been studied before under the task of plagiarism detection for student essays. Funny thing is that a plagiarism detection Copilot would also disclose past cases of copyright violation and cause attribution disputes because code sitting unchecked in various repos would suddenly become visible.
If you can't trace the source then it's transformative use. If it matches training data then it needs to report the source like a search engine and place all responsibility on the user.
And fuzzy code matching could be easily implemented by using a model similar to CLIP (contrastive) to embed code snippets.
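A rough sketch of what "report the source when the output matches training data" could look like. The embedding here is a stand-in (token trigram counts), not a learned contrastive model, and `embed`, `attribution_for`, the 0.5 threshold and the example URLs are all made up for illustration:

```python
import math
import re
from collections import Counter


def embed(snippet, n=3):
    """Stand-in embedding: counts of token n-grams.

    Assumption: a real system would swap in a learned (e.g. contrastive) model.
    """
    tokens = re.findall(r"\w+|[^\w\s]", snippet)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribution_for(output, corpus, threshold=0.5):
    """Return (url, score) for the closest corpus entry, or None below the threshold.

    `corpus` maps a source URL to the snippet text found there (hypothetical data).
    """
    out_vec = embed(output)
    score, url = max(
        ((cosine(out_vec, embed(text)), url) for url, text in corpus.items()),
        default=(0.0, None),
    )
    return (url, score) if score >= threshold else None
```

With this stand-in, renaming identifiers drops the score sharply, which is exactly the gap a learned embedding would be expected to close.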
The jury is still out on the licensing issue. And given it can, at least sometimes, output verbatim copy-paste, including comments, of well-known GPL code[0], well, let's just say the issue is not clear-cut.
Copilot's API and suggestions could easily be (and maybe were?) implemented as an SBQA-style model: a search engine finds promising examples/context, then a transformer model synthesizes the final output.
Attribution would clearly be required in such a search-derived model.
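To make the retrieve-then-synthesize idea concrete, here is a minimal sketch. Nothing here reflects how Copilot is actually built; `Snippet`, `Suggestion`, `retrieve`, `synthesize` and `suggest` are hypothetical names, the retrieval is naive word overlap, and the "transformer" step is a placeholder that just returns the top example:

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    url: str
    license: str
    code: str


@dataclass
class Suggestion:
    code: str
    sources: list  # (url, license) pairs for every example that fed the output


def retrieve(prompt, index, top_k=2):
    """Search step: rank indexed snippets by word overlap with the prompt."""
    words = set(prompt.lower().split())
    ranked = sorted(
        index,
        key=lambda s: len(words & set(s.code.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def synthesize(prompt, examples):
    """Placeholder for the model step: a real system would generate new code here."""
    return examples[0].code if examples else ""


def suggest(prompt, index):
    """Carry the retrieved sources along with the output so they can be shown."""
    examples = retrieve(prompt, index)
    return Suggestion(
        code=synthesize(prompt, examples),
        sources=[(s.url, s.license) for s in examples],
    )
```

The point of the design is only that the sources are known at the moment of generation, so reporting them costs nothing extra.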
Regardless of the licence, does the produced code even qualify for copyright protection, or does it fall under fair use?
What licence, if any, is there for unique code generated by Copilot, etc.?
It's a great big ball of who knows. However, given that you're only getting snippets, I expect you would be highly unlikely to get code that doesn't fall under the fair use provisions. That said, IANAL.
So if we classify AI as search.. and then claim fair use.. we can launder dirty viral code.. how could this go wrong?
But really, if thispersondoesnotexist is just a really good per-pixel search against a corpus of human faces where each “page” of result pixels is organized in a grid presented as a new image with its own metadata..
I mean I guess Google really was an AI company all along.
No, it doesn't. If my understanding of it is correct, it's an autoencoder, then a few more bits of AI. The MNIST dataset is a collection of handwritten digits used in many early machine learning classes. Usually they are used to train a classifier, which returns the correct digit given an image. They can also be used to train an autoencoder, which will take an image in, compress it down to far fewer channels, and put out an image that quite closely matches the original.
Once you have an autoencoder, it is easier to input data and train a neural network to do something with the compressed output. There is no way the autoencoder knows which samples were used to generate the resulting output; it's just optimized for compression.
Thus, Copilot isn't search. You could take the entire corpus it was trained on and log all the compressed outputs. You could then take a given output before the autoencoder expands it back out and tell which few source code fragments were closest, but there are no guarantees.
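For what it's worth, the "log the compressed outputs and look up the closest ones" step would be something like the sketch below. The latent vectors and fragment ids are invented for illustration, and `nearest_fragments` is a hypothetical helper, not anything Copilot exposes:

```python
import math


def nearest_fragments(latent, logged, k=3):
    """Return the k logged fragments whose latent vectors lie closest to `latent`.

    The result is only a ranking by Euclidean distance; it says nothing about
    which fragments actually influenced the output.
    """
    ranked = sorted((math.dist(latent, vec), frag_id) for frag_id, vec in logged.items())
    return ranked[:k]


# Hypothetical log: fragment id -> latent vector recorded in a pass over the corpus.
logged = {
    "repo_a/utils.py:10": (0.9, 0.1, 0.0),
    "repo_b/math.c:42": (0.8, 0.2, 0.1),
    "repo_c/main.rs:7": (0.0, 0.1, 0.9),
}

for distance, frag_id in nearest_fragments((0.85, 0.15, 0.05), logged, k=2):
    print(f"{frag_id}: distance {distance:.3f}")
```

In this toy data the two nearest fragments are essentially equidistant from the query, which is the "no guarantees" problem in miniature: closeness gives you candidates, not provenance.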
TL;DR: A far closer analogy: Copilot acts like a comedian who has stolen a lot of jokes, and can't even remember where they came from.
Search engines give you a link where you can (usually) see the code in context, who wrote it, when, license, etc. And often more, like who is using it where, how often it's updated, contact info, test suites, and so on.
Is this true? From what I remember reading, the code was uniquely created (but I could be wrong). If that's the case, then does it tell you what license the generated code is under?