
Later in the thread he stated the code was not on the machine he tested copilot with.

Copilot training data should have been sanitized better.

In addition: any code that is produced by copilot that uses a source that is licensed, MUST follow the practices of that license, including copyright headers.



Right - but if someone pushes the same code to github and changes the licence file to say "public domain", what's the legally correct way to proceed? What's the morally correct way to proceed?


Legally, if you're publishing a derived work without legitimate permission then you're civilly liable for statutory + actual damages, the only thing you're avoiding is the treble damages for wilful infringement.

Morally I'd say you should make a reasonable good faith effort to verify that you have a real license for everything you're using. When you're importing something on the scale of "all of Github" that means a bit more effort than just blindly trusting the file in the repository. When I worked with an F500 we would have a human explicitly review the license of each dependency; the review was pretty cursory, but it would've been enough to catch someone blatantly ripping off a popular repo.


How do you know GH didn't? Maybe they only included repos with LICENSE.MD files which followed a known permissive licence?

What if a particular piece of code is licensed restrictively, and then (assuming without malice) accidentally included in a piece of software with a permissive license?

What if a particular piece of code is licensed permissively (in a way that allows relicensing, for example), but then included in a software package with a more restrictive licence. How could you tell if the original code is licensed permissively or not?

At what point do Github have to become absolute arbiters of the original authorship of the code in order to determine who is authorised to issue licenses for the code? How would they do so? How could you prove ownership to Github? What consequences could there be if you were unable to prove ownership?

That's before we even get to more nuanced ethical questions like a human learning to code will inevitably learn from reading code, even if the code they read is not permissively licensed. Why then, would an AI learning to code not be allowed to do the same?


The “it’s really hard” argument isn’t a very good argument in my opinion?

If we hold reproductions of a single repository to a certain standard, the same standard should probably apply to mass reproductions. For a single repository, it’s your responsibility to make sure it’s used according to the license.

Are there slightly gray edge cases? Of course, but they’re not -that- grey. If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.

If something is prohibitively difficult maybe we should sometimes consider that more work is required to enable the circumstances for it to be a good idea, rather than starting from the position that we should do it and moulding what we consider reasonable around that starting assumption.


If someone uploads something and says 'hey, this is some code, this is the appropriate licence for it', and that turns out to be wrong, it is their mistake, it is in violation of Github's terms of service, and may even be fraudulent [0].

I'm also not sure that Copilot is just reproducing code, but that's a separate discussion.

> If I reproduced part of a book from a source that claimed incorrectly it was released under a permissive license, I would still be liable for that misuse. Especially if I was later made aware of the mistake and didn’t correct it.

I don't believe that's correct in the first instance (at least from a criminal perspective). If someone misrepresents to you that they have the right to authorise you to publish something, and it turns out they don't have that right, you did not willingly infringe and are not liable for the infringement from a criminal perspective[1]. From a civil perspective, likely the copyright owner could still claim damages from you if you were unable to reach a settlement. A court would probably determine the damages to award based on real damages (including loss of earnings for the content creator), rather than anything punitive if it's found that

Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.

This is my own code, I wrote it myself just now. Can I copyright it?

```
function isOdd (num) {
  // A non-zero remainder means the number is odd
  if (num % 2 !== 0) {
    return true;
  } else {
    return false;
  }
}
```

What about the following:

```
function isOddAndNotSunday (num) {
  const date = new Date();
  // getDay() returns 0 for Sunday, so > 0 means any other day
  if (num % 2 !== 0 && date.getDay() > 0) {
    return true;
  } else {
    return false;
  }
}
```

Where do we draw the line?

[0]: https://docs.github.com/en/site-policy/github-terms/github-t...

[1]: https://www.law.cornell.edu/uscode/text/17/506


> From a civil perspective, likely the copyright owner could still claim damages from you if you were unable to reach a settlement. A court would probably determine the damages to award based on real damages (including loss of earnings for the content creator), rather than anything punitive if it's found that

There are statutory damages on top of your actual damages. $50k per act of infringement. No reason for the copyright holder to settle for less when it's an open and shut case.

> Further, most jurisdictions have exceptions for short extracts of a larger copyrighted work (e.g. quotes from a book), which may apply to Copilot.

Quotes do not automatically get an exception just because they're taken from a larger work, they might be excepted either because they were de minimis (essentially because they were too short to be copyrightable) or because they were fair use (which is a complex question that takes into account the purpose and context, which Copilot is very unlikely to satisfy because it's not quoting other code for the purpose of saying something about it).

> Where do we draw the line?

Circuit specific; some but not all circuits use the AFC (abstraction-filtration-comparison) test. It sounds like this code was both long enough and creative/innovative enough to be well on the wrong side of it though.


I am not sure about statutory damages.

As I understand it, the complainant may CHOOSE to ask the court for statutory damages rather than actual damages at any point before final judgment, but is not entitled to both actual AND statutory damages (17 U.S. Code § 504).

It also seems to be absolutely capped at 30K per infringement, not 50, and ranges up from $750. It also seems that if the "court finds, that such infringer was not aware and had no reason to believe that his or her acts constituted an infringement of copyright, the court in its discretion may reduce the award of statutory damages to a sum of not less than $200."
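To make the numbers concrete, here is a rough sketch of the range described above, using only the figures quoted in this comment ($750 to $30K, with a $200 floor where the court finds the infringer had no reason to believe they were infringing). The function names and the `innocent` flag are made up for illustration, and this is arithmetic on the quoted figures, not legal advice.

```javascript
// Hypothetical helper: the statutory range as quoted in the comment above.
// `innocent` models the case where the court finds the infringer was not
// aware and had no reason to believe the acts were infringing.
function statutoryRange({ innocent = false } = {}) {
  return { min: innocent ? 200 : 750, max: 30000 };
}

// Clamp a proposed statutory award into that range.
function clampStatutoryAward(amount, options = {}) {
  const { min, max } = statutoryRange(options);
  return Math.min(Math.max(amount, min), max);
}

// The complainant elects ONE remedy, statutory or actual, never the sum.
function recovery({ actualDamages, statutoryAward, elect, innocent = false }) {
  if (elect === "statutory") {
    return clampStatutoryAward(statutoryAward, { innocent });
  }
  return actualDamages;
}

console.log(clampStatutoryAward(50000));                  // 30000
console.log(clampStatutoryAward(100, { innocent: true })); // 200
console.log(recovery({ actualDamages: 1200, statutoryAward: 50000, elect: "statutory" })); // 30000
```

So a claimed $50k award would be clamped to $30,000 under the figures quoted here, and electing statutory damages replaces the actual-damages figure rather than adding to it.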

I think you are probably right that this specific function is copyrightable though, but taken overall, I think Microsoft's lawyers have probably concluded that they would win any challenge on this. Microsoft have lost court battles before though, so who knows?


Your question can actually be answered legally. I'm not a lawyer so I'm not going to tell you what those answers are, but there are pretty well established mechanisms to determine if a function is trivial enough to warrant being copyrighted (a lot of this was explored in the SCO vs. IBM saga)


> How do you know GH didn't? Maybe they only included repos with LICENSE.MD files which followed a known permissive licence?

Since copilot famously outputs GPL covered code… no, we have proof they didn't do that.


I think you've missed my point.

Suppose you write some code and release it under the GPL. Then I take your code, integrate it into my project, and release my project under the MIT licence (for example). It may be that Copilot was only trained on my repo (with the MIT licence).

The fault there is not on Github, it's on me. I was the one who incorrectly used your code in a way that does not conform to the terms of your licence to me.

I don't think the fact that Copilot outputs code which seems to be covered under the GPL proves that Github did not only crawl repositories with permissive licences when training Copilot.
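A toy sketch of that scenario, assuming a crawler that filters purely on the declared licence. Nothing here reflects how Github actually selected training data; the repo names, the `declaredLicense` and `actualOrigin` fields, and the allowlist are all made up for illustration.

```javascript
// Hypothetical allowlist of SPDX identifiers treated as permissive.
const PERMISSIVE = new Set(["MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"]);

// Hypothetical repo records. declaredLicense is what the LICENSE file says;
// actualOrigin is ground truth that a crawler has no way of seeing.
const repos = [
  { name: "alice/solver", declaredLicense: "GPL-3.0-only", actualOrigin: "original" },
  { name: "bob/app", declaredLicense: "MIT", actualOrigin: "copied from alice/solver" },
  { name: "carol/utils", declaredLicense: "MIT", actualOrigin: "original" },
];

// Keep only repos whose declared licence is on the allowlist.
function selectForTraining(candidates) {
  return candidates.filter((repo) => PERMISSIVE.has(repo.declaredLicense));
}

const selected = selectForTraining(repos).map((repo) => repo.name);
console.log(selected); // [ 'bob/app', 'carol/utils' ]
```

The GPL repo is excluded, but the mislabelled copy in `bob/app` passes the filter, so GPL code can still end up in the training set even when the crawl only accepts permissive licences.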


It is the responsibility of the entity publishing the code (Microsoft in this case) to make sure that they have the right to publish. The Linux kernel generally requires non-anonymous contributions for that reason, as a guarantee that the person has the right to contribute.


> It is the responsibility of the entity (Microsoft in this case) publishing the code to make sure that they have the right to publish.

This would basically kill github as an idea. I like the ability to be able to push some personal project to github and don't really give a fuck about technical copyright violations and I think the same is true for 90% of developers.


If you want a massive corpus of training data then you can create it by hand like grandpappy used to do rather than just thieving it whilst telling yourself it is fine.


You keep track of where each external dependency, file and code snippet comes from, with a link to the source and a link to the source license.

If someone has lied about the license of something down the chain of links, he's the one on the hook for it.

If you have licensed code in your software and no license to show for it or cannot produce the link to it then you're on the hook.

And here's the issue at hand: Copilot must have seen that code under a permissive license somewhere, but now cannot produce a link to it.
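A minimal sketch of the bookkeeping described above, assuming a simple per-file record. The field names (`path`, `sourceUrl`, `licenseId`, `licenseUrl`), the file paths and the example.com URLs are hypothetical, not any standard manifest format.

```javascript
// Hypothetical provenance records: one entry per external file or snippet.
const provenance = [
  {
    path: "vendor/matrix-utils.js",
    sourceUrl: "https://example.com/matrix-utils",
    licenseId: "MIT",
    licenseUrl: "https://example.com/matrix-utils/LICENSE",
  },
  {
    // A snippet with no recorded origin: this is the entry you are on the hook for.
    path: "src/snippet.js",
    sourceUrl: null,
    licenseId: null,
    licenseUrl: null,
  },
];

// Return every entry that is missing its source link or its license link.
function findUnaccounted(entries) {
  return entries.filter(
    (entry) => !entry.sourceUrl || !entry.licenseId || !entry.licenseUrl
  );
}

const unaccounted = findUnaccounted(provenance);
console.log(unaccounted.map((entry) => entry.path)); // [ 'src/snippet.js' ]
```

Generated code with no traceable origin is in the position of `src/snippet.js`: there is no link to point at if someone asks where it came from.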


> If someone has lied about the license of something down the chain of links, he's the one on the hook for it.

In this case, all you have on them is an email address. Pretty sure you're still on the hook.



