It does not matter whether Matthew understands how transformer models work. He does understand that legally this is looking messy – or “foggy”, in his words.

As someone who teaches and conducts research with these models, I see no reason why a tool like Copilot has to be a purely parametric, generative transformer like GPT-3, where attribution, as you describe it, is nearly impossible.

For example, if the model were to use a retriever component to obtain specific pieces of concrete code from a database (where the licensing of each piece is known), conditioned on the original source-code context, and then generate its output based on those pieces, the context in your source code, and its pre-trained parameters, it could theoretically at least satisfy Butterick’s request, and it would probably be more akin to how a human programmer operates.
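
For illustration, a minimal sketch of that retrieve-then-generate flow (the names, the toy snippet “database”, and the stub generator are all my own invention, not Copilot’s actual architecture):

    from dataclasses import dataclass

    @dataclass
    class Snippet:
        code: str
        license: str  # licensing for each piece is known up front
        source: str

    # Toy corpus; a real system would use an indexed code database
    # with dense or BM25 retrieval instead of naive token overlap.
    DB = [
        Snippet("def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
                "MIT", "github.com/example/fib"),
        Snippet("def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a",
                "GPL-3.0", "github.com/example/gcd"),
    ]

    def retrieve(context, k=1):
        # Rank snippets by token overlap with the editor context.
        ctx = set(context.split())
        ranked = sorted(DB, key=lambda s: len(ctx & set(s.code.split())), reverse=True)
        return ranked[:k]

    def generate(context):
        # Condition the output on retrieved snippets and hand back their
        # provenance, so attribution comes for free with every completion.
        evidence = retrieve(context)
        completion = evidence[0].code  # stub: a real model conditions on this
        return completion, evidence

    completion, evidence = generate("def fib ( n ) :")
    for s in evidence:
        print("from %s (license: %s)" % (s.source, s.license))

The point of the sketch is only that every retrieved snippet carries its license and source through to the output, which is exactly what a purely parametric model cannot offer.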

This still does not rule out legal issues entirely, as the pre-trained parameters remain opaque, but it certainly makes the situation less problematic.

Alternatively, there is active research on attributing a given output to a transformer’s training data, but it is still very early days, and frankly it is very “foggy” to what degree this can be done.
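
A crude baseline in that direction, just to fix ideas, is surface matching of the output against a training document (actual work in this area uses influence-style methods that go well beyond this; the function here is purely my own illustration):

    import re

    def ngrams(text, n=8):
        toks = re.findall(r"\w+|\S", text)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def verbatim_overlap(output, training_doc, n=8):
        # Fraction of the output's n-grams that appear verbatim in a
        # training document: a proxy for "this came from there",
        # which fails as soon as the model paraphrases.
        out = ngrams(output, n)
        return len(out & ngrams(training_doc, n)) / max(len(out), 1)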

Lastly, I have seen a bunch of comments elsewhere alluding to it somehow being sufficient to reduce the level of verbatim copying. Sadly, it is not: even if you, for example, replace the variable names, it is still a copyright violation, just as slightly manipulating the RGB values of an image is. Determining fair use, etc. without a license is something that today can only be done in court, regardless of how those of us of a heavier technical disposition may feel about it. After all, that is exactly why we have explicit licenses in the first place!
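
To see why, here is a small, self-contained demonstration (my own toy example): renaming every identifier leaves the program’s structure, and hence the copied expression, untouched:

    import ast

    def normalize(src):
        # Rewrite every identifier to a placeholder, so two snippets that
        # differ only in naming dump to the exact same structure.
        tree = ast.parse(src)
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                node.id = "_"
            elif isinstance(node, ast.arg):
                node.arg = "_"
            elif isinstance(node, ast.FunctionDef):
                node.name = "_"
        return ast.dump(tree)

    original = "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a"
    renamed = "def euclid(x, y):\n    while y:\n        x, y = y, x % y\n    return x"
    print(normalize(original) == normalize(renamed))  # True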


