In the original Llama paper, the process of preparing the corpus was described. For the source code portion, it was cleaned of boilerplate before being fed into the model's corpus. I think it's a fair assumption that most of the other vendors followed this practice when creating their datasets.
I'm not a Python user, but in most languages, libraries are referenced in (what most devs would consider) boilerplate code. Purely conjecture, but perhaps without boilerplate code, LLMs are left guessing the names of popular libraries and just merge together two common naming conventions: "huggingface" and "-cli".
In many cases yes, but: What if you publish multiple tools in a single package? What if you publish a package for a web service called X, but also include a tool for it called xcli? What if your package is primarily a library, but you also include an optional cli (like in this case, probably)?
I think what's worse is publishing a package under a different name than its root namespace.
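To make that mismatch concrete, here is a small Python sketch. The `beautifulsoup4`/`bs4` and `Pillow`/`PIL` pairs are real, well-known examples of a PyPI distribution name differing from its root namespace; the helper function is my own illustration, not from any of the packages mentioned:

```python
import importlib.util

# A distribution's PyPI name and its importable root namespace can differ.
# Real examples: `pip install beautifulsoup4` -> `import bs4`,
#                `pip install Pillow`         -> `import PIL`.
# find_spec looks up the *import* name, which is what code actually uses,
# so nothing in the import line tells you what to `pip install`.
def is_importable(module_name: str) -> bool:
    return importlib.util.find_spec(module_name) is not None

print(is_importable("json"))             # stdlib module: True
print(is_importable("no_such_pkg_xyz"))  # nonexistent: False
```

The point being: code that only ever shows the import side of that mapping gives a model no evidence for the install side, so it has to guess.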
> What if you publish multiple tools in a single package?
I question whether one should ever do that... Packages are free... Just make one package per binary, and then a metapackage for those who want to install the whole suite.
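A minimal sketch of that layout using `pyproject.toml` conventions; the package names (`xtool-convert`, `xtool-inspect`, `xtool-suite`) are hypothetical:

```toml
# --- pyproject.toml for one tool package (one binary per package) ---
[project]
name = "xtool-convert"
version = "1.0.0"

[project.scripts]
xtool-convert = "xtool_convert.cli:main"

# --- pyproject.toml for the metapackage: no code of its own, it just
# depends on every tool, so `pip install xtool-suite` installs the lot ---
[project]
name = "xtool-suite"
version = "1.0.0"
dependencies = ["xtool-convert", "xtool-inspect"]
```

With this layout, each binary's package name matches the command it installs, and the suite is still one install away.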
If some bits of meaningful code (containing essential complexity) imply other bits with mostly accidental complexity, writing the latter by hand is a waste of time.
But isn't it also a waste to use data centers full of GPUs to process terabytes of text to accomplish the same thing better programming language design could?
We spend that much GPU compute not just to generate random boilerplate code but to create the real code.
If it weren't beneficial for writing code (which it is), we wouldn't use it.
But yes, if we could create better languages or systems, it would be a waste. But we've tried multiple new programming languages, no-code platforms, etc.
It does look, though, like LLMs are still better than all of those approaches.
I have to write very little boilerplate code as it is with the tooling I choose. And a lot of it is generated by scripts using some input from me. I don't need cloud GPUs to write code at all.
It's about writing code faster and potentially better. Cloud GPUs can also generate unit tests, etc.
I primarily use it for languages I don't use often enough; nonetheless, it's only a question of time until it doesn't make sense to write code yourself anymore.