Hacker News

In the original Llama paper, the process of preparing the corpus was described. For the source code, it was fed into the model's training corpus after being cleaned of boilerplate. I think it's a fair assumption that most of the other vendors followed this practice when creating their datasets.

I'm not a Python user, but in most languages, libraries are referenced in (what most devs would consider) boilerplate code. Purely conjecture, but perhaps without that boilerplate, the LLMs are left guessing the names of popular libraries and just merge together two common naming conventions: "huggingface" and "-cli".



I think the issue here is that most command-line tools of name X are in an apt/pip package also named X.

For huggingface, the tool is called huggingface-cli, but the package is called huggingface_hub[cli].
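Purely as an illustration of that mismatch, the standard library's importlib.metadata can list which pip distribution provides each installed console script; this is a generic sketch, not specific to huggingface_hub:

```python
from importlib.metadata import distributions

# Map each installed console-script name to the pip distribution that
# provides it; the two often differ (e.g. a script "foo-cli" shipped
# by a package named "foo_hub").
def console_scripts():
    scripts = {}
    for dist in distributions():
        dist_name = dist.metadata["Name"]
        for ep in dist.entry_points:
            if ep.group == "console_scripts":
                scripts[ep.name] = dist_name
    return scripts

if __name__ == "__main__":
    for script, package in sorted(console_scripts().items()):
        print(f"{script} -> {package}")
```

Running this in an environment with huggingface_hub installed would show the huggingface-cli script mapping back to the huggingface_hub distribution.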

IMO, that's bad naming. If you make a tool called X, just publish it in a package called X.


In many cases yes, but: What if you publish multiple tools in a single package? What if you publish a package for a web service called X, but also include a tool for it called xcli? What if your package is primarily a library, but you also include an optional cli (like in this case, probably)?
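As a sketch of that last case (all names here are hypothetical): a pyproject.toml can publish a library under one name while exposing a differently named CLI, with the CLI's extra dependencies behind an extra; this is roughly the huggingface_hub[cli] pattern:

```toml
[project]
name = "foo_hub"               # the pip package name
version = "1.0.0"

[project.optional-dependencies]
# `pip install foo_hub[cli]` pulls in the CLI's extra dependencies
cli = ["click"]

[project.scripts]
# ...but the installed command is named differently
foo-cli = "foo_hub.cli:main"
```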

I think what's worse is publishing a package under a different name than its root namespace.


> What if you publish multiple tools in a single package?

I question if one should ever do that... Packages are free... Just make one package per binary, and then a metapackage for those who want to install the whole suite.
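The one-package-per-binary layout described above could look like this (hypothetical names), with a metapackage that contains no code and only declares dependencies:

```toml
# pyproject.toml for the metapackage "foosuite":
# installing it pulls in every per-binary package.
[project]
name = "foosuite"
version = "1.0.0"
dependencies = [
    "foo-cli",      # one package per tool, each named after its binary
    "foo-server",
    "foo-convert",
]
```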


my guess is that some layer optimized for the package name being the repo. so the researchers just published a package that matched the repo name.

ai is a bad non-losless search engine.


I think the word you're looking for is lossy


shhh. don't make it easier for the bots.


Boilerplate code is code you shouldn't have to write, or that is cumbersome to write.

Like getters and setters in Java for all your attributes.

I would never consider imports boilerplate code.
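To translate the Java getter/setter point into Python terms (just an illustration): a dataclass generates the per-attribute boilerplate that would otherwise be written by hand:

```python
from dataclasses import dataclass

# Hand-written version: __init__ and __eq__ must be spelled out
# for every attribute -- classic boilerplate.
class ManualUser:
    def __init__(self, name: str, age: int):
        self.name = name
        self.age = age

    def __eq__(self, other):
        return (self.name, self.age) == (other.name, other.age)

# Same class; __init__, __repr__, and __eq__ are generated
# from the field list.
@dataclass
class User:
    name: str
    age: int
```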


If some bits of meaningful code (the essential complexity) imply other bits that are mostly accidental complexity, writing the latter by hand is a waste of time.

But isn't it also a waste to use data centers full of GPUs to process terabytes of text to accomplish the same thing better programming language design could?


We don't spend that much GPU compute just to generate random boilerplate code, but to create the real code.

If it weren't beneficial for writing code (which it is), we wouldn't use it.

But yes, if we could create better languages or systems, it would be a waste. We have tried multiple new programming languages, and we have no-code platforms, etc.

It does look, though, like LLMs are still better than all of those approaches.


I have to write very little boilerplate code as it is with the tooling I choose. And a lot of it is generated by scripts using some input from me. I don't need cloud GPUs to write code at all.


I don't need it either.

But that's not the point of it anyway?

It's about writing code faster and potentially better. Cloud GPUs can also generate unit tests, etc.

I primarily use it for languages I don't use often enough; nonetheless, it's only a question of time until it doesn't make sense anymore to write code yourself.


> It's about writing code faster and potentially better.

You and I seem to have different values. I have never desired quicker things that are worse.


Me neither.



