No, it really isn't. Repeatedly, people try to pass off GPT's work as good without actually verifying the output. I keep seeing "look at this wonderful script GPT made for me to do X", and it does not pass code review, and is generally of extremely low quality.
In one example, a bash script was generated to count the number of SLoC changed per author; it was extremely convoluted, and after I simplified it, I noticed that the output of the simplified version differed, because the original was omitting changes that touched only a single line.
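For reference, the whole task fits in one pass of awk over the git log. A minimal sketch of my own (not the script under review), assuming "changed" means added + deleted lines per `--numstat`:

    # Count lines changed (added + deleted) per author, in a single pass.
    git log --numstat --format='author %an' |
      awk '
        /^author /       { sub(/^author /, ""); author = $0; next }  # per-commit header line
        $1 ~ /^[0-9]+$/  { changed[author] += $1 + $2 }              # numstat: added<TAB>deleted<TAB>path
        END              { for (a in changed) printf "%8d  %s\n", changed[a], a }
      '

Whether you count additions, deletions, or both is a definitional choice; the point is that it's one pipeline, not a multi-pass script.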
In another example, it took several rounds of back-and-forth during review, asking "where are you getting this code? why do you think this code works, when nothing in the docs supports that?", before it was admitted that GPT wrote it. The dev who wrote it would have been far better served by RTFM than by a multi-cycle review that ended with most of GPT's hallucinations being stripped from the PR.
Those who think an LLM's output is good have not reviewed that output strenuously enough.
> Is there some expectation that these things won't improve?
Because randomized token generation inherently lacks actual reasoning about the behavior of the code. My code generator does not.
I think fundamentally if all you do is glue together popular OSS libraries in a well-understood way, then yes, you may be replaced. But really, you probably could be replaced by a WordPress plugin at that point.
The moment you have some weird library that 4 people in the world know (which happens more than you’d expect), or, hell, even something without a lot of OSS code, what exactly is an LLM going to do? How is it supposed to predict code that’s not derived from its training set?
My experience thus far is that it starts hallucinating and it’s not really gotten any better at it.
I’ll continue using it to generate sed and awk commands, but I’ve yet to find a way to make my life easier with the “hard bits” I want help with.
> I’ll continue using it to generate sed and awk commands,
The first example I gave was an example of someone using an LLM to generate sed & awk commands, and it failed spectacularly, on everything from the basics to higher-level stuff. The emitted code did include awk, and the awk was poor quality: e.g., it stored the git log output and made several passes over it with awk, when in reality you could just `git log | awk`; it was doing `... | grep | awk`, which, if you know awk, really isn't required. The regex it used to parse the git log output with awk was wrong, resulting in the wrong output. Even trivial "sane bash"-isms it messed up: it didn't quote variables that needed to be quoted, didn't take advantage of bashisms even though the shebang required bash, etc.
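To make the grep/awk and quoting complaints concrete, these are illustrative shapes only, not lines from the reviewed script:

    # grep feeding awk is redundant; awk can do the filtering itself:
    git log | grep '^Author:' | awk '{ print $2 }'   # extra process for no reason
    git log | awk '/^Author:/ { print $2 }'          # same result, one filter

    # and unquoted expansions word-split and glob-expand:
    subject=$(git log -1 --format=%s)
    printf '%s\n' $subject      # breaks if the subject contains spaces or *
    printf '%s\n' "$subject"    # prints the subject as one line, as intended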
The task was a simple one, bordering on trivial, and any way you cut it, from "was the code correct?" to "was the code high quality?", it failed.
But it shouldn't be terribly surprising that an LLM would fail at writing decent bash: its input corpus would resemble bash found on the Internet, and IME, most bash out there fails to follow best-practice; the skill level of the authors probably follows a Pareto distribution due to the time & effort required to learn anything. GIGO, but with way more steps involved.
I have other examples, e.g., involving Kubernetes, which is also not in the category of "4 people in the world know". I asked, "how do I get the replica number from a pod in a statefulset?" (i.e., the -0, -1, etc., at the end of the pod name), and was told to query
.metadata.labels.replicaset-序号
(It's just nonsense; not only does no such label exist for what I want, it certainly doesn't exist with a Chinese name. AFAICT, that label name did not appear on the Internet at the time the LLM generated it, although it does, of course, now.) Again, simple task, wide amount of documentation & examples in the training set, and garbage output.
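For what it's worth, the ordinal isn't exposed via a label like that at all (newer Kubernetes releases do add an apps.kubernetes.io/pod-index label, IIRC, but nothing resembling the hallucinated one); the usual answer is to strip the suffix off the pod name. A minimal sketch, with web-2 as a placeholder pod name:

    # Inside the pod: the hostname is the pod name, so the ordinal is its numeric suffix.
    ORDINAL=${HOSTNAME##*-}
    echo "$ORDINAL"

    # From outside, the same trick on the pod name:
    POD_NAME=web-2
    echo "${POD_NAME##*-}"      # -> 2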
You ask what the LLM is going to do. It’s going to swallow the entire code base in context and allow any developer to join those 4 people in generating production grade code.
> Imagine code passes your rigorous review. How do you know that it wasn’t from an LLM?
That's not what I'm saying (and it's a strawman; yes, presumably some LLM code would escape review and I wouldn't know it's from an LLM, though I find that unlikely, given…). What I'm saying is: of the LLM-generated code that is reviewed, what is the quality & correctness of that code? And it's resoundingly (easily >90%) crap.
Obviously we can't sample code of unknown authorship … nor am I trying to; I'm sampling problems that I and others run through an LLM, and the output thereof.
The other facet of this point is that I believe a lot of the craze among people using the LLM is driven by them not looking closely at the output: if you're just taking code from the LLM, chucking it over the wall, and calling it a day (as was the case in one of the examples in the comment above), you perceive the LLM as useful, when in fact it is leaving behind bugs that you're either not attributing to it, or that someone else is cleaning up (again, that was the case in the above example), etc.
> What I'm saying is: of the LLM-generated code that is reviewed, what is the quality & correctness of that code? And it's resoundingly (easily >90%) crap.
What makes you so sure that none of the resoundingly non-crap code you have reviewed was produced by an LLM?
It’s like saying you only like homemade cookies, not ones from the store. But you may be gleefully chowing down on cookies that you believe are homemade, because you like them (so they must be homemade), without knowing they actually came from the store.
> I'm sampling problems that I and others run through an LLM
This is not what’s happening unless 100% of the problems they’ve sampled (even outside of this fun little exercise) have been run through an LLM.
They’re pretending that it doesn’t matter that they’re looking at untold numbers of other problems without being aware of whether those are LLM-generated or not.