> LLMs never provide code that passes my sniff test
If that statement isn't coming from ego, then where is it coming from? It's provably true that LLMs can generate working code. They've been trained on billions of examples.
Developers seem to focus on the cases where LLMs produce code that doesn't work, and use that as evidence that these tools are "useless".
My experience so far has been: if I know what I want well enough to explain it to an LLM, then it's been easier for me to just write the code myself. Iterating on prompts, reading and understanding the LLM's code, validating that it works, and fixing bugs is still time-consuming.
It has been interesting as a rubber duck, for exploring a new topic or language, and for some code golf, but so far not for production code.
Okay, but as soon as you need to do the same thing in [programming language you don't know], then it's not easier for you to write the code anymore, even though you understand the problem domain just as well.
Now, understand that most people don't have the same grasp of [your programming language] that you have, so it's probably not easier for them to write it.
> It's provably true that LLMs can generate working code
They produce mostly working code, often with odd design decisions that the bot can't fully justify.
The difficult part of coding is cultivating a mental model of the problem + solution space. When you let an LLM write code for you, your mental model falls behind. You can read the new code closely, internalize it, double-check the docs, and keep your mental model up to date (which takes longer than you think), or you can forge ahead, confident that it all more or less looks right. The second option is easier, faster, and very tempting, and it is the reason why various studies have found that code written with LLM assistance introduces more bugs than code written without.
There are plenty of innovations that have made programming a little bit faster (autocomplete, garbage collection, what-have-you), but none of them were a silver bullet. LLMs aren't either. The essential complexity of the work hasn't changed, and a chat bot can't manage it for you. In my (admittedly limited) experience with code assistants, I've found that the faster you move with an LLM, the more time you have to spend debugging afterwards, and the more difficult that process becomes.
there's a lot more involved in senior dev work beyond producing code that works.
if the stakeholders knew what they needed to build and how, then they could use LLMs, but translating complex requirements into code is something that these tools are not even close to cracking.
> there's a lot more involved in senior dev work beyond producing code that works.
Completely agree.
What I don't agree with is statements like these:
> LLMs never provide code that passes my sniff test
To me, these (false) absolute claims about chatbot capabilities are rehashed so frequently that they derail every conversation about using LLMs for dev work. You'll find similar statements in nearly every thread about LLMs and coding tasks.
It's provably true that LLMs can produce working code. It's also true that an increasingly large portion of coding is being offloaded to LLMs.
In my opinion, developers need to grow out of this attitude that they are John Henry and they'll outpace the mechanical drilling machine. It's a tired conversation.
> It's provably true that LLMs can produce working code.
You've restated this point several times, but the reason it's not more convincing to many people is that simply producing code that works is rarely the actual goal on many projects. On larger projects it's much more about producing code that is consistent with the rest of the project, easily extensible, readable for your teammates, easy to debug when something goes wrong, testable, and so on.
The code working is a necessary condition, but is insufficient to tell if it's a valuable contribution.
The code working is the bare minimum. The code being right for the project and context is the basic expectation. The code being _good_ at solving its intended problem is the desired outcome, which is a combination of tradeoffs between performance, readability, ease of refactoring later, modularity, etc.
LLMs can sometimes provide the bare minimum. Then you have to refactor and massage it all the way to the good part, but unlike looking up other people's work on something like Stack Overflow, with the LLM's code I have no context for why it "thought" that was a good idea. If I ask it, it may parrot something from the relevant training set, or it might be bullshitting completely. The end result? This is _more_ work for a senior dev, not less.
Hence it has never passed my sniff test. Its code is, at best, of a quality that even junior developers wouldn't open a PR for yet. Or if they did, they'd be asked to explain the how and why, and would quickly learn not to open code for review before they've properly considered the implications.
> It's provably true that LLMs can produce working code.
This is correct, but it's also true that LLMs can produce flawed code. To me, the cost of telling whether code is correct or flawed is larger than the cost of just writing correct code myself. This may be an AuDHD thing, but I can better judge the correctness of a solution if I'm watching (and doing) the making of that solution than if I'm reading it after the fact.
As a developer, while I do embrace IntelliSense, I don't copy/paste code, because I find typing it out is a fast path to reflection and to finding issues early. Copilot seems to be no better than mindlessly copy/pasting from Stack Overflow.
From what I've seen of Copilot-style tools, while they can produce working code, I've not seen them offer much beyond surface-level code that is fast enough for me to type myself. I am also deeply perturbed by some interviews I've done recently for senior candidates who use them: when asked to disable them for a collaborative coding task, they completely fall apart because of their dependency on the tools over their own knowledge.
This is not to say I do not see value in AI, LLMs or ML (I very much do). However, I code broadly at the speed of thought, and that's not really something I think will be massively aided by it.
At the same time, I know I am an outlier in my practice relative to lots around me.
While I don't doubt other improvements that may come from LLM in development, the current state of the art feels less like a mechanical drill and more like an electric triangle.
Code is a liability, not an asset. It is a necessary evil to create functional software.
Senior devs know this, and factor code down to the minimum necessary.
Junior devs and LLMs think that writing code is the point and will generate lots of it without worrying about things like leverage, levels of abstraction, future extensibility, etc.
The code itself, whether good or bad, is a liability. Just like a car is a liability: in a perfect world you'd teleport yourself to your destination, but instead you have to drive. And because of that, roads and gas stations have to be built, you have to take care of the car, and so on. It's all a huge pain. The code you write, you will have to document, maintain, extend, refactor, and relearn, among other things. So you do your best to have only the bare minimum to take care of. Anything else is just future trouble.
Sure, I don’t dispute any of that. But it's not a given that using LLMs means you’re going to have unnecessary code. They can even help to reduce the amount of code. You just have to be detailed in your prompting about what you do and don’t want, and work through multiple iterations until the result is good.
Of course, if you try to one-shot something complex with a single-line prompt, the result will be bad. This is why humans are still needed, and will be for a long time, IMO.
I'm not sure that's true. An LLM can code because it is trained on existing code.
Empirically, LLMs work best when doing completely "routine" coding tasks: CRUD apps, React components, etc., because there are lots of examples of those online.
I'm writing a data-driven query compiler and LLM code assistance fails hard, in both blatant and subtle ways. There just isn't enough training data.
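To make the contrast concrete, here is a hypothetical sketch (not the commenter's actual project, just an assumed illustration in Rust) of the flavor of code a query compiler involves, where semantic details such as NULL propagation are easy to get subtly wrong and public training examples are comparatively scarce:

```rust
// Hypothetical sketch of query-compiler-style code: a tiny expression AST
// and an evaluator over rows with nullable columns.
#[derive(Debug)]
enum Expr {
    Lit(i64),
    Col(usize),                 // reference to a column in the input row
    Add(Box<Expr>, Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

// Evaluate an expression against one row; None models a SQL-style NULL.
fn eval(expr: &Expr, row: &[Option<i64>]) -> Option<i64> {
    match expr {
        Expr::Lit(v) => Some(*v),
        Expr::Col(i) => row.get(*i).copied().flatten(),
        Expr::Add(a, b) => Some(eval(a, row)? + eval(b, row)?),
        // The `?` makes any NULL operand yield NULL (three-valued logic)
        // instead of comparing the Options directly; details like this are
        // where generated code tends to go subtly wrong.
        Expr::Eq(a, b) => Some((eval(a, row)? == eval(b, row)?) as i64),
    }
}

fn main() {
    // One known value and one NULL column.
    let row = vec![Some(2), None];

    let sum = Expr::Add(Box::new(Expr::Col(0)), Box::new(Expr::Lit(40)));
    println!("{:?}", eval(&sum, &row)); // Some(42)

    // Comparing against the NULL column yields NULL, not false.
    let cmp = Expr::Eq(Box::new(Expr::Col(1)), Box::new(Expr::Lit(0)));
    println!("{:?}", eval(&cmp, &row)); // None
}
```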
Another argument: if an LLM could function like a senior dev, it could learn to program in a new programming language given the language's syntax, docs, and API. In practice they cannot. It doesn't matter what you put into the context; LLMs just seem incapable of writing in niche languages.
Which to me says that, at least for now, their capabilities are based more on pattern identification and repetition than they are on reasoning.
Have you tried new or niche languages with Claude 3.5 Sonnet? I think if you give it docs with enough examples, it might do OK. Examples are crucial. I've seen it do well with CLI flags and arguments when given docs, which is a somewhat similar challenge.
That said, you’re right of course that it will do better when there’s more training data.
> It's provably true that LLMs can produce working code
ChatGPT, even now in late 2024, still hallucinates standard-library types and methods more often than not whenever I ask it to generate code for me. Granted, I don't target the most popular platforms (i.e., React/Node/etc.); I'm currently in a .NET shop, which is a minority platform now, but ChatGPT's poor performance is surprising given the overall volume and quality of .NET content and documentation out there.
My perception is that "applications" work is more likely to be automated away by LLMs/copilots because so much of it is so similar to everyone else's, so I agree with those who say LLMs are only as good as the number of examples of something online. Asking ChatGPT to write something in a less-trodden area, like Haskell or even a Windows driver, is frequently a complete waste of time, as whatever it generates is far beyond salvaging.
Beyond hallucinations, my other problem lies in the small context window, which means I can't simply provide all the content it needs for context. Once a project grows past hundreds of KB of significant source, I honestly don't know how we humans are meant to get LLMs to work on it. Please educate me.
I'll declare that I have no first-hand experience with GitHub Copilot and other systems because of the poor experiences I had with ChatGPT. As you're seemingly saying that this is a solved problem now, can you please provide some details on the projects where LLMs worked well for you (such as which model/service, project platform/language, the kinds of prompts, etc.)? If not, I'll remain skeptical.
> still hallucinates standard-library types and methods more often than not whenever I ask it to generate code for me
Not an argument, just unsolicited advice: my guess is you are asking it to do too much work at once. Make much smaller changes. Try to ask for roughly as much as you would put into one git commit (per best practices); for me that's usually editing a dozen or fewer lines of code.
> Once a project grows past hundreds of KB of significant source, I honestly don't know how we humans are meant to get LLMs to work on it. Please educate me.
Edit: the author of aider publishes the percentage of each release's code that was written by LLMs; it's been 70%+. But some problems are still easier to handle yourself. https://github.com/Aider-AI/aider/releases
Thank you for your response. I've asked these questions before in other contexts but never had a reply, so pretty much any online discussion about LLMs feels like I'm surrounded by people role-playing being on LinkedIn.
> It's provably true that LLMs can produce working code
Then why can't I see this magical code that gets produced? I mean a real, big application with a purpose and multiple dependencies, not yet another ReactJS todo list. I've seen comments like that a hundred times already, but not one repository that could be equivalent to what I currently do.
For me, the LLM experience has been that of a bad tool which calls functions that are obsolete or don't exist at all; not very earth-shattering.
> if the stakeholders knew what they needed to build and how, then they could use LLMs, but translating complex requirements into code is something that these tools are not even close to cracking.
They don't have to replace you to reduce headcount. They could increase your workload so that where they needed five senior developers, they can now do with maybe three. That's six of one, half a dozen of the other, because either way two developers lost a job, right?
Yeah. Code that works is a fraction of the aim. You also want code that a good junior can read and debug in the midst of a production issue, that is robust against new or updated requirements, that performs at least as well as the competition, and that uses appropriate libraries sparingly. You also need to be able to state when a requirement would loosen the conceptual cohesion of the code, and to push back on requirements that can already be achieved in just as easy a way.
> It's provably true that LLMs can generate working code.
From what I've seen of them, the good ones mostly produce OK code. Not terrible, and it usually works.
Although I like them even at that low-ish bar, and I find them to be both a time-saver and a personal motivation assistant, they're still a tool that needs a real domain expert to spot the mistakes they make.
> Developers seem to focus on the cases where LLMs produce code that doesn't work, and use that as evidence that these tools are "useless".
I do find it amusing how many humans turned out to be stuck thinking in boolean terms, dismissing the I in AGI and calling these tools "useless" because they "can't take my job". Same with the G in AGI: dismissing the breadth of something that speaks 50 languages, when humans who speak five or six languages are considered unusually skilled.
> If that statement isn't coming from ego, then where is it coming from? It's provably true that LLMs can generate working code. They've been trained on billions of examples.
I am pro-AI, and I'm probably even overvaluing what AI brings. However, for me, this doesn't work in more "esoteric" programming languages or in those with stricter rulesets, like Rust. LLMs produce fine JS code, since there's no compiler to satisfy, but C++ without undefined behaviour or Rust code that compiles is rare.
There's also no chance of LLMs providing compiling code if you're using a library version with a newer API than the one in the training set.
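As a concrete illustration of that kind of API drift (a hedged example using the rand crate, which is my assumption rather than something from the comment): rand 0.8 replaced the old two-argument gen_range with a single range argument, so code patterned on older examples no longer compiles.

```rust
// Sketch of API drift: rand 0.7 accepted `gen_range(low, high)`, while
// rand 0.8 takes a single range. A model trained mostly on pre-0.8 code
// will often emit the old call, which fails to compile against 0.8+.
use rand::Rng;

fn roll_die() -> u8 {
    let mut rng = rand::thread_rng();

    // Typical suggestion patterned on rand 0.7 (does not compile on 0.8+):
    // rng.gen_range(1, 7)

    // What rand 0.8+ actually requires:
    rng.gen_range(1..=6)
}

fn main() {
    println!("rolled a {}", roll_die());
}
```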
Working code some subset of the time, for popular languages. It's not good at Rust or other smaller languages. I tried asking it for some help with Janet and it was hopelessly wrong, even with prompting to try to get it to correct its mistakes.
Even if it did work, working code is barely half the battle.
> It's provably true that LLMs can generate working code.
Yeah, for simple examples, especially in web dev. As soon as you step outside those bounds, they make mistakes all the time.
As I said, they're still useful, because roughly correct but buggy code is often quite helpful when you're programming. But there's zero chance you can just say "write me a driver for the nRF905 using Embassy and embedded-hal" and get something working. Whereas I, a human, can do that.
How confident are you in the correctness of that code?
Because, one way or another, you're still going to need to become fluent enough in the problem domain and the API that you can fully review the implementation and make sure ChatGPT hasn't hallucinated any weird problems into it. And ChatGPT can't really explain its own work, so if anything seems funny, you're going to have to sleuth it out yourself.
And at that point, it's just 339 lines of code, including imports, comments, and misc formatting. How much time have you really saved?