Yeah. I just need to babysit it too much. Take Copilot: it gives good suggestions and sometimes blows me away with a block of code that's exactly what I'd type. But actively letting it code (at least with gpt4.1 or gpt4o) just doesn't work well enough for me. Half of the time it doesn't even compile, and after fixing that it's just not really working correctly either. I'd expect it to work like a very junior programmer, but it works like a very drunk senior programmer that isn't listening to you very well at all.
Yes yes yes we're all aware that these are word predictors and don't actually know anything or reason. But these random dice are somehow able to give reasonably seemingly well-educated answers a majority of the time and the fact that these programs don't technically know anything isn't going to slow the train down any.
I just don't get why people say they don't reason. It's crazy talk. The KV cache is effectively a unidirectional Turing machine, so it should be possible to encode "reasoning" in there. And the evidence shows that LLMs occasionally do some light reasoning. Just because they're not great at it (hard to train for, I suppose) doesn't mean they do none at all.
Would I be crazy to say that the difference between reasoning and computation is sentience? This is an impulse with no justification but it rings true to me.
Taking a pragmatic approach, I would say that if the AI accomplishes something that, for humans, requires reasoning, then we should say that the AI is reasoning. That way we can have rational discussions about what the AI can actually do, without diverting into endless discussions about philosophy.
Suppose A solves a problem and writes the solution down. B reads the answer and repeats it. Is B reasoning, when asked the same question? What about one that sounds similar?
The crux of the problem is "what is reasoning?" Of course it's easy enough to call the outputs "equivalent enough" and then use that to say the processes are therefore also "equivalent enough."
I'm not saying it's enough for the outputs to be "equivalent enough."
I am saying that if the outputs and inputs are equivalent, then that's enough to call it the same thing. It might be different internally, but that doesn't really matter for practical purposes.
I think one of the great lessons of our age will be that things being apparently equivalent, or in more applied terms "good enough," is not the same as actual equality.
In my experience PhDs are not 10x as productive. Quite the opposite, actually. Too much theory and not much practicality. The only two developers my company has fired for (basically) incompetence were PhDs in Computer Science. They couldn't deliver practical, real code.
"Ketamine has been found to increase dopaminergic neurotransmission in the brain"
This property is likely an important driver of ketamine abuse and of it being rather strongly 'moreish', as well as of the subjective experience of strong expectation during a 'trip': the tendency to develop redose loops approaching unconsciousness in a chase to 'get the message from the goddess' or whatever, which seems just out of reach (because it's actually a feeling of expectation, not a partially installed divine T3 rig).
The “multiple PhDs” thing is interesting. The point of a PhD is to master both a very specific subject and the research skills needed to advance the frontier of knowledge in that area. There’s also plenty of secondary issues, like figuring out the politics of academia and publishing enough to establish a reputation.
I don’t think models are doing that. They certainly can retrieve a huge amount of information that would otherwise only be available to specialists such as people with PhDs… but I’m not convinced the models have the same level of understanding as a human PhD.
It’s easy to test, though: the models simply have to write and defend a dissertation!
Totally disagree. The current state of coding AIs is “a level 2 product manager who is a world class biker balancing on a unicycle trying to explain a concept in French to a Spanish genius who is only 4 years old.” I’m not going to explain what I mean, but if you’ve used Qwen Code you understand.
Qwen Code is really not representative of the state of the art, though. With the right prompt I have no problem getting Claude to output a complete codebase (e.g. a non-trivial library interfacing with multiple hardware devices) to the specs I want, in modern C++, that builds, runs, and has documentation and unit tests sourced from data sheets and manufacturer specs from the get-go.
Assuming there aren't tricky concurrency issues and the documentation makes sense (you know what registers to set to configure and otherwise work the device), device drivers are the easiest thing in the world to code.
There's the old trope that systems programmers are smarter than applications programmers, but SWE-Bench puts the lie to that. Sure, SWE-Bench problems are all in the language of software; applications programmers take badly specified tickets in the language of product managers, testers, and end users and have to turn them into the language of SWE-Bench to get things done. I am not that impressed with 65% performance on SWE-Bench, because those are not the kind of tickets I have to resolve at work; rather, at work, if I want to use AI to help maintain a large codebase, I need to break the work down into that kind of ticket.
> device drivers are the easiest thing in the world to code.
Except the documentation lies, and in reality your vendor shipped you a part with timing that is slightly out of sync with what the doc says, and after 3 months of debugging, including using an oscilloscope, you figure out WTF is going on. You report back to your supplier, and after two weeks of them not saying anything they finally reply that the timings you have reverse-engineered are indeed the correct timings, sorry for any misunderstanding with the documentation.
As an applications engineer, my computer doesn't lie to me, and memory generally stays at the value I set it to unless I did something really wrong.
Backend services are the easiest thing in the world to write. I am 90% sure that all the bullshit around infra is just artificial job security, and I say this as someone who primarily does backend work nowadays.
I'm not sure if this counts as systems or application engineering, but if you think your computer doesn't lie to you, try writing an nginx config. Those things aren't evaluated at /all/ the way they look like they are: location matching goes by match type and specificity rather than top-to-bottom order, and "if" inside a location block has famously surprising semantics.
At no point have any of my nginx files ever flipped their own bits.
Are they a constant source of low level annoyance? Sure. But I've never had to look at a bus timing diagram to understand how to use one, nor worried about an nginx file being rotated 90 degrees and wired up wrong!
To some extent, for sure. The fact that electronics engineers that have picked up a bit of software write a large fraction of the world's device drivers does point to it not being the most challenging of software tasks, but on the other hand the real 'systems engineering' is writing the code that lets those engineers do so successfully, which I think is quite an impressive feat.
I was joking! Claude Code is still the best afaik, though I’d compare it more to “sending a 1440p HDR fax of your user story to a 4-armed mime whose mind is then read by an Aztec psychic who has taken just the right amount of NyQuil.”
Probably the saddest comment I've read all day. Crafting software line-by-line is the best part of programming (maybe when dealing with hardware devices you can instead rely on auto-generated code from the register/memory region descriptions).
How long would that be economically viable when a sufficient number of people can generate high-quality code in 1/10th the time? (Obviously, it will always be possible as a hobby.)
> But actively letting it code (at least with gpt4.1 or gpt4o)
It's funny, GitHub Copilot puts these models in the 'bargain bin' (they are free in 'ask' mode, whereas the other models count against your monthly limit of premium requests), and it's pretty clear why: they seem downright nerfed. They're tolerable for basic questions, but you wouldn't use them if price weren't a concern.
Brandwise, I don't think it does OpenAI any favors to have their models be priced as 'worthless' compared to the other models on premium request limits.
With something like Devin, where it integrates directly with your repo and generates documentation based on your project(s), it's much more productive to use it as an agent. I can delegate 4-5 small tasks that would normally take me a full day or two (or three) of context switching and mental preparation, and knock them out in less than a day because it did 50-80% of the work, leaving only a few fixes or a small pivot for me to wrap them up.
This alone is where I get a lot of my value. Otherwise, I'm using Cursor to actively solve smaller problems in whatever files I'm currently focused on. Being able to refactor things with only a couple sentences is remarkably fast.
The more you know about your language's features (and their precise names), and about higher-level programming patterns, the better time you'll have with LLMs, because it matches up with real documentation and examples with more precision.
> Being able to refactor things with only a couple sentences is remarkably fast.
I'm curious, this is js/ts? Asking because depending on the lang, good old machine refactoring is either amazeballs (Java + IDE) or non-existent (Haskell).
I'm not a js/ts dev, so I don't know what the state of machine refactoring is in VS Code... but if it's as good as Java's, then "a couple of sentences" is quite slow compared to a keystroke or a quick dialog box with completion of symbol names.
I'm using TypeScript. In my case these refactors are usually small, spanning up to 5 files depending on how interdependent things are. The benefit of an agent is its ability to find and fix related side effects caused by the refactor (broken type-safety, broken translation strings, etc.), and to rename related things, like an actual UI string tied to the naming of what I'm working on if my changes happened to include a rename.
It's not always right, but I find it helpful when it finds related changes that I should be making anyway, but may have overlooked.
Another example: selecting a block that I need to wrap (or unwrap) with tedious syntax, say I need to memoize a value with a React `useMemo` hook. I can select the value, open Quick Chat, type "memoize this", and within milliseconds it's correctly wrapped and saved me lots of fiddling on the keyboard. Scale this to hundreds of changes like these over a week, it adds up to valuable time-savings.
Even more powerful: selecting 5, 10, 20 separate values and typing: "memoize all of these" and watching it blast through each one in record time with pinpoint accuracy.
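To make the "memoize this" edit concrete, here's roughly the before/after I mean - a minimal TypeScript/React sketch with a made-up filterItems helper, not code from my actual project:

    import { useMemo, useState } from "react";

    // Made-up helper, just for illustration.
    function filterItems(items: string[], query: string): string[] {
      return items.filter((item) => item.includes(query));
    }

    function ItemList({ items }: { items: string[] }) {
      const [query, setQuery] = useState("");

      // Before: const visibleItems = filterItems(items, query);
      // After "memoize this": wrapped in useMemo so it only recomputes
      // when items or query actually change.
      const visibleItems = useMemo(() => filterItems(items, query), [items, query]);

      return (
        <div>
          <input value={query} onChange={(e) => setQuery(e.target.value)} />
          <ul>
            {visibleItems.map((item) => (
              <li key={item}>{item}</li>
            ))}
          </ul>
        </div>
      );
    }

The wrapper itself is trivial; the win is not hand-typing the useMemo call and dependency array dozens of times a week.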
IntelliJ has keyboard shortcuts for all of these. I think how impressed you are by AI depends a lot on the quality of the tooling you were previously working with.
Work is. I actually don't have access to our billing, so I couldn't tell you exactly, but it depends on how many ACUs (Agent Compute Units) you've used.
We use a Team plan ($500/mo), which includes 250 ACUs per month. Each bug or small task consumes anywhere from 1 to 3 ACUs, and fewer units are consumed if you're more precise with your prompt upfront. A larger prompt will usually use fewer ACUs, because follow-up prompts cause Devin to run more checks to validate its work. It can run scripts, compilers, linters, etc. in its own VM -- all of that contributes to usage. It can also run E2E tests in a browser instance and validate UI changes visually.
They recommend most tasks should stay under 5 ACUs before it becomes inefficient. I've managed to give it some fairly complex tasks while staying under that threshold.
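If it helps, here's rough back-of-envelope math from the numbers above (my own calculation, not anything official from Devin):

    // Rough cost-per-task math from the plan numbers quoted above.
    const monthlyPriceUsd = 500;   // Team plan
    const includedAcus = 250;      // per month
    const costPerAcu = monthlyPriceUsd / includedAcus;      // $2 per ACU

    const smallTaskAcus: [number, number] = [1, 3];         // typical range
    const recommendedMaxAcus = 5;                           // efficiency cutoff

    console.log(`$${costPerAcu} per ACU`);                  // $2
    console.log(
      `$${smallTaskAcus[0] * costPerAcu}-$${smallTaskAcus[1] * costPerAcu} per small task`,
    );                                                      // $2-$6
    console.log(`<$${recommendedMaxAcus * costPerAcu} to stay efficient`); // <$10

So roughly $2-6 per small task, and somewhere around 80-250 of them per month before the included ACUs run out.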
>I'd expect it to work like a very junior programmer, but it works like a very drunk senior programmer that isn't listening to you very well at all.
Best analogy I've ever heard and it's completely accurate. Now, back to work debugging and finishing a vibe coded application I'm being paid to work on.
I think there are three factors to this: 1. What to code (longer, more specific prompts are better but take longer to write). 2. How to code it (specify languages, libraries, APIs, etc.). 3. How well what you're using is represented in the training data: if you're trying to write code that uses a newer version of a library that works differently from what's most commonly documented, it's a long uphill battle of constantly reminding the LLM of the new changes.
If you're not specific enough, it will definitely spit out a half-baked pseudocode file where it expects you to fill in the rest. If you don't specify certain libraries, it'll use whatever is featured in the most blogspam. And if you're in an ecosystem that isn't publicly well-documented, it's near useless.
Two other observations I've found working with ChatGPT and Copilot:
First, until I can re-learn boundaries, they are a fiasco for work-life balance. It's way too easy to have a "hmm what if X" thought late at night or first thing in the morning, pop off a quick ticket from my phone, assign to Copilot, and then twenty minutes later I'm lying in bed reviewing a PR instead of having a shower, a proper breakfast, and fully entering into work headspace.
And on a similar thread, Copilot's willingness to tolerate infinite bikeshedding and refactoring is a hazard for actually getting stuff merged. Unlike a human colleague who loses patience after a round or two of review, Copilot is happy to keep changing things up and endlessly iterating on minutiae. Copilot code reviews are exhausting to read through because it's just so much text, so much back and forth, every little change with big explanations, acknowledgments, replies, etc.
I've found this with Claude Code too. It has nonstop energy (until you run out of tokens) and is always a little too eager to make random edits, which means it's somehow very tiring to use even though you're not doing anything.
But it is the most productive intern I've ever pair programmed with. The real ones hallucinate about as often too.
If I want to throw a shuriken that obeys some artificial, magic Magnus-like force, as in the movie Wanted, both ChatGPT and Claude let me down using pygame. And what if I wanted C-level performance, or wanted to use Zig? Burp.
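To be clear about what I mean by an artificial Magnus force: it's just a sideways push perpendicular to the velocity, scaled by spin. A toy sketch of the update loop I was after (in TypeScript here rather than pygame, with made-up constants and scaling):

    // Toy 2D projectile with an artificial Magnus-style force: a sideways push
    // perpendicular to the velocity, proportional to spin. Constants are made up.
    type Vec2 = { x: number; y: number };

    function step(pos: Vec2, vel: Vec2, spin: number, dt: number): [Vec2, Vec2] {
      const speed = Math.hypot(vel.x, vel.y) || 1;
      // Unit vector perpendicular to the velocity (rotate 90 degrees).
      const perp: Vec2 = { x: -vel.y / speed, y: vel.x / speed };

      const magnusStrength = 0.8;                              // arbitrary "magic" constant
      const ax = magnusStrength * spin * speed * perp.x;
      const ay = magnusStrength * spin * speed * perp.y + 9.81; // +y is down in screen coords

      const newVel = { x: vel.x + ax * dt, y: vel.y + ay * dt };
      const newPos = { x: pos.x + newVel.x * dt, y: pos.y + newVel.y * dt };
      return [newPos, newVel];
    }

    // The shuriken curves instead of flying straight.
    let pos: Vec2 = { x: 0, y: 0 };
    let vel: Vec2 = { x: 40, y: -20 };
    for (let i = 0; i < 10; i++) {
      [pos, vel] = step(pos, vel, /* spin */ 1.5, 0.1);
      console.log(pos.x.toFixed(1), pos.y.toFixed(1));
    }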
It works like the average Microsoft employee, like some doped version of an orange-wig wearer who gets votes because his daddies kept the population as dumb as it gets after the dotcom-x-Facebook era. In essence, the ones to be disappointed by are the Chan-Zuckerbergs of our time. There was a chance, but there also was what they were primed for.
What does it really mean to know something or understand something? I think AI knows a great deal (associating facts with symbols), confabulates at times when it doesn't know (which is dishonestly called hallucination, implying a conscious agent misperceiving, which AI is not), and understands almost nothing.
The best way to think of chat bot "AI" is as a compendium of human intelligence as recorded in the books and online media available to it. It is not intelligent at all on its own, and its judgement can't be better than its human sources because it has no biological drive to synthesize and excel. It's best to think of AI as a librarian of human knowledge, or an interactive Wikipedia, designed to seem like an intelligent agent but actually not one.
One cannot learn everything from books, and in any case many books contradict each other, so every developer is a variation based on what they have read, experienced, and thought along the way. How can that get summed up into one thing? It might not even be useful to do that.
I suspect that some researchers with a very different approach will come up with a neural network that learns and works more like a human in the future, though. Not the current LLMs, but something with a much more efficient learning mechanism that doesn't require a nuclear power station to train.
What is baffling to me is how otherwise intelligent people don't really understand what human intelligence and learning are about. They are about a biological organism following its replication algorithm. Why should a computer program learn and work like a biological organism if it is in an entirely different environment with entirely different drives?
Intelligence is not some universal abstract thing achievable once a certain computational threshold is reached. Rather, it's a quality of the behavior patterns of specific biological organisms following their drives.
...because so far only our attempts to copy nature have proven successful...in that we have judged the result "intelligent".
There's a long history in AI where neural nets were written off as useless (Minsky was the famous destroyer of the idea, I think) and yet in the end they blew away the alternatives completely.
We have something now that's useful in that it is able to glom together a huge amount of knowledge, but the cost of doing so is tremendous, and therefore in many ways it's still ridiculously inferior to nature, because it's only a partial copy.
A lot of science fiction has assumed that robots, for example, would automatically be superior to humans - but are robots self-repairing or self-replicating? I was reading recently about how the reasons many developers like Python are the reasons it can never be made fast. In other words, you cannot have everything - all features come at a cost. We will probably have AIs that are less human in some ways and more human in others, because they will offer us different trade-offs.
To date, I've not been able to effectively use Copilot in any projects.
The suggestions were always unusably bad. The /fix results were always obviously and straight-up wrong unless it was a super silly issue.
Claude Code with Opus model, on the other hand, was mind-blowing to me and changed my opinion of LLMs for coding on almost every point.
You still need to grow the skill of how to build the context and formulate the prompt, but the built-in execution loop is a complete game changer, and I didn't realize that until I actually used it effectively on a toy project myself.
MCP in particular was another thing I always thought was massively overhyped, until I actually started to use some in the same toy project.
Frankly, the building blocks already exist at this point to make a vast majority of all jobs redundant (and I'm thinking about all grunt-work office jobs, not coding in particular). The tooling still needs to be created, so I'm not seeing a short-term realization (<2 yrs), but medium term (5+ yrs)?
You should expect most companies to let people go in staggering numbers, with only small numbers of highly skilled people left to administer the agents
> You should expect most companies to let people go in staggering numbers, with only small numbers of highly skilled people left to administer the agents
I don't buy that. The linked article makes a solid argument for why that's not likely to happen: agentic loop coding tools like Claude Code can speed up the "writing code and getting it working" piece, but the software development lifecycle has so much other work before you get to the "and now we let Claude Code go brrrrrrr" phase.
These are exactly the people that are going to stay, medium term.
Let's explore a fictional example that somewhat resembles my, and I suspect a lot of people's, current day job.
A microservice architecture: each team administers 5-10 services, and the whole application, which is once again only a small part of the platform as a whole, is developed by maybe 100-200 devs. So something like ~200 microservices.
The application architects are gonna be completely safe in their jobs. And so are the lead devs in each team - at least from my perspective. Anyone else? I suspect MBAs in 5 yrs will not see their value anymore. That's the vast majority of all devs, so it's likely going to cost 50% of devs their jobs. And middle management will be slimmed down just as quickly, because you suddenly need a lot fewer managers.
Let’s take this to the extreme - why would the company exist in the first place? The customers of said company pay them because they don’t do the service themselves, but in a future where it’s laughably easy to vibe code anything your heart desires, those customers will just build the service they used to outsource themselves!
tl;dr: in a future where vibe coding works 100% of the time, logically the only companies that will exist are the ones with processes AI can’t do, because all the other parts of the supply chain can be done in-house
That scenario is a lot further out compared to what I was talking about.
It's conceivable that that's going to happen, eventually. But that would likely require models a lot more advanced than what we have now.
The agent approach, with lead devs administering and merging the code the agents made, is feasible with today's models. The missing part is the tooling around the models and the development practices that standardize this workflow.
That's what I'd expect to take around 5 yrs to settle.
Thanks for this perspective, but I am a bit confused by some of your takes: you used "Claude Code with Opus model" in "the same toy project" with great success, which led you to conclude that this will "make a vast majority of all jobs redundant".
Toy project viability does not connect with making people redundant in the process (ever, really) - at least not for me. Care to elaborate on where you draw the optimism from?
I cannot use it on my production code base. I'm working for a company that requires the devs to code from virtual workplaces, which is a fancy term for virtual machines running in the Azure cloud. These are completely locked down, and anything but Copilot is forbidden, enforced via firewall and process monitoring. I can still use Sonnet 3.7 through that, but that's a far cry from my experience on my personal time with Claude Code.
I called it a toy project because I'm not earning money with it - hence it's a toy.
It does have medium complexity with roughly 100k loc though.
And I think I need to repeat myself, because you seem to read something into my comment that I didn't say: that the building blocks exist doesn't mean today's tooling is sufficient for this to play out today.
I did not miss the time horizon: this is why I put a remark of "ever, really".
"Toy project" is usually used in a different context (demonstrate something without really doing something useful): yours sounds more like a "hobby project".
That's a good point. I've actually implemented the same project over 20 times at this point.
At the heart is my hobby of reading web and light novels. I've been implementing various versions of a scraper and ePub reader for over 15 years now, ever since I started working as a programmer.
I've been reimplementing it over the years with the primary goal of growing my experience/ability. In the beginning it was a plain Django app, but it grew from that to various languages and stacks: Elixir, Java (multiple times with different architecture approaches), native Android, and JS/TS frontend and sometimes backend - React, Angular, tRPC, Svelte, TanStack, and more.
So I know exactly how to implement it, as I've gone through a lot of versions of the same functionality.
And the last version I implemented (TanStack) was in July, via Claude Code, and it got to feature parity (and more) within roughly 3 weeks.
And I might add: I'm not positive about this development either, at all. I'm just expecting it to happen, to the detriment of our collective futures (as programmers).
> You should expect most companies to let people go in staggering numbers, with only small numbers of highly skilled people left to administer the agents
I'm gonna pivot to building bomb shelters maybe
Or stockpiling munitions to sell during the troubles
Maybe some kind of protest-support SaaS. Molotov deliveries as a service: you still have to light them and throw them, but I guarantee next-day delivery and they will be ready to deploy into any data center you want to burn down.
What I'm trying to say is that "companies letting people go in staggering numbers" is a societal failure state, not an ideal.
I find it so weird how many engineers seem positively giddy to get replaced by a chatbot that functionally cannot do the job. I'll help your Molotovs-as-a-service startup; free guillotine with every 6th order.
So what happens when someone calls in and the "AI" answers (because the receptionist has been fired and replaced by "AI"), and the caller asks to access some company record that should be private? Will the LLM always deny the request? Hint: no, not always.
There are so many flaws in your plan, I have no doubt that "AI" will ruin some companies that try to replace humans with a "tin can". LLMs are being inserted loosey-goosey into too many places by people that don't really understand the liability problems it creates. Because the LLM doesn't think, it doesn't have a job to protect, it doesn't have a family to feed. It can be gamed. It simply won't care.
The flaws in "AI" are already pretty obvious to anyone paying attention. It will only get more obvious the more LLMs get pushed into places they really do not belong.
The human receptionist can use critical thinking, and self preservation to prevent a bad outcome. The LLM can not. When a person causes a problem, they can be fired, and learn from the event. The LLM will not learn from it. And who is responsible then? The company providing the LLM? The more LLM use becomes pervasive, the taller the house of cards gets.
> until I actually started to use some in the same toy project
That's the key right there. Try using it in a project that handles PII, needs data to be exact, or has many dependencies/libraries and needs to not break for critical business functions.