This point, that "people were already saying six months ago that it was better than it had been six months before that," is regularly trotted out in threads like this as if it were some sort of trump card proving AI is just hype. It doesn't make sense to me. What else do you expect people to say about a rapidly improving technology? How does it help you distinguish technologies that are hype from those that are not?
I'm sure people were saying similar things about, say, aviation all through the first decades of the 20th century: "wow, those planes are getting better every few years"... "until recently planes were just gimmicks, but now they can fly across the English Channel!"... "I wouldn't have gotten in one of those death traps 5 years ago, but now I might consider it!" And different people were saying things like that at different times, because they had different views of the technology, different definitions of usefulness, different appetites for risk. It's just a wide range of voices talking in similar-sounding terms about a rapidly developing technology over a span of time.
This is just how people are going to talk about rapidly-improving technologies for which different people have different levels of adoption at different times. It's not a terribly interesting point. You have to engage with the specifics, I'm afraid.
The second half of that argument was not in this article. The author was just relating his experience.
For what it is worth, I have also gone from "this looks interesting" to "this is a regular part of my daily workflow" over the same 6-month period.
I’m a light LLM user myself and I still write most of the important code by myself.
Even I can see there has been a clear advancement in performance in the past six months. There will probably be another incremental step 6 months from now.
I use LLMs in a project that gives suggestions for a previously manual data-entry job. Six months ago the LLM suggestions were hit or miss. Using a recent model, it's over 90% accurate. Everything is still reviewed manually by humans, but having a recent model handle the grunt work has been game-changing.
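The shape of it is roughly the sketch below; the model name, prompt, and review queue are illustrative stand-ins, not our actual stack.

    # Rough sketch: the model drafts the entry, a human approves or corrects it.
    # Model name, prompt, and queue are illustrative, not the real pipeline.
    from openai import OpenAI

    client = OpenAI()

    def suggest_fields(raw_record: str) -> str:
        """Ask the model to pre-fill the data-entry fields for one record."""
        resp = client.chat.completions.create(
            model="gpt-4o",  # any recent model; older ones were hit or miss
            messages=[
                {"role": "system",
                 "content": "Extract vendor, date, and total from the record as "
                            "JSON. Leave a field empty if you are unsure."},
                {"role": "user", "content": raw_record},
            ],
        )
        return resp.choices[0].message.content

    def process(raw_record: str, review_queue: list) -> None:
        # Nothing is committed automatically; suggestions go to human review.
        review_queue.append({"record": raw_record,
                             "suggestion": suggest_fields(raw_record)})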
If people are drinking from a firehose of LinkedIn-style influencer hype posts, I can see why it's tiresome. I ignore those, and I think everyone else should too. There is real progress being made, though.
I think the rapid iteration and lack of consistency from the model providers is really killing the hype here. You see HN stories all the time about how things are getting worse, and folks' success with the major models seems to be diverging heavily.
The model providers should really start having LTS offerings (at least 2 years) that deliver consistent results regardless of load, IMO. Folks are tired of the treadmill and just want some stability here, and if the providers aren't going to offer it, llama.cpp will...
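In the meantime, about the only real "LTS" you can get is a pinned local model; a minimal sketch with llama-cpp-python (the path and parameters are just examples):

    # A pinned local GGUF never changes underneath you, regardless of load or
    # provider updates. Path and parameters are illustrative.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/pinned-model.gguf", n_ctx=4096)

    out = llm("Summarize this support ticket in one sentence:", max_tokens=128)
    print(out["choices"][0]["text"])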
Yeah, I hear this a lot. Do people genuinely dismiss that there has been step-change progress on a 6-12 month timescale? It's night and day; look at the benchmark numbers... "yeah, I don't buy it", ok, but then don't pretend you're being objective.
I think I'd be in the "don't buy it" camp, so maybe I can explain my thinking at least.
I don't deny at all that there have been huge improvements in LLMs over the last 6-12 months. I'm skeptical that the last 6 months have suddenly produced a 'category shift' in the kinds of problems LLMs can solve (I'm happy to be proved wrong!).
It seems to me like LLMs are better at solving the same problems that they could solve 6 months ago, and the same could be said comparing 6 months to 12 months ago.
The argument I'd dismiss isn't the improvement; it's that a whole load of economic factors, or use cases, have suddenly been unlocked in the last 6 months because of the improvements in LLMs.
That's kind of a fuzzier point, and a hard one to know until we all have hindsight. But I think OP is right that people have been claiming "LLMs are fundamentally in a different category to where they were 6 months ago" for the last 2 years - and so far, none of those big improvements has unlocked a whole new category of use cases for LLMs.
To be honest, it's a very tricky thing to weigh in on, because the claims being made around LLMs vary wildly, from "we're 2 months away from all disease being solved" to "LLMs are basically just a bit better than old-school Markov chains". I'd argue that clearly neither of those is true, but it's hard to get oriented when both of those claims are being made at the same time.
The improvement in LLMs has come in the form of more successful one-shots, more successful bug finding, more efficient code, and less time hand-holding the model.
"Problem solving" (which definitely has improved, but maybe has a spikey domain improvement profile) might not be the best metric, because you could probably hand hold the models of 12 months ago to the same "solution" as current models, but you would spend a lot of time hand holding.
> The argument I'd dismiss isn't the improvement; it's that a whole load of economic factors, or use cases, have suddenly been unlocked in the last 6 months because of the improvements in LLMs.
Yes, I agree in principle here, in some cases: I think there are certainly problems that LLMs are now better at but that don't reach the critical reliability threshold where you can say "it can do this". E.g. hallucinations, handling long context well (it's still best practice to reset the context window frequently), long-running tasks, etc.
> That's kind of a fuzzier point, and a hard one to know until we all have hindsight. But I think OP is right that people have been claiming "LLMs are fundamentally in a different category to where they were 6 months ago" for the last 2 years - and so far, none of those big improvements has unlocked a whole new category of use cases for LLMs.
This is where I disagree (but again you are absolutely right for certain classes of capabilities and problems).
- Claude Code did not exist until 2025
- We have gone from people using coding agents for maybe ~10% of their workflow to more like 90-100%, pretty typically. From code completion --> a reasonably good SWE (with caveats and pain points I know all too well). This is a big step change in what you can actually do; it's not that we're still only doing code completion and it's marginally better.
- Long-horizon task success rates have now gotten good enough to basically enable the above (a good SWE) for things like refactors, complicated debugging with competing hypotheses, looping attempts until success, etc.
- We have nascent UI agents now; they are fragile, but they will likely follow a similar path to coding agents, which opens up yet another universe of things you can only do through a UI
- Enterprise voice agents (for things like frontline support) now have a low enough bounce rate that you can actually deploy them
So we've gone from "this looks promising" to production deployment and very serious usage. This may be, as you say, "the same capabilities, just getting gradually better", but at some point that becomes a step change. Above a certain failure rate (which may be hard to pin down explicitly) it's not tolerable to deploy, and as evidenced by adoption alone, we've crossed that threshold, especially for coding agents. Even the jump from Sonnet 4 to Opus 4.5 has, for me personally (beyond just benchmark numbers), made full project loops possible, where Sonnet 4 would have convinced you it could do it and then wasted 2 whole days of your time banging your head against the wall. The same is still true for Opus 4.5, but only for much larger tasks.
> To be honest, it's a very tricky thing to weigh in on, because the claims being made around LLMs vary wildly, from "we're 2 months away from all disease being solved" to "LLMs are basically just a bit better than old-school Markov chains". I'd argue that clearly neither of those is true, but it's hard to get oriented when both of those claims are being made at the same time.
Precisely. Lots and lots of hyperbole, some of it with varying degrees of underlying truth. But I would say: the underlying reality here is reasonably easy to follow with hard numbers if you look hard enough. Epoch.ai is one of my favorite sources for industry analysis, and Dwarkesh Patel is a true gift to the industry. Benchmarks are really quite terrible and shaky, so I don't necessarily fault people for "checking the vibes"; something like Simon Willison's pelican task is exactly the sort of thing that's both fun and important!
> The lower-bound estimate represents 18 percent of the total reduction in man-hours in U.S. agriculture between 1944 and 1959; the upper-bound estimate, 27 percent
According to Wikipedia, the Ivel Agricultural Motor was the first successful lightweight gasoline-powered tractor. The year was 1903. You're like someone being dismissive in 1906 because "nothing has happened yet".
This is an article written by a company/LLM trying to justify huge increases to the pricing structure.
Oh! Y'know that thing we've been charging you $200 a month for? We're going to start charging you for the value we provide, and it will now be $5,000 a month.
Meanwhile, the metrics for "value" are completely gamed.
The price will be what you are willing to pay. No justification required, except for fairness (info asymmetry, and what else?). It is written by me: unfunded, bootstrapped; call it dire straits.
At the same time, I actually wouldn’t mind a world in which AI agents cost $5000 a month if that’s what companies want to charge.
I feel like at some level that would remove the possibility of making "just as good as humans but basically free" arguments and move the discussion in a direction that feels more productive: discussing the real benefits and shortcomings of both. E.g., loss of context with agents vs. HR costs with humans, etc.
If the AI does all the easy tickets, there's no easing in of new hires, so onboarding is going to be more expensive, so I'd better get a discount for that hit.
If there is zero slack and only the hardest parts are left, this is no longer the job it was before. Salaries will have to go up, or retention will go down. In addition, these jobs could already be awful when there was some slack; handing all the slack tasks to AI is going to make them miserable, so the average customer interaction, once it reaches a human agent, is probably going to be worse, and your customer satisfaction will take a hit. So I'd better get a discount for that reputational hit too.
It's like the 'have the AI pick the tomatoes it can, and the field worker the rest' idea. Picking the easy tomatoes is factored into the job. Having the AI pick the easy ones could break the whole model, or having zero slack for the workers could break them and result in no one showing up to jobs where the AI has done the easy picking.
You sound incredibly short-sighted. Yeah, slack, and making sure people don't just get unwinnable tickets all day, is important for retention. And if your company needs more than warm bodies reading a script, yeah, you account for it.
Most machinery you can't run at 100% capacity. Most machinery you can't run 24/7. You schedule load. You schedule downtime. And the higher the capacity, the more the machine costs. If you aren't aware of this for your people, you are failing at your job.
Not sure I follow, but the first paragraph is interesting.
You are saying employees stick around if they are given easy tickets, and companies care about passing along easy tickets so that warm bodies do not churn.
TL;DR: on-call manages acute issues, documents steps taken, and possibly farms out immediate work to subject-matter experts. Rate on-call based on the traces they leave behind. A separate on-call rotation, drawn from the same population but with a longer rotation window, handles fixes. Rate that rotation based on root-cause reoccurrence and general ticket-stats trendlines.
Longer reply:
I have on-call experience for major services (DynamoDB front door, CosmosDB storage, OCI LoadBalancer). I've seen a lot of different philosophies. My take:
1. on-call should document their work step by step in tickets and make changes to operational docs as they go: a ticket that just says "manual intervention, resolved" after 3 hours is useless. Documenting what's happening is actually your main job; if needed, the work to analyze/resolve acute issues can be farmed out
2. on-call is the bus driver and shouldn't be tasked with handling long-term fixes (or any other tasks beyond being on-call)
3. handover between on-calls is very important; it prevents accidentally dropping the ball on longer-horizon issues. Hold handover meetings
Probably the most controversial one: a separate rotation (with a longer window, e.g. 2 weeks) should handle RCA-related tasks and drive fixes to prevent reoccurrence.
Managers should not be first tier on any pager rotation: if you wouldn't approve pull requests, you shouldn't be on the rotation (other than as a second-tier escalation). The reverse should also hold: if you have the privilege to bless PRs, you should take your turn in the hot seat.
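To make the split concrete, here is the TL;DR structure as data (field names are illustrative, not any particular paging tool's schema):

    # Sketch of the two-rotation split described above; purely illustrative.
    from dataclasses import dataclass

    @dataclass
    class Rotation:
        name: str
        window_days: int
        responsibilities: list[str]
        rated_on: list[str]

    acute = Rotation(
        name="primary on-call",
        window_days=7,
        responsibilities=["mitigate acute issues",
                          "document every step in the ticket and ops docs",
                          "farm out deep analysis to subject-matter experts"],
        rated_on=["quality of the trace left behind"],
    )

    followup = Rotation(
        name="fix/RCA rotation",
        window_days=14,
        responsibilities=["root-cause analysis",
                          "drive fixes to prevent reoccurrence"],
        rated_on=["root-cause reoccurrence", "ticket-stats trendlines"],
    )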
I've been playing around with a limited PCB autorouter for mechanical keyboards. You feed it a KLE layout file, and it spits out a KiCad PCB, a layout map for QMK firmware, and various SVG cut files for case machining.
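Roughly, the first step is just turning the KLE rows into key positions, something like the sketch below (this only handles the basic "x"/"y"/"w" properties and skips the metadata block; the real KLE format has a lot more):

    # Minimal sketch of the KLE parsing step: raw KLE rows -> key centers in
    # key units. Only "x", "y", and "w" are handled; real KLE has much more.
    import json

    def parse_kle(path: str) -> list[tuple[float, float]]:
        with open(path) as f:
            rows = json.load(f)
        centers, y = [], 0.0
        for row in rows:
            x, w = 0.0, 1.0
            for item in row:
                if isinstance(item, dict):   # property block applies to the next key
                    x += item.get("x", 0.0)
                    y += item.get("y", 0.0)
                    w = item.get("w", 1.0)
                else:                        # a legend string -> emit one key
                    centers.append((x + w / 2, y + 0.5))
                    x += w
                    w = 1.0
            y += 1.0
        return centers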
The browser instance knows it is being driven by an automation agent. If you wanted to, you could comment out the code that sets that flag in the browser's source, but since this new setup will let the page check whether you compiled your own browser, they'll be able to fold an "isUnderAutomation" flag into the attestation data. And that's sealed, because you can't build your own browser and have it attest.
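For what it's worth, the web platform already has a standard flag of this kind: navigator.webdriver is supposed to be true whenever the browser is under remote control, and any page can read it today; attestation would just make it impossible to strip. A quick way to see it (Playwright used here only as an example driver):

    # Shows the existing automation signal a page can already read; the point
    # above is that attestation would make it unforgeable.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")
        # Per the spec, true when the user agent is under remote control.
        print(page.evaluate("navigator.webdriver"))
        browser.close()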
It’s always this tired argument. “But it’s so much better than six months ago, if you aren’t using it today you are just missing out.”
I’m tired of the hype, boss.