Thanks Max! This was a really interesting article and closely matches my own experience with how the agents have been progressing
one of the takeaways I get when reading skilled engineers' experiences with these tools is that they essentially offer leverage, and the more skill someone already has the higher their ceiling will be
i feel similarly. suppose ai makes people more productive:
1. companies that are not doing well (slow growth, losing to competitors, etc.) or that are monopolies under pressure to cut costs in the short term are going to use the added productivity to reduce their opex
2. companies that are doing well (growth, in competitive markets) will get even more work done and can't hire enough people
my hunch is block is not doing as well as they seem to be
obviously he's going to posture his company as growing and doing well, but clearly not enough for the board and shareholders given their headcount growth from zirp
some companies are in the position to go for moonshots and block hasn't panned out
They did not; you get the same date range and the same graph shape by going to FRED and pressing the "1Y" option, and the series includes the first two months of 2026, so it's 12 months: https://fred.stlouisfed.org/graph/?g=1SGzm
However, the chart settings were actually modified to hide/deemphasize the earlier decline: the index base date was changed. Their graph uses 2025-02-20=100; with the default of 2020-02-01=100, the chart would start at 64 and rise to 71.44.
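The effect of moving the base date is just a rescaling. A minimal sketch (with illustrative values, not actual FRED data) of how rebasing the same series changes the story the chart tells:

```python
# Sketch: re-indexing (rebasing) a {date: value} series so a chosen
# base date equals 100. Values below are illustrative, not real FRED data.

def rebase(series, base_date):
    """Scale the series so the value at base_date equals 100."""
    base = series[base_date]
    return {d: v / base * 100 for d, v in series.items()}

# Hypothetical index levels (2020-02-01 = 100 is the FRED default base here).
raw = {
    "2020-02-01": 100.0,
    "2025-02-20": 64.0,    # chart starts here under the default base
    "2026-02-01": 71.44,   # ...and ends here
}

default_base = rebase(raw, "2020-02-01")
shifted_base = rebase(raw, "2025-02-20")

# Under the default base, the last year reads 64 -> 71.44 (still well below 100).
# Rebased to 2025-02-20, the identical data reads 100 -> ~111.6, so the
# earlier decline disappears from view.
print(round(default_base["2026-02-01"], 2))  # 71.44
print(round(shifted_base["2026-02-01"], 2))  # 111.62
```

Same data, same shape; only the denominator changed, which is why the rebased chart looks like modest growth instead of a partial recovery.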
Sure, I assumed status quo everyone is talking about is basically the several years before that graph. I still think it's relatively bad compared to that despite the modest improvement.
What's not shown in a graph of job postings is the demand side. With all the layoffs, out of work college grads, people staying put in jobs they are unhappy with, etc., I'd wager that demand per job is still at a historically high level compared to what we have been accustomed to
That's the most recent time. But I've bounced around all the LLMs - they're all superficially amazing. But if you understand their output, it's often wrong in both subtle and catastrophic ways.
As I said, maybe I'm wrong. I hope you have fun using them.
Yes. And, again, they look amazing and make you feel like you're 10x.
But then I look at the code quality, hideous mistakes, blatant footguns, and misunderstood requirements and realise it is all a sham.
I know, I know. I'm holding it wrong. I need to use another model. I have to write a different Soul.md. I need to have true faith. Just one more pull on the slot machine, that'll fix it.
"CEO Dario Amodei predicted last March that in six months AI would be writing 90% of code, and when that didn’t happen"
I mean, a lot of developers have 90% of their code being written by AI (myself and my friends at the labs included). Obviously YMMV depending on your codebase and individual skill.
"Software engineers will at times overestimate their capabilities, as demonstrated by the METR study that found that developers believed they were 24% faster when using LLMs, when in fact coding models made them 19% slower.
This, naturally, makes them quite defensive of the products they use, and whether or not they’re actually seeing improvements."
I wonder what he thinks about the new METR update that showed a net speedup as a lower bound (due to participants literally not wanting to even tackle tasks with AI due to how slow it would be), with the returning devs having the greatest improvements in speedup?
"for one of Anthropic’s greatest lies: that AI can “work uninterrupted” for periods of time, leaving the reader or listener to fill in the (unsaid) gap of “...and actually create useful stuff.”"
We're probably at the beginning of the S curve for long-running tasks that create useful stuff (https://ladybird.org/posts/adopting-rust/) but it clearly needs hand-holding and a way to self-verify work.
"No amount of DarioMath about how a model “costs this much and makes this much revenue” changes the fact that profitability is when a company makes more money than it spends."
Feels like he's being dishonest here because the economics of the labs are unique (and precarious). Each model is profitable on its own (revenue minus the cost to train and serve it). Labs invest in the next model to maintain their advantage, because otherwise people will stop using their latest models. This probably doesn't go on in perpetuity (which is what Ed should've analyzed more). To his credit, he's right that CC subscriptions are currently being subsidized.
[Insert quotes of Dario saying models will be smarter than most humans or Nobel laureates]
I mean, he's not wrong in certain definitions of "smart". They're already well above the average human in terms of testable world knowledge, math, coding, science, etc... but obviously fall short in other ways compared to humans.
Really interesting updates to their 2025 experiment.
Repeat devs from the original experiment went from a 0-40% slowdown to a -10% to 40% speedup - and METR estimates this as a 'lower bound'
more devs saying they don't even want to do 50% of their work without AI, even for 50/hr
30-50% of devs decided not to submit certain tasks without AI, missing the tasks with the highest uplift
it also seems like there is a skill gap - repeat devs from the first study are more productive with ai tools than newly recruited ones with variable experience
overall it seems like the high preference for devs to use AI is actually hurting METR's ability to judge their speedup, due to a refusal to do tasks without it. imo this is indirectly quite supportive for ai coding's productivity claims.
The finding of the first study was that people cannot judge their performance with these tools. So I don't think individuals' unwillingness to work without them is indicative of productivity improvements. I think it's indicative of the tools being enjoyable to use.
It was claimed to find that, but I don't think it did. It compared developers' beliefs about average speedup across tasks, measured by asking them once at the end, against the per-task comparative speed measurements averaged together. Those are two different things, and all kinds of factors could mess up developers' fuzzy recollection of the gestalt of several tasks (such as recency bias and question/study framing) that wouldn't affect them if you asked right after each task. Moreover, when the results were broken down by task type, the speedup/slowdown findings actually matched developers' qualitative comments.
There are some people participating in the study who will fire & forget instructions to Claude/Codex running in parallel worktrees, but would really struggle if they were required to work on their project without AI assistance.
So while some study participants probably are seeing an actual speedup because of the discipline with which they manage their codebase's structure & documentation, other study participants are actually getting worse at non-AI coding.
...and METR's study can't tell which is which because METR's study isn't using any sort of codebase quality metrics for grounding.
surprised nobody responded with the most straightforward, Occam's razor explanation
they think what they're doing is actually good for society
not everyone is in the hackerspace libertarian / socialist sphere
i used to work for a place that used persona despite it adding extra friction to signups (literally resulting in fewer paying customers, to the dismay of PMs) because it was worth it to combat fraud. there's a tradeoff in everything