I personally care deeply when content intended as communication is AI generated (much more so than if code is generated).
On the surface level, I find it a bit disrespectful when I'm communicating with someone who's just using an LLM to generate their responses. Imagine if you are talking to someone in-person, and they pull out a phone, generate a response then read it back out to you?
On a deeper level, if someone's generated a bunch of text and clearly hasn't devoted the time into generating/editing it that they're expecting me to invest while reading it, I'm just not going to read it.
Yes it's very obviously written by AI and made me immediately close the tab. Not gonna read a self-promotional piece written by an LLM that someone probably only gave it one sentence prompt: "merge these ideas".
I don’t think anyone serious would recommend it for serious production systems. I respect the Ralph technique as a fascinating learning exercise in understanding llm context windows and how to squeeze more performance (read: quality) from today’s models
Even if in the absolute the ceiling remains low, it’s interesting the degree to which good context engineering raises it
How is it a “fascinating learning exercise” when the intention is to run the model in a closed loop with zero transparency. Running a black box in a black box to learn? What signals are you even listening to to determine whether your context engineering is good or whether the quality has improved aside from a brief glimpse at the final product. So essentially every time I want to test a prompt I waste $100 on Claude and have it an entire project for me?
I’m all for AI and it’s evident that the future of AI is more transparency (MLOPs, tracing, mech interp, AI safety) not less.
there is the theoretical "how the world should be" and there is the practical "what's working today" - decry the latter and wait around for the former at your peril
I read it. i agree this is out of touch. Not because the things its saying are wrong, but because the things its saying have been true for almost a year now. They are not "getting worse" they "have been bad". I am staggered to find this article qualifies as "news".
If you're going to write about something that's been true and discussed widely online for a year+, at least have the awareness/integrity to not brand it as "this new thing is happening".
The models are gotten very good, but I rather have an obviously broken pile of crap that I can spot immediately, than something that is deep fried with RL to always succeed, but has subtle problems that someone will lgtm :( I guess its not much different with human written code, but the models seem to have weirdly inhuman failures - like, you would just skim some code, cause you just cant believe that anyone can do it wrong, and it turns out to be.
Well, for some reason it doesnt let me respond to the child comments :(
The problem (which should be obvious) is that with a/b real you cant construct an exhaustive input/output set. The test case can just prove the presence of a bug, but not its absence.
Another category of problems that you cant just test and have to prove is concurrency problems.
Of course you can. You can write test cases for anything.
Even an add_numbers function can have bugs, e.g. you have to ensure the inputs are numbers. Most coding agents would catch this in loosely-typed languages.
engineers always want to re write from scratch and it never works.
a tale as old as time - my second job out of college back in like 2016, I landed at the tail end of a 3-month feature-freeze refactor project. was pitched to the CEO as 1-month, sprawled out to 3 months, still wasn't finished. Non-technical teams were pissed, technical teams were exhausted, all hope was lost. Ended up cutting a bunch of scope and slopping out a bunch of bugs anyway.
i had the privilege of working w/ some incredible eng leaders at my previous gig - they were very good at working both upwards and downwards to execute against the "50/50" rule - half of any given sprint's work is focused on new features, and half is focused on bug fixes, chores, things that improve team velocity.
I think the key here is “if X then Y syntax” - this seems to be quite effective at piercing through the “probably ignore this” system message by highlighting WHEN a given instruction is “highly relevant”
agree - i've had claude one-shot this for me at least 10 times at this point cause i'm too lazy to lug whatever code around. literally made a new one this morning
just because you don't type out the characters doesn't mean you're not designing systems and thinking critically and leveraging your experience.
also: do we think this is written by ai? do we care anymore?
reply