I don’t have much in the way of tests right now but I am building with Typescript and Rust so that catches many basic bugs.
I don’t find the issue to be breaking other parts of the app, more-so that new features don’t work as advertised by Claude.
One of my takeaways here is that I should give Claude an integration test harness and tell it that it must finish running that successfully before committing any code.
I'm trying to prototype extremely quickly and I'm working on my project alone so yes, often I accept PRs without looking too closely at the code if my local testing succeeds.
I'm using Typescript and Rust and I think it's critical to use strict typing with LLMs to catch simple bugs.
I've worked at Uber as an infra engineer and at Gem as an engineering manager so I do consider myself an "actual professional developer". The critical bit is the context of the project I'm working on. If I were at a tech company building software, I'd be much more reticent to ship AI generated PRs whole cloth.
I don't have a ton of tests. From what I've seen, Claude will often just update the tests to no-op so tests passing isn't trustworthy.
My workflow is often to plan with ChatGPT and what I was getting at here is ChatGPT can often hallucinate features of 3rd party libraries. I usually dump the plan from ChatGPT straight into Claude Code and only look at the details when I'm testing.
That said, I've become more careful in auditing the plans so I don't run in to issues like this.
Tell Claude to use a code review sub agent after every significant change set, tell them to run the tests and evaluate the change set, don't tell Claude it wrote the code, and give them strict review instructions. Works like a charm.
Yes. Go on ChatGPT, explain what you're doing (claude code, trying to get it to be more rigorous with itself and reduce defects) then click deep research and tell it you'd like it to look up code review best practices, AI code review, smells/patterns to look out for in AI code, etc. Then have it take the result of that and generate a XML structured document with a flowchart of the code review best practices it discovered, cribbing from an established schema for element names/attributes when possible, and put it in fenced xml blocks in your subagent. You can also tell claude code to do deep research, you just have to be a little specific about what it should go after.
MCP up Playwright, have a detailed spec, and tell claude to generate a detailed test plan for every story in the spec, then keep iterating on a test -> fix -> ... loop until every single component has been fully tested. If you get claude to write all the components (usually by subfolder) out to todos, there's a good chance it'll go >1 hour before it tries to stop, and if you have an anti-stopping hook it can go quite a bit longer.
It takes an incredibly detailed spec to get an LLM to not go completely off the rails and even then. The amount of time writing that spec can take more time than just doing it by hand.
There is way too much babysitting with these things.
I’m sure somehow somebody makes it work but I’m incredibly skeptical that you can let an LLM run unsupervised and only review its output as a PR.
> The amount of time writing that spec can take more time than just doing it by hand.
one thing about doing it by hand is you also notice holes/deficiencies in the spec and can go back and update it, make the product better, but just throwing it to an llm 'til its perfect-to-spec probably means its just going to be average quality at best...
tho tbh most software isn't really 'stunning' imo so maybe thats fine as far as most businesses are concerned... (sad face)
Can you elaborate on what you mean by anti stopping hook? Sometime I take breaks, go on walks, etc and it would be cool of Claude tried different things and even branches etc that I could review when back.
Basically, all LLMs are "lazy" to some degree and are looking for ways to terminate responses early to conform to their training distribution. As a result, sometimes an agent will want to stop and phone home even if you have multiple rows of all caps saying DO NOT STOP UNTIL YOUR ENTIRE TODO LIST IS COMPLETE (seriously). Claude code has a hook for when the main agent and subagents try to stop, and you can reject their stop attempt with a message. They can still override that message and stop but the change of turn and the fresh "DO NOT STOP ..." that's at the front of context seem to keep it revving for a long time.
Coding agents are the future and it's anyone's game right now.
The main reason I think there is such a proliferation is it's not clear what the best interface to coding agents will be. Is it in Slack and Linear? Is it on the CLI? Is it a web interface with a code editor? Is it VS Code or Zed?
Just like everyone has their favored IDE, in a few years time, I think everyone will have their favored interaction pattern for coding agents.
Product managers might like Devin because they don't need to setup an environment. Software engineers might still prefer Cursor because they want to edit the code and run tests on their own.
Cursor has a concept of a shadow workspace and I think we're going to see this across all coding agents. You kick off an async task in whatever IDE you use and it presents the results of the agent in an easy to review way a bit later.
As for Void, I think being open source is valuable on it's own. My understanding is Microsoft could enforce license restrictions at some point down the road to make Cursor difficult to use with certain extensions.
I don’t find the issue to be breaking other parts of the app, more-so that new features don’t work as advertised by Claude.
One of my takeaways here is that I should give Claude an integration test harness and tell it that it must finish running that successfully before committing any code.