I'm getting a lot of side-quest productivity out of AI. There's always a bunch of things I could do, but they are tedious. Yet they are still things I wish I could get done. Those kinds of things AI is fantastic at. Building a mock, making tests, abstracting a few things into libraries, documentation.
So it's not like I'm delivering features in one day that would have taken two weeks. But I am delivering features in two weeks that have a bunch of extra niceties attached to them. Reality being what it is, we often release things before they are perfect. Now things are a bit closer to perfect when they are released.
I hope some of that extra work that's done reduces future bug-finding sessions.
What I'm about to discuss is about me, not you. I have no idea what kind of systems you build, what your codebase looks like, use case, business requirements etc. etc. etc. So it is possible writing tests is a great application for LLMs for you.
In my day-to-day work... I wish the developers where I work would stop using LLMs to write tests.
The most typical problem with LLM-generated tests on the codebase where I work is that the test code is almost always extremely tightly coupled to the implementation code. Heavy use of test spies is a common anti-pattern. The result is a test suite that tests implementation details rather than "user-facing" behaviour (where the user could be a code-level consumer of the thing you are testing).
The problem with that type of test is that it is fragile. One of the key benefits of automated tests is that they give you a safety net to refactor the implementation to your heart's content without fear of having broken something. If you change an implementation detail and the "user-facing" behaviour does not change, your tests should pass. When tests are tightly coupled to the implementation, they will fail, and now your tests, in the worst of cases, might actually be creating negative value for you, since every code change requires you to keep the tests up to date even when what you actually care about testing ("is this thing working correctly?") hasn't changed.
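To make the contrast concrete, here is a minimal sketch in Python (pytest-style, with unittest.mock). The ShoppingCart and pricing names are invented purely for illustration, not taken from any codebase discussed here; the point is only the difference between asserting on internal calls and asserting on observable behaviour through the public interface.

```python
from unittest.mock import MagicMock


class ShoppingCart:
    """Toy class invented for illustration."""

    def __init__(self, pricing):
        self._pricing = pricing
        self._items = []

    def add(self, sku, qty=1):
        self._items.append((sku, qty))

    def total(self):
        return sum(self._pricing.price_for(sku) * qty for sku, qty in self._items)


def test_total_fragile():
    # Couples to an implementation detail: that total() calls price_for
    # exactly once per item. Batching or caching lookups later breaks this
    # test even though total() still returns the right number.
    pricing = MagicMock()
    pricing.price_for.return_value = 10
    cart = ShoppingCart(pricing)
    cart.add("apple", 3)
    assert cart.total() == 30
    pricing.price_for.assert_called_once_with("apple")


class FakePricing:
    """Plain fake standing in for the real pricing dependency."""

    def price_for(self, sku):
        return {"apple": 10, "pear": 20}[sku]


def test_total_behavioural():
    # Only checks the user-facing result through the public interface,
    # so internal refactors that preserve behaviour keep it green.
    cart = ShoppingCart(FakePricing())
    cart.add("apple", 3)
    cart.add("pear", 1)
    assert cart.total() == 50
```

The first test breaks the moment total() changes how it looks up prices, even though the behaviour anyone cares about is unchanged; the second keeps passing.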
The root of this problem isn't even the LLM; the LLM just makes it a million times worse. Developers often feel like writing tests is a menial chore that needs to be done after the fact to satisfy a code coverage policy. Few developers, at many organizations, have ever truly worked TDD, learned testing best practices, or learned how to write easily testable implementation code.
There are some patterns you can use that help a bit with this problem. The lowest-hanging fruit is to tell the LLM that its tests should test only through public interfaces where possible. Next after that is to add a step to the workflow: check whether the not-yet-committed tests use any non-public interfaces in places where a public interface exposes the same functionality, and if so, rewrite the tests to use only publicly exposed interfaces. You could likely also add linter rules, though sometimes you genuinely need to test something, like error conditions, that can't reasonably be tested only through public interfaces.
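If you want something mechanical in CI rather than relying on the LLM to police itself, a rough sketch of such a check might look like this (Python; it assumes tests live under tests/ and that "non-public" means underscore-prefixed module names, both of which are conventions I'm assuming rather than rules from any particular codebase):

```python
# Rough sketch of a CI guard: flag test files that import underscore-prefixed
# ("private") modules. A real rule would also need an allowlist for the
# genuine exceptions mentioned above (e.g. error conditions).
import ast
import pathlib
import sys


def private_imports(test_file: pathlib.Path) -> list[str]:
    """Return underscore-prefixed module names imported by a test file."""
    tree = ast.parse(test_file.read_text())
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module:
            if any(part.startswith("_") for part in node.module.split(".")):
                hits.append(node.module)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if any(part.startswith("_") for part in alias.name.split(".")):
                    hits.append(alias.name)
    return hits


if __name__ == "__main__":
    offenders = {}
    for path in pathlib.Path("tests").rglob("test_*.py"):
        hits = private_imports(path)
        if hits:
            offenders[str(path)] = hits
    for path, modules in offenders.items():
        print(f"{path}: imports private module(s): {', '.join(modules)}")
    sys.exit(1 if offenders else 0)
```

That only covers Python and only the import side of the problem, but it gives reviewers a cheap signal before a human even looks at the LLM-generated tests.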
Oh don't get me wrong. I'm sure that an LLM can write a decent test that doesn't have the problems I described. The problem is that LLMs are making a preexisting problem much, MUCH worse.
That problem statement is:
- Not all tests add value
- Some tests can even create dis-value (ex: slow to run, thus increasing CI bills for the business without actually testing anything important)
- Few developers understand what good automated testing looks like
- Developers are incentivized to write tests just to satisfy code coverage metrics
- Therefore writing tests is a chore and an afterthought
- So they reach for an LLM because it solves what they perceive as a problem
- The tests run and pass, and they are completely oblivious to the anti-patterns just introduced and the problems those will create over time
- The LLMs are generating hundreds, if not thousands, of these problems
So yeah, the problem is 100% the developers who don't understand how to evaluate the output of a tool that they are using.
But unlike functional code, these tests are - in many cases - arguably creating disvalue for the business. At least the functional code is a) more likely to be reviewed and code quality problems addressed and b) even if not, it's still providing features for the end user and thus adding some value.
Force the LLM to write property-based tests (whether good libraries are available depends on the language you use, but if they are, 100% make use of them). Iterate with the LLM on the invariants.
Forcing the discussion of invariants, and property-based testing in general, seems to improve on the issues you're mentioning (when using e.g. Opus 4), especially when combined with the "use only the public API" rule or interface abstractions.
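For what that can look like in practice, here is a hedged sketch in Python using the Hypothesis library; normalize_whitespace is an invented function standing in for whatever unit you are actually testing, and the invariants are the kind of thing you would iterate on with the LLM:

```python
import re

from hypothesis import given, strategies as st


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and strip the ends."""
    return re.sub(r"\s+", " ", text).strip()


@given(st.text())
def test_normalization_is_idempotent(text):
    # Invariant: applying the function twice gives the same result as once.
    once = normalize_whitespace(text)
    assert normalize_whitespace(once) == once


@given(st.text())
def test_no_double_spaces_and_no_edge_whitespace(text):
    # Invariants about the output itself, independent of the input.
    result = normalize_whitespace(text)
    assert "  " not in result
    assert result == result.strip()
```

The value is less in the specific assertions and more in the fact that the LLM (and you) have to articulate properties that hold for all inputs, which naturally pulls the tests toward behaviour and away from implementation details.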
Side-quest productivity is a great way to put it... It does feel like AI effectively enables the opposite of "death by a thousand cuts" (life by a thousand bandaids?)
For much of what I build with AI, I'm not saving two weeks. I'm saving infinity weeks — if LLMs didn't exist I would have never built this tool in the first place.