I tried LLM-based testing, dubbed it "agentic tests", and it worked quite well:
The idea was to use Stagehand [1] as the testing framework and integrate it with Linear, which is our ticketing system. During a hackathon I whipped something together: first the 'agent' read a UAT from Linear, then it passed this into a quite heavily prompted Stagehand. The prompts instructed Stagehand to run the UAT to the best of its ability and take very structured notes on each step, including what failed. Once the Stagehand process was done, the 'agent' reported back into Linear which steps succeeded and which failed.
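A minimal sketch of that read-run-report loop, in TypeScript. This is an illustration, not the hackathon code: every type and function name here (`UatTicket`, `StepRunner`, `runUat`, `formatReport`) is hypothetical, and the actual Stagehand and Linear calls are left out on purpose. The browser agent is passed in as a plain function so nothing about either SDK is assumed.

```typescript
// Hypothetical shapes; none of these names come from the Stagehand or Linear SDKs.
type StepStatus = "passed" | "failed";

interface StepResult {
  step: string;
  status: StepStatus;
  note: string; // structured note the agent wrote for this step
}

interface UatTicket {
  id: string;      // ticket identifier in the ticketing system
  steps: string[]; // UAT steps in plain language
}

// Stand-in for the prompted browser agent: takes one step, says whether it worked.
type StepRunner = (step: string) => Promise<{ ok: boolean; note: string }>;

// Run every step in order and record a result for each one.
// A runner that throws is recorded as a failed step instead of aborting the run.
async function runUat(ticket: UatTicket, runStep: StepRunner): Promise<StepResult[]> {
  const results: StepResult[] = [];
  for (const step of ticket.steps) {
    try {
      const { ok, note } = await runStep(step);
      results.push({ step, status: ok ? "passed" : "failed", note });
    } catch (err) {
      results.push({ step, status: "failed", note: `runner threw: ${String(err)}` });
    }
  }
  return results;
}

// Turn the results into a plain-text comment that could be posted back on the ticket.
function formatReport(ticket: UatTicket, results: StepResult[]): string {
  const failed = results.filter((r) => r.status === "failed").length;
  const header = `UAT ${ticket.id}: ${results.length - failed}/${results.length} steps passed`;
  const lines = results.map((r, i) => `${i + 1}. [${r.status}] ${r.step}: ${r.note}`);
  return [header, ...lines].join("\n");
}
```

Keeping the runner as an injected function means the loop and the report format can be checked with a fake runner, without a browser or a ticketing account.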
Fundamentally the idea was sound, but there were some limitations in both the Linear SDK and in Stagehand. With some better tooling (or a novel system), I predict this sort of agentic testing will work very well, especially for exploratory testing, where the agent may be prompted to act like either a 90-year-old grandma or a 16-year-old turbogamer. Privacy-safe usage heatmaps could also be generated automatically to test out the UX, as each run took a slightly different approach to completing the UATs.
Testing teams need not apply!
[1]: https://www.stagehand.dev/