Cool! Before building a full test platform for testdriver.ai we made a similar SDK called Goodlooks. It didn't get much traction, but I'll leave it here for those interested:
https://github.com/testdriverai/goodlooks
TestDriver.ai | https://testdriver.ai | QA Engineers, Sales | Austin / Remote | Full-time / Part-time
95% of companies are still wasting time manually testing due to shortcomings in Playwright, Cypress, and other frameworks. Developers rank testing as the #1 blocker to release.
We've built an AI Agent that performs manual testing on its own VM with complete desktop access. It works like a specialized "Claude Computer Use."
We're scaling our early sales and seeking QA engineers, customer support, and sales engineers.
- A list of applications that are open
- Which application has active focus
- What is focused inside the application
- Function calls to specifically navigate those applications, as many as possible
We’ve found the same thing while building the client for testdriver.ai. This info is in every request.
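As a rough sketch, that per-request context might be modeled like this (the class and field names here are hypothetical illustrations, not testdriver.ai's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class DesktopContext:
    """Hypothetical desktop state attached to every request to the agent."""
    open_applications: list[str]                 # all applications currently open
    focused_application: str                     # which application has active focus
    focused_element: str                         # what is focused inside that application
    # navigation actions the agent may call, keyed by application
    available_actions: dict[str, list[str]] = field(default_factory=dict)

ctx = DesktopContext(
    open_applications=["Chrome", "Terminal", "Slack"],
    focused_application="Chrome",
    focused_element="address bar",
    available_actions={"Chrome": ["open_tab", "navigate", "click_link"]},
)
assert ctx.focused_application in ctx.open_applications
```

The point is that focus state and the list of callable actions travel together with every request, so the model never has to guess what it can act on.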
Yes, you are correct that it rests entirely on the reputation of the AI.
This discussion leads to an interesting question, which is "what is quality?"
Quality is determined by perception. If we can agree that an AI is acting like a user and it can use your website, we can assume that a user can use your website and therefore it is "quality".
For more, read "Zen and the Art of Motorcycle Maintenance"
This comes up all the time. It seems like it would be possible, but imagine the case where you want to verify that a menu shows on hover. Was the hover on the menu intentional?
Another example, imagine an error box shows up. Was that correct or incorrect?
So you need to build a "meta" layer, which includes UI, to start marking up the video and end up in the same state.
Our approach has been to let the AI explore the app and come up with ideas. Less interaction from the user.
From my experience working on a B2B enterprise app, users sometimes hit weird scenarios: a feature with X turned on or off, combined with a specific edition (country).
Maybe the GPT could surf the user activity logs or crash logs and reproduce those scenarios as test cases.
I predict an all-out war over deterministic vs non-deterministic testing, or at least a new buzzword for fuzzy testing. Product people understand that a cookie banner "shouldn't" prevent the test from passing, but an engineer would entirely disagree (see the rest of the convos below).
Engineers struggle with non-deterministic output. It removes the control and "truth" that engineering is founded upon. It's going to take a lot of work (or again, a tongue-in-cheek buzzword like "chaos testing") to get engineers to accept the non-deterministic behavior.
An average typing speed is about 40 wpm, while an average conversation runs 120-150 wpm, so roughly 3-4x the bandwidth.
Calls also offer sub-second latency and maximum priority.
When you add video and audio in there, the pure amount of data transferred is higher.
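A quick back-of-the-envelope check of that 3-4x figure (just the arithmetic from the figures above):

```python
# Rough bandwidth comparison: typing vs. speaking, in words per minute.
typing_wpm = 40
speech_wpm_low, speech_wpm_high = 120, 150

ratio_low = speech_wpm_low / typing_wpm    # 120 / 40 = 3.0
ratio_high = speech_wpm_high / typing_wpm  # 150 / 40 = 3.75

print(f"speech carries {ratio_low:.1f}x to {ratio_high:.2f}x more words than typing")
# → speech carries 3.0x to 3.75x more words than typing
```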