That’s a great point! While we do our best to simulate an identical case, if the agent responds differently, our focus is on whether the key evaluator or goal for that replay set passes or fails. We use that as the source of truth and flag the exact moment where the conversation diverges from the expected flow.
We help capture discrepancies between your production transcription model’s output and a more accurate reference transcript so you can calculate a Word Error Rate (WER) as part of your evaluation process. Post-call transcription tends to be more accurate, and we’ve seen teams do this manually by hiring humans to label a dataset and then testing against it for WER calculations.
By providing a more accurate baseline, Roark helps teams quantify how well their production transcriptions match reality and flag cases where the model is introducing errors that could impact downstream agent performance. That way, you’re not just testing if your agent responds correctly, but whether it’s getting the right inputs in the first place.
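For anyone unfamiliar, WER is just word-level edit distance (substitutions + deletions + insertions) divided by the length of the reference transcript. A rough, generic sketch (nothing Roark-specific) looks like this:

```python
# Generic WER sketch (not Roark's implementation): word-level Levenshtein
# distance between a reference transcript and a hypothesis, divided by the
# number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                    dp[i - 1][j - 1],  # substitution
                )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: reference from a human-labeled or post-call transcript,
# hypothesis from the real-time production transcription.
print(word_error_rate("book me for tuesday at noon",
                      "book me for tuesday at new"))  # 1/6 ≈ 0.167
```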
Great question! Most teams deploying voice AI do have testing and QA flows, but they’re often manual, brittle, or incomplete. Unlike traditional software, voice agents don’t have structured inputs and outputs — users phrase things unpredictably, talk over the bot, or express frustration in subtle ways.
Some engineering teams try to build internal testing frameworks, but it’s a massive effort - they have to log and store call data, build a replay system, define evaluation criteria, and continuously update it as the AI evolves. Most don’t want to spend engineering time reinventing the wheel when they could be improving their AI instead.
The teams that benefit most from Roark are the ones with strong QA processes — they already know how critical testing is, but they’re stuck with brittle, time-consuming, or incomplete workflows.
Great question! There’s been a lot of movement in this space, but most existing solutions focus on simulation-based testing—generating synthetic test cases or scripted evaluations.
Roark takes a different approach: we replay real production calls against updated AI logic, preserving actual user inputs, tone, and timing. This helps teams catch failures that scripted tests miss—especially in high-stakes industries like healthcare, legal, and finance, where accuracy and compliance matter.
Beyond replays, we provide rich analytics, sentiment & vocal cue detection, and automated evaluations, all based on audio—not just transcripts. This lets teams track frustration, long pauses, and interruptions that often signal deeper issues.
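To make the replay idea a bit more concrete, here’s a rough, hypothetical sketch of a replay-plus-evaluator loop. All names and APIs below are illustrative, not Roark’s, and a text-only example necessarily leaves out the audio, tone, and timing that real replays preserve:

```python
# Hypothetical sketch of replay-based testing (illustrative only): feed the
# recorded user turns from a real call into a new agent version, check whether
# a goal-level evaluator still passes, and flag the first point of divergence.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class RecordedTurn:
    user_text: str        # what the caller actually said
    user_audio_path: str  # original audio, so tone/timing could be preserved

@dataclass
class ReplayResult:
    passed: bool
    diverged_at: Optional[int]  # index of the first turn where behavior changed
    transcript: List[str]

def replay_call(
    recorded_turns: List[RecordedTurn],
    original_responses: List[str],
    agent: Callable[[str], str],             # new agent version under test
    evaluator: Callable[[List[str]], bool],  # goal check, e.g. "appointment booked"
) -> ReplayResult:
    transcript: List[str] = []
    diverged_at: Optional[int] = None
    for i, turn in enumerate(recorded_turns):
        response = agent(turn.user_text)
        transcript.append(response)
        # Naive exact-match divergence check, purely for illustration.
        if diverged_at is None and response != original_responses[i]:
            diverged_at = i
    return ReplayResult(passed=evaluator(transcript),
                        diverged_at=diverged_at,
                        transcript=transcript)
```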
Would love to hear more about your assistant - how are you thinking about testing and iteration?
We are presently evaluating at least two of the options mentioned above, and the pricing for 1000 minutes at both comes out to well under 10% of the cost of your currently listed rate. I know you've probably been told not to compete on price, but this is a space where I think it's hard to compete on quality yet. As for the other analysis features, I think you're going to find yourself locked into a commoditised feature race that you're currently 6 months behind on.
What nobody is doing well at the moment is effective prompt versioning, comparison, and deployment via git or similar. This would be a killer feature for us, but nobody is close to having shipped it, from what I can see.
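To illustrate what the commenter is describing, a minimal git-backed approach might treat prompts as plain files, versions as commits or tags, and comparison as a diff between refs. Everything below, including the repo layout, refs, and file names, is hypothetical:

```python
# Hypothetical sketch: prompts live as plain-text files in a git repo, so a
# "version" is just a commit (or tag) plus a path. Comparison is `git diff`,
# and "deployment" is pinning the agent to a specific ref.
import subprocess

def load_prompt(repo_path: str, ref: str, prompt_file: str) -> str:
    """Read a prompt file as it existed at a given git ref (tag, branch, or SHA)."""
    return subprocess.run(
        ["git", "-C", repo_path, "show", f"{ref}:{prompt_file}"],
        capture_output=True, text=True, check=True,
    ).stdout

def diff_prompts(repo_path: str, ref_a: str, ref_b: str, prompt_file: str) -> str:
    """Show what changed in a prompt between two refs."""
    return subprocess.run(
        ["git", "-C", repo_path, "diff", ref_a, ref_b, "--", prompt_file],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    # e.g. compare the prompt pinned in production (a tag) against main
    print(diff_prompts(".", "prod-2024-05-01", "main", "prompts/intake_agent.txt"))
```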
Appreciate the feedback! Completely agree - authentication should be handled at the system level, not just in prompts. This demo is meant to showcase how teams can build test cases from real failures and ensure fixes work before deployment. We’ll consider using a better example.
Appreciate the feedback! To clarify, Roark isn’t handling authentication itself - it’s a testing and observability tool to help teams catch when their AI fails to follow expected security protocols (like verifying identity before sharing sensitive info).
That said, totally fair point that this example could be clearer—we’ll keep that in mind for future demos. Thanks for calling it out!