Hey all! Max and Matt here from Talc AI. We do automated QA for anything built on top of an LLM. Check out our demo:
https://talc.ai/demo

We’ve found that it's very difficult to know how well LLM applications (and especially RAG systems) are going to work in the wild. Many companies tackle this by having developers or contractors run tests manually. It’s a slow process that holds back development, and often results in unexpected behavior when the application ships.
We’ve dealt with similar problems before; Max was a staff engineer working on systematic technical solutions for privacy problems at Facebook, and Matt worked on ML ops on Facebook’s election integrity team, helping run classifiers that handled trillions of data points. We learned that even the best predictive systems need to be deeply understood and trusted to be useful to product teams, and we set out to build the same understanding in AI.
To solve this, we take ideas from academia on how to benchmark the general capabilities of language models, and apply them to generating domain specific test cases that run against your actual prompts and code.
Consider an analogy: if you’re a lawyer, we don’t need to be lawyers to open up a legal textbook and test your knowledge of its contents. Similarly, if you’re building a legal AI application, we don’t need to build your application to come up with an effective set of tests that can benchmark its performance.
To make this more concrete: when you pick a topic in the demo, we grab the associated Wikipedia page and extract a bunch of facts from it using a classic NLP technique called “named entity recognition”. For example, if you picked FreeBASIC, we might extract the following line from it:
Source of truth: "IDEs specifically made for FreeBASIC include FBide and FbEdit,[5] while more graphical options include WinFBE Suite and VisualFBEditor."
This line is our source of truth. We then use an LLM to work backwards from this fact into a question and answer:
Question: "What programming language are the IDEs WinFBE Suite and FbEdit designed to support?"
Reference Answer: "FreeBasic"
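Here’s a rough sketch of that flow in Python. It’s illustrative only: the spaCy model, the OpenAI call, the prompt wording, and the helper names (extract_facts, generate_qa_pair) are stand-ins rather than our actual pipeline.

    # Illustrative sketch: extract entity-bearing sentences with spaCy NER, then
    # ask an LLM to work backwards from each "source of truth" sentence into a
    # question and reference answer. Model names, prompts, and helpers are
    # assumptions for the example, not Talc's production code.
    import json

    import spacy
    from openai import OpenAI

    nlp = spacy.load("en_core_web_sm")   # small English model with an NER component
    client = OpenAI()                    # assumes OPENAI_API_KEY is set

    def extract_facts(article_text: str) -> list[str]:
        """Return sentences that mention at least two named entities."""
        doc = nlp(article_text)
        return [sent.text.strip() for sent in doc.sents if len(sent.ents) >= 2]

    def generate_qa_pair(fact: str) -> dict:
        """Work backwards from a source-of-truth sentence to a question/answer pair."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": (
                    "Write one question whose answer is stated in the sentence below, "
                    "plus a short reference answer. Reply as JSON with keys "
                    f"'question' and 'reference_answer'.\n\nSentence: {fact}"
                ),
            }],
        )
        return json.loads(resp.choices[0].message.content)

    source_of_truth = (
        "IDEs specifically made for FreeBASIC include FBide and FbEdit, "
        "while more graphical options include WinFBE Suite and VisualFBEditor."
    )
    print(generate_qa_pair(source_of_truth))
    # e.g. {"question": "What programming language are the IDEs WinFBE Suite and
    #       FbEdit designed to support?", "reference_answer": "FreeBASIC"}

The key point is that the fact is extracted deterministically before any LLM gets involved, so every generated question is grounded in a sentence we can point back to.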
We can then grade answers accurately by comparing them against the reference answer and the original source of truth. This is how we generate the “simple” questions in the demo.
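The grading step can be sketched the same way; the judge prompt and the grade_answer helper below are illustrative assumptions, not our production grader.

    # Illustrative sketch of the grading step: an LLM judge checks the
    # application's answer against the reference answer and the source of truth.
    from openai import OpenAI

    client = OpenAI()

    def grade_answer(source_of_truth: str, reference_answer: str, candidate_answer: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Source of truth: {source_of_truth}\n"
                    f"Reference answer: {reference_answer}\n"
                    f"Candidate answer: {candidate_answer}\n\n"
                    "Does the candidate answer agree with the reference answer and "
                    "the source of truth? Reply with exactly PASS or FAIL."
                ),
            }],
        )
        return resp.choices[0].message.content.strip().upper().startswith("PASS")

    # Usage against the FreeBASIC example:
    #   grade_answer(source_of_truth, "FreeBASIC", my_rag_app(question))
    # where my_rag_app is whatever application you are testing.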
In production we’re building this same functionality on our customers' knowledge bases instead of Wikipedia. We then employ a few different strategies to generate questions, ranging from simple factual questions like “How much does the 2024 Chevy Tahoe cost?” to complex questions like “What would a mechanic have to do to fix the recall on my 2018 Golf?” These questions are based on facts extracted from your knowledge base and on real customer examples.
This testing and grading process is fast: it’s driven by a mixture of LLMs and traditional algorithms, and can turn results around in minutes. Our business model is pretty simple - we charge for each test created. If you opt to use our grading product as well, we charge for each example graded against the test.
We’re excited to hear what the HN community thinks – please let us know in the comments if you have any feedback, questions or concerns!
>Which MLB player won the Sporting News MLB Rookie of the Year Award as a pitcher in 1980, and who did Cal Ripken Jr. surpass to hold the record for most home runs hit as a shortstop?
>What team did Britt Burns play for in the minor leagues before making his MLB debut, and in what year did Cal Ripken Jr. break the consecutive games played record?
>Who was the minor league pitching coordinator for the Houston Astros until 2010, and what significant baseball record did Cal Ripken Jr. break in 1995?
All five questions combine a fact about Britt Burns with an unrelated Cal Ripken fact.
Why is this? Britt Burns doesn't seem to appear on the live Wikipedia page for Ripken. Does he appear on a cached version? Or is it forming complex questions by finding another page in the same category as Ripken and pulling more facts?