Launch HN: Talc AI (YC S23) – Test Sets for AI
132 points by maxrmk on Jan 18, 2024 | 47 comments
Hey all! Max and Matt here from Talc AI. We do automated QA for anything built on top of an LLM. Check out our demo: https://talc.ai/demo

We’ve found that it's very difficult to know how well LLM applications (and especially RAG systems) are going to work in the wild. Many companies tackle this by having developers or contractors run tests manually. It’s a slow process that holds back development, and often results in unexpected behavior when the application ships.

We’ve dealt with similar problems before; Max was a staff engineer working on systematic technical solutions for privacy problems at Facebook, and Matt worked on ML ops on Facebook's election integrity team, helping run classifiers that handled trillions of data points. We learned that even the best predictive systems need to be deeply understood and trusted to be useful to product teams, and set out to build the same understanding in AI.

To solve this, we take ideas from academia on how to benchmark the general capabilities of language models, and apply them to generating domain specific test cases that run against your actual prompts and code.

Consider an analogy: If you’re a lawyer, we don’t need to be lawyers to open up a legal textbook and test your knowledge of the content. Similarly if you’re building a legal AI application, we don’t need to build your application to come up with an effective set of tests that can benchmark your performance.

To make this more concrete - when you pick a topic in the demo, we grab the associated Wikipedia page and extract a bunch of facts from it using a classic NLP technique called “named entity recognition”. For example, if you picked FreeBASIC, we might extract the following line from it:

    Source of truth: "IDEs specifically made for FreeBASIC include FBide and FbEdit,[5] while more graphical options include WinFBE Suite and VisualFBEditor." 

This line is our source of truth. We then use an LLM to work backwards from this fact into a question and answer:

    Question: "What programming language are the IDEs WinFBE Suite and FbEdit designed to support?"
    Reference Answer: "FreeBasic"

We can then evaluate accurately by comparing the reference answer and the original source of truth; this is how we generate “simple” questions in the demo.
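If it helps to picture the loop, here's a rough sketch of what that generate-and-grade pipeline can look like. To be clear, this is illustrative rather than our production code -- spaCy, the OpenAI calls, the model name, and the prompt wording are all stand-ins:

    # Illustrative sketch only: spaCy for NER, an OpenAI-style chat API for
    # question generation and grading. Model names and prompts are placeholders.
    import json
    import spacy
    from openai import OpenAI

    nlp = spacy.load("en_core_web_sm")
    client = OpenAI()

    def extract_facts(page_text: str) -> list[str]:
        """Keep sentences that mention named entities -- candidate sources of truth."""
        doc = nlp(page_text)
        return [sent.text.strip() for sent in doc.sents if sent.ents]

    def fact_to_qa(fact: str) -> dict:
        """Work backwards from a fact into a question / reference answer pair."""
        prompt = (
            "Turn this fact into a quiz question and a short reference answer.\n"
            f"Fact: {fact}\n"
            'Respond as JSON with keys "question" and "answer".'
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder for any capable chat model
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content) | {"source": fact}

    def grade(candidate_answer: str, qa: dict) -> bool:
        """Simple pass/fail: does the candidate answer agree with the source of truth?"""
        prompt = (
            f"Source of truth: {qa['source']}\n"
            f"Reference answer: {qa['answer']}\n"
            f"Candidate answer: {candidate_answer}\n"
            "Does the candidate answer agree with the source of truth? Reply YES or NO."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

The real grading is more nuanced than a bare YES/NO, but the shape of the loop is the same: extract a fact, turn it into a question, and check the application's answer against the original fact.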

In production we’re building this same functionality on our customers' knowledge base instead of Wikipedia. We then employ a few different strategies to generate questions – these range from simple factual questions like “how much does the 2024 Chevy Tahoe cost” to complex questions like “What would a mechanic have to do to fix the recall on my 2018 Golf?” These questions are based on facts extracted from your knowledge base and real customer examples.

This testing and grading process is fast – it’s driven by a mixture of LLMs and traditional algorithms, and can turn around in minutes. Our business model is pretty simple - we charge for each test created. If you opt to use our grading product as well we charge for each example graded against the test.

We’re excited to hear what the HN community thinks – please let us know in the comments if you have any feedback, questions or concerns!




I tried the demo with Cal Ripken Jr. I was surprised by some of the complex questions:

>Which MLB player won the Sporting News MLB Rookie of the Year Award as a pitcher in 1980, and who did Cal Ripken Jr. surpass to hold the record for most home runs hit as a shortstop?

>What team did Britt Burns play for in the minor leagues before making his MLB debut, and in what year did Cal Ripken Jr. break the consecutive games played record?

>Who was the minor league pitching coordinator for the Houston Astros until 2010, and what significant baseball record did Cal Ripken Jr. break in 1995?

All five questions are a combination of a question about a Britt Burns fact and an unrelated Cal Ripken fact.

Why is this? Britt Burns doesn't seem to appear on the live Wikipedia page for Ripken. Does he appear on a cached version? Or is it forming complex questions by finding another page in the same category as Ripken and pulling more facts?


I was worried people would run into this quirk in the demo. We have several 'advanced' question generation strategies. You correctly guessed the one we're using in the demo: forming complex questions by finding another page in the same category as Ripken and pulling more facts.

Normally we pull a ton of related topics and try to pick the best, but to keep the generation fast and cost effective in the demo I limited the number of related pages we pull. So sometimes (like in this case) you get something barely related and end up with odd, disjointed questions.


Ah, that does make sense - especially for a category like a baseball player where it may take a lot of other pages to find one that's truly related. Would be expensive for a demo, but not a big deal for a real evaluation.


Yep, in real use cases the latency for generating questions doesn't really matter. But in the demo I was really worried about it.


I love your demo for this. It's one of the best demos I've ever come across in a launch HN. Very easy to understand and use. It seems to suffer with more complex questions though. For example:

Question: Why does the pUC19 plasmid have a high copy number in bacterial cells?

Expected answer: The pUC19 plasmid has a high copy number due to the lack of the rop gene and a single point mutation in the origin of replication (ori) derived from the plasmid pMB1.

GPT response: The pUC19 plasmid has a high copy number in bacterial cells due to the presence of the pUC origin of replication, which allows for efficient and rapid replication of the plasmid.

Both are technically correct; the expected answer is simply more detailed about the pUC origin. It seems difficult to test things like this, but maybe that's just not possible to really get right.

I wonder how well things like FutureHouse's wikicrow will work for summarizing knowledge better - https://www.futurehouse.org/wikicrow - and how that could be benchmarked against Talc


Thank you for the kind words!

One of my regrets about the demo is that we paid a lot of attention to showing off our ability to generate high quality Q/A pairs, but not nearly as much to showing what a thoughtful and thorough grading rubric can do.

It's totally possible to do high quality grading given a rubric that sets expectations! Great implementations we've seen use categories like correct / correct but incomplete / correct but unhelpful / incorrect to better label the situation you describe. We've found that we can grade with much more nuance given a good rubric and categories, but unfortunately we didn't focus on that side of things in the demo.
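As a rough sketch (the prompt wording and model here are illustrative, not our actual rubric), grading against a rubric can be as simple as asking a model to pick one of those labels:

    # Illustrative rubric-based grader using the categories mentioned above.
    # Prompt wording and model name are placeholders, not Talc's actual rubric.
    from openai import OpenAI

    client = OpenAI()

    RUBRIC = """Grade the candidate answer against the reference answer.
    Pick exactly one label:
    - CORRECT: matches the reference in substance
    - CORRECT_BUT_INCOMPLETE: right, but missing details the reference includes
    - CORRECT_BUT_UNHELPFUL: factually right, but doesn't usefully address the question
    - INCORRECT: contradicts the reference or the source of truth
    Reply with the label only."""

    def grade_with_rubric(question: str, reference: str, candidate: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder
            messages=[
                {"role": "system", "content": RUBRIC},
                {"role": "user", "content": f"Question: {question}\n"
                                            f"Reference answer: {reference}\n"
                                            f"Candidate answer: {candidate}"},
            ],
        )
        return resp.choices[0].message.content.strip()

On the pUC19 example above, a grader like this should land on CORRECT_BUT_INCOMPLETE rather than a flat fail.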

I'm not familiar with wikicrow, will check it out!


Congrats on the launch!

I've been interested in automatic testset generation because I find that the chore of writing tests is one of the reasons people shy away from evals. Recently landed eval testset generation for promptfoo (https://github.com/typpo/promptfoo), but it is non-RAG so more simplistic than your implementation.

Was also eyeballing this paper https://arxiv.org/abs/2401.03038, which outlines a method for generating asserts from prompt version history that may also be useful for these eval tools.


Thanks! I've been following promptfoo, so I'm glad to see you here. In addition to automatic evals I think every engineer and PM using LLMs should be looking at as many real responses as they can _every day_, and promptfoo is a great way to do that.


For the Chevy Tahoe example, you're referencing the dealership incident, but that wasn't a case of the implementation failing a positive test for fact extraction; it was a failure to test the guardrails.

Aren't the guardrail tests much harder, since they are open-ended and have to guard against unknown prompt injections, while the tests of facts are much simpler?

I think a test suite that guards against the infinite surface area is more valuable than testing if a question matches a reference answer.

Interested in how you view testing against giving a wrong answer outside of the predefined scope, as opposed to testing that all the test questions match a reference.


Totally - certain types of failures are much harder to test than others.

We have a couple of different test generation strategies. As you can see in the demo and examples, the most basic one is "ask about a fact".

Two of our other strategies are closer to what you're asking for (rough sketches after the list):

1. tests that try to deliberately induce hallucination by implying some fact that isn't in the knowledge base. For example "do I need a pilot's license to activate the flight mode on the new Chevy Tahoe?" implies the existence of a feature that doesn't exist (yet). This was really hard to get right, and we have some coverage here but are still improving it.

2. actively malicious interactions that try to override facts in the knowledge base. These are easy to generate.
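Very roughly, both strategies can be sketched as prompts over the knowledge base (illustrative only -- the prompts and model below aren't our actual pipeline):

    # Illustrative sketches of the two strategies above. Prompts and model name
    # are placeholders for the general idea, not Talc's implementation.
    from openai import OpenAI

    client = OpenAI()

    def hallucination_bait(knowledge_base_summary: str) -> str:
        """A question that presupposes a feature the knowledge base doesn't mention."""
        prompt = (
            "Here is a summary of what the product documentation covers:\n"
            f"{knowledge_base_summary}\n"
            "Write a customer question that casually presupposes a feature the "
            "documentation does NOT mention, so a faithful assistant should push back."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip()

    def adversarial_override(fact: str) -> str:
        """A message that tries to talk the assistant out of a known fact."""
        prompt = (
            f"Known fact: {fact}\n"
            "Write a pushy customer message that insists the opposite is true and "
            "pressures the assistant to agree."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip()

The hard part for the first strategy is generating bait that's plausible but genuinely absent from the knowledge base, which is why it took a while to get right.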


Cool.

Just as some feedback I did the demo with the "VW Beetle" topic and one of the test cases was:

> Question: How did the introduction of the Volkswagen Golf impact the production and sales of the Beetle?

> Expected: The introduction of the Volkswagen Golf, a front-wheel drive hatchback, marked a shift in consumer preference towards more modern car designs. The Golf eventually became Volkswagen's most successful model since the Beetle, leading to a decline in Beetle production and sales. Beetle production continued in smaller numbers at other German factories until it shifted to Brazil and Mexico, where low operating costs were more important.

> GPT Response: The introduction of the Volkswagen Golf impacted the production and sales of the Beetle by gradually decreasing demand for the Beetle and shifting focus towards the Golf.

It seems that the GPT response matches the expected answer, but it was graded as incorrect. To me the GPT answer is correct.

In fact a couple of the other answers are marked incorrectly:

> Question: What was the Volkswagen Beetle's engine layout?

> Expected Answer: Rear-engine, rear-wheel-drive layout

> GPT Response: The Volkswagen Beetle had a rear-engine layout.

was marked as incorrect.


Will take a look, thanks!


Also, just a random thing I thought of while playing around with it: a few days ago a guy posted about an AI quiz generator for education.

If you ever need to pivot, it seems like this would do reasonably well in the education space also.


Yeah, someone is going to build this. For our demo we considered quizzing the user on the topic instead of ChatGPT. It's a lot of fun to test your knowledge on any topic, but it was a worse demo because it was way less related to our current product.

I think that one of the obvious next big spaces for LLMs is education. I already find ChatGPT useful when learning myself. That being said, I'm terrified of trying to sell things to schools.


The first thing that popped into my head is: what do you do with the test results? Specifically, how do they feed back into model improvement in a way that avoids overfitting? Do you think having some kind of classical "holdout" question set is enough? Especially with RAG, given the levers that are available (prompt, chunking strategy, ...), I'd wonder whether defining a bunch of test questions means you end up overfitting to them, or to the current data set. How can findings be extrapolated to new situations?


Ah! Think of this more like software testing that goes in CI/CD rather than an ML test or validation set. We're providing this testing for applications built on top of language models.

For example, if you're a SWE working on Bing Chat, you can make a change to how retrieval works and quickly know how it affected accuracy on a range of different test scenarios. This kind of evaluation is done by contractors today, and they are slow and inaccurate.
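Concretely, the end state looks something like a regression test in CI. This is a hypothetical pytest-style sketch -- `my_rag_app`, `grade_answer`, the file name, and the threshold are placeholders, not our SDK:

    # Hypothetical CI gate: fail the build if accuracy on a generated test set
    # drops below a threshold. All names here are placeholders.
    import json

    def my_rag_app(question: str) -> str:
        raise NotImplementedError("the application under test: retrieval + prompt + model")

    def grade_answer(candidate: str, reference: str) -> bool:
        raise NotImplementedError("LLM- or rubric-based grader")

    def test_accuracy_regression():
        with open("generated_test_cases.json") as f:  # produced by the test generator
            test_cases = json.load(f)
        passed = sum(
            grade_answer(my_rag_app(tc["question"]), tc["reference_answer"])
            for tc in test_cases
        )
        accuracy = passed / len(test_cases)
        assert accuracy >= 0.90, f"accuracy regressed to {accuracy:.0%}"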


Pretty neat!

I have a question about how you intend to deal with LLM applications where the output is more creative, e.g. an app where the user input is something like "write me a story about X" and the LLM app is using a higher temperature to get more creative responses. In these cases I don't think it's possible to represent the ideal output as a single string -- it would need to be a more complicated schema, like a list of constraints for the output, e.g. that it contains certain substrings.

TIA!


The TinyStories[1] paper has an interesting solution for how to evaluate stories. They ask GPT-4 to grade them on grammar, consistency, and creativity.

This seems like it would be extremely hard to figure out how to do automatically though.

[1] https://arxiv.org/pdf/2305.07759.pdf


Good question! We aren't really focusing on this area, but I'm willing to speculate.

I'd expect broader constraints than just substring matching. For example, if the user requests that a certain plot point in the story occur before another, we should actually be able to (1) generate a test for that behavior and (2) use a model to check if the request was followed.

I'd expect other tests might be useful too -- checking for things like "no generation of violent content, even if the user requests it".
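Something like this is what I have in mind -- a purely speculative sketch where each constraint is checked by a yes/no model call instead of comparing to a single reference string:

    # Speculative sketch: constraint checks on creative output. Prompt wording
    # and model name are illustrative.
    from openai import OpenAI

    client = OpenAI()

    def satisfies(story: str, constraint: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": (
                f"Story:\n{story}\n\n"
                f"Constraint: {constraint}\n"
                "Does the story satisfy the constraint? Reply YES or NO."
            )}],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    constraints = [
        "The dragon is introduced before the castle is mentioned.",  # ordering constraint
        "The story contains no violent content.",                    # safety constraint
    ]
    # all(satisfies(story, c) for c in constraints) would be the pass/fail signal.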


Congrats on the launch! Just tried the demo and it looks impressive. Good luck.

Are you by any chance hiring global-remote, full-stack/front-end devs? Would love to work with you guys.


Thanks! We aren't hiring right now, but if you shoot me an email at max@talc.ai I'll follow up in a few months.


I think for your idea to have traction, (1) the questions should be selected by their importance and (2) the questions should be chained to allow new results. Just for inspiration, you could create a quiz for solving a puzzle and, at the same time, solve the puzzle by answering the questions. The big idea is using your tool to enhance step-by-step reasoning in LLMs.

I think you could use a text area for the user to indicate if the quiz is about getting the main idea or if it is about testing the details.

And for big clients, the system could be tailored so that the questions and structure reflect user intentions.


Congrats on the launch!

On your pricing model, as it's usage based, don't you incentivize your customers to use your product as little as possible? Wouldn't it be better to have limited tiers with fixed annual/monthly recurring rates? Also, do you sell to enterprise? I assume they would like this setup even more, as the rates are predefined and they have a budget they have to deal with.

I'm currently developing my own pricing model and these are some issues I'm struggling with, so curious what you think.


I'm currently building a billing platform (also a YC company), and I've talked with a lot of different people about pricing strategies. If you want to talk about it in more detail, feel free to email me (in my profile).

Not really a sales thing (I'm the CTO, awful at sales), I just enjoy talking about pricing models with people.


Not a pricing expert by any means, but here were some of the considerations we thought of when we made that decision:

1. There are plenty of usage based infrastructure/dev tools (e.g. AWS, Databricks), so I don't think we incentivize minimal usage.

2. The value we're providing feels directly tied to how much testing we're running, so when we tried to construct tiers they didn't feel helpful.

3. In our experience, enterprises have been fine with usage based pricing. They're already paying for human QA / labelling on usage based terms (even if it's part of a larger fixed contract), so our pricing isn't a deviation for them.

Open to thoughts if anyone has them!


Some quick thoughts,

Usage based pricing, ideally with some sort of free tier, is typically _best_ for individual developers and smaller businesses who are price sensitive but want to get off the ground quickly. Larger businesses can benefit from a usage based model, but enterprises will typically want a more guaranteed price and stability quarter to quarter for planning purposes. This doesn't necessarily mean that usage based pricing doesn't work for enterprise businesses; however, it does mean that they are typically keen on having alerting and limits set up so that they don't accidentally go massively over budget.

One common strategy to continue with usage based pricing for enterprises but have more stability is to have them buy the usage "upfront". As an example, to use this product you may charge them "tokens" which correlate to dollars, and they can buy 1000 tokens upfront which get burned down over their time using the platform.

Another reason why enterprise plans are not typically usage based, or at least why it's less common, is that many enterprises want to self host the product in their own systems in order to meet their internal security and compliance standards for their data. If that ends up being the case, self hosted products typically are not usage based, because sending consistent usage data back out of the enterprise's VPC adds a lot of additional complexity for the product and both business parties. Typically an annual license based structure is the model most businesses go with for self hosted.

Just to disclose a bit, I help companies do billing and figure out their best pricing strategy. Also a YC company. Congrats on the launch, the demo is technically impressive, and the value prop to me is really clear.

If you want to chat at all about it, happy to just talk for a few minutes. My email is in my profile.


As someone who uses Machine Learning to predict the presence of Talc I approve of this, even if I have no use case for it whatsoever.


I like the Chevy Tahoe callback - I'm assuming that's a reference to the Chevy dealership that used an LLM and had people doing prompt tricks to get the chatbot to offer them a Chevy Tahoe for $1.

The specificity in your writing above "to make this more concrete" about how it works was also helpful for understanding the product.


You're exactly right about the Chevy Tahoe reference. I wasn't sure if anyone would get it. I liked that post a lot, because as much as I think LLMs are going to be useful, they have limitations that we haven't solved yet.


I just tried the demo, and it looks great! Congrats on the launch!

I have a couple of questions:

1) How often do you find that the LLM fails to generate the correct question-answer pairs? The biggest challenge I'm facing with LLM-based evaluation is the variability in LLM performance. I've found that the same prompt results in different LLM responses over multiple runs. Do you have any insights on this issue and how to address it?

2) Sometimes, the domain expert generating the test set might not be well-equipped to grade the answers. Consider a customer-facing chatbot application. The RAG app might be focused on very specific user information that might be hard to verify or attest by the test set creator. Do you think there are ways to make this grading process easier?


Thanks! Max's cofounder chiming in here.

1) There's an interesting subtlety in the phrase "the correct question-answer pairs". While we don't often find factually incorrect pairs because of how we're running the pipeline, the bigger question is whether or not the pairs we generated are "the" correct ones -- if they are relevant and helpful. This takes some manual tweaking at the moment.

Inconsistent outputs over different runs are definitely an issue, but most teams we've worked with barely even have the CI/CD practice to be able to measure that rigorously. As we mature we'll aim to tackle flakiness of tests (and models) over time, but a bigger challenge has been getting regular tests like these set up in the first place.

2) In this scenario, we go to the documents powering a RAG application to both generate and grade answers. For example, the knowledge base might know that (1) product A is being recalled, and (2) customer #4 is asking for a warranty claim on product A. Using those two bits of information, we might generate a scenario that tests whether or not customer #4 gets the claim fulfilled. In other words, specific user information is simulated/used during the test set creation.
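To make that concrete, here's a rough sketch of that kind of scenario generation (the prompt and names are illustrative, not our pipeline):

    # Illustrative: combine a policy fact with a simulated customer record to
    # produce a gradeable support scenario. Prompt and model are placeholders.
    from openai import OpenAI

    client = OpenAI()

    def build_scenario(policy_fact: str, customer_fact: str) -> str:
        prompt = (
            f"Knowledge base fact: {policy_fact}\n"
            f"Simulated customer: {customer_fact}\n"
            "Write the message this customer would send to support, plus a one-line "
            "description of the correct outcome, so the exchange can be graded."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    scenario = build_scenario(
        "Product A is under an active recall.",
        "Customer #4 owns product A and is asking for a warranty claim.",
    )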


I think this could be more useful to most people as a prompt/RAG testing service rather than an LLM testing service. If I ran a test and found out the LLM I was using is 60% accurate on some topic, what would I do with this knowledge - build a more accurate LLM? Switch to another? On the other hand, if a service offered me suggestions to improve accuracy by providing a score for various prompt or RAG inputs, I think that would be very useful to many people. It could even uncover a general prompting strategy depending on the underlying LLM or inputs available, which would be really useful.


Looks interesting. How do you rate the correctness? Some complex LLM answers seemed to be correct but not in as much detail as the expected answer.

How do you generate the answers? Does the model have access to the original source of truth (like in RAG apps)?

And in your examples what model do you actually use?


Great questions here!

> How do you rate the correctness? Some complex LLM answers seemed to be correct but not in as much detail as the expected answer.

We support two different modes: a strict pass/fail where an answer has to have all of the information we expect, and a rubric based mode where answers are bucketed into things like "partially correct" or "wrong but helpful".

To be honest, we also get the grading wrong sometimes. If you see anything egregious please email me the topic you used at max@talc.ai

> How do you generate the answers? Does the model have access to the original source of truth (like in RAG apps)?

You guessed right here - we connect to the knowledge base like a RAG app. We also use this to generate the questions -- think of it like reading questions out of a textbook to quiz someone.

> And in your examples what model do you actually use?

We use multiple models for the question generation, and are still evaluating what works best. For the demo, we are "quizzing" OpenAI's GPT-3.5 Turbo model.


Is there a way I can give feedback on wrong labels? The easy questions seem to be correct most (all?) of the time, but I noticed a few errors in the labelling of the complex question/answers. I would love to see this improve even further!


Any feedback like this helps -- shoot me an email at max@talc.ai with the name of the topic you saw incorrect labels on.

We didn't expect this much traction on the demo, or I'd have built this functionality in!


Here's an example where the GPT response was correct, but was marked as incorrect: https://ibb.co/tMGxcf3


Thanks for flagging!


Very, very impressive. I ran a couple of tests, and on the complex questions it received 80%, although I would say that was harsh since the answer could be considered correct. I also found the generated questions rather simple, not complex.

On the 2nd test it was 100% incorrect for the complex questions! However, when I checked the rendered questions directly with GPT-4, it answered 100% correctly. Could that be due to my custom settings in GPT-4? Will run it with university students. Fascinating work.


Thanks for giving it a whirl!

I agree that the current grading is a bit harsh -- the rubric we're using in this demo is fairly rudimentary. What we've seen be more helpful is a range of grades along the lines of correct / correct but unhelpful / correct but incomplete / incorrect. This somewhat depends on individual use cases though.

Let me know which generated questions you thought could be more complex! We're always working on improving our ability to explore the knowledge space for challenging questions.


Congrats on the launch! How does this compare to https://www.patronus.ai/ ? They seem to offer a very similar solution for getting on top of unpredictable LLM output


We're in a similar space, and their work on FinanceBench is great for the whole community, so I appreciate that. Otherwise there's not much out there about their product, so I can't directly compare.


Now someone has to test talc AI. I can do it.

Impressive demo and business idea, congrats, good luck!


who tests the testing tool?

Thanks though -- and let us know if you hit any issues while playing around with the demo!


I think allowing other languages besides English could be a good idea.


Good idea! There's no limitation in the generation or grading, but we didn't set up the search to support this. I'll see if it's possible to enable it in the Wikipedia search component.


Maybe it was already said, but I hit a weird bug where the system said the last answer was incorrect, but the overview still says "100% correct".



