Launch HN: Talc AI (YC S23) – Test Sets for AI

Imnimo · on Jan 18, 2024

I tried the demo with Cal Ripken Jr. I was surprised by some of the complex questions:

>Which MLB player won the Sporting News MLB Rookie of the Year Award as a pitcher in 1980, and who did Cal Ripken Jr. surpass to hold the record for most home runs hit as a shortstop?

>What team did Britt Burns play for in the minor leagues before making his MLB debut, and in what year did Cal Ripken Jr. break the consecutive games played record?

>Who was the minor league pitching coordinator for the Houston Astros until 2010, and what significant baseball record did Cal Ripken Jr. break in 1995?

All five questions are a combination of a question about a Britt Burns fact and an unrelated Cal Ripken fact.

Why is this? Britt Burns doesn't seem to appear on the live Wikipedia page for Ripken. Does he appear on a cached version? Or is it forming complex questions by finding another page in the same category as Ripken and pulling more facts?

maxrmk · on Jan 18, 2024

I was worried people would run into this quirk in the demo. We have several 'advanced' question generation strategies. You correctly guessed the one we're using in the demo; forming complex questions by finding another page in the same category as Ripken and pulling more facts.

Normally we pull a ton of related topic and try to pick the best, but to keep the generation fast and cost effective in the demo I limited the number of related pages we pull. So sometimes (like this case) you get something barely related and end up with odd disjointed questions.

Imnimo · on Jan 18, 2024

Ah, that does make sense - especially for a category like a baseball player where it may take a lot of other pages to find one that's truly related. Would be expensive for a demo, but not a big deal for a real evaluation.

maxrmk · on Jan 18, 2024

Yep, in real use cases the latency for generating questions doesn't really matter. But in the demo I was really worried about it.

koeng · on Jan 18, 2024

I love your demo for this. It's one of the best demos I've ever come across in a launch HN. Very easy to understand and use. It seems to suffer with more complex questions though. For example:

Question: Why does the pUC19 plasmid have a high copy number in bacterial cells?

Expected answer: The pUC19 plasmid has a high copy number due to the lack of the rop gene and a single point mutation in the origin of replication (ori) derived from the plasmid pMB1.

GPT response: The pUC19 plasmid has a high copy number in bacterial cells due to the presence of the pUC origin of replication, which allows for efficient and rapid replication of the plasmid.

Both are technically correct - the expected answer is simply more detailed about the pUC origin, but both would be considered correct. It seems difficult to test things like this, but maybe that's just not possible to really get correct.

I wonder how well things like FutureHouse's wikicrow will work for summarizing knowledge better - https://www.futurehouse.org/wikicrow - and how that could be benchmarked against Talc

matt_lee · on Jan 18, 2024

Thank you for the kind words!

One of my regrets about the demo is that we paid a lot of attention to showing off our ability to generate high quality Q/A pairs, but not nearly as much to showing what a thoughtful and thorough grading rubric can do.

It's totally possible to do a high quality grading given a rubric that sets expectations! Great implementations we've seen use categories like correct / correct but incomplete / correct but unhelpful / incorrect to better label the situation you describe. We've found that we can grade with much more nuance given a good rubric and categories, but unfortunately didn't focus on that side of things in the demo

I'm not familiar with wikicrow, will check it out!

typpo · on Jan 18, 2024

Congrats on the launch!

I've been interested in automatic testset generation because I find that the chore of writing tests is one of the reasons people shy away from evals. Recently landed eval testset generation for promptfoo (https://github.com/typpo/promptfoo), but it is non-RAG so more simplistic than your implementation.

Was also eyeballing this paper https://arxiv.org/abs/2401.03038, which outlines a method for generating asserts from prompt version history that may also be useful for these eval tools.

maxrmk · on Jan 18, 2024

Thanks! I've been following promptfoo, so I'm glad to see you here. In addition to automatic evals I think every engineer and PM using LLMs should be looking at as many real responses as they can _every day_, and promptfoo is a great way to do that.

logiduck · on Jan 18, 2024

For the chevy tahoe example, you are referencing the dealership, but in that case it wasn't a case of the implementation failing to do a positive test for fact extraction, but to test the guardrails.

Aren't the guardrail tests much harder since they are open-ended and have to guard against unknown prompt injections and the test of facts much simpler?

I think a test suite that guards against the infinite surface area is more valuable then testing if a question matches a reference answer.

Interested to how you view testing against giving a wrong answer outside of the predefined scope as opposed to testing that all the test questions match a reference.

maxrmk · on Jan 18, 2024

Totally - certain types of failures are much harder to test than others.

We have a couple of different test generation strategies. As you can see in the demo and examples, the most basic one is "ask about a fact".

Two of our other strategies are closer to what you're asking for:

1. tests that try to deliberately induce hallucination by implying some fact that isn't in the knowledge base. For example "do I need a pilots license to activate the flight mode on the new chevy tahoe?" implies the existence of a feature that doesn't exist (yet). This was really hard to get right, and we have some coverage here but are still improving it.

2. actively malicious interactions that try to override facts in the knowledge base. These are easy to generate.

logiduck · on Jan 18, 2024

Cool.

Just as some feedback I did the demo with the "VW Beetle" topic and one of the test cases was:

> Question: How did the introduction of the Volkswagen Golf impact the production and sales of the Beetle?

> Expected: The introduction of the Volkswagen Golf, a front-wheel drive hatchback, marked a shift in consumer preference towards more modern car designs. The Golf eventually became Volkswagen's most successful model since the Beetle, leading to a decline in Beetle production and sales. Beetle production continued in smaller numbers at other German factories until it shifted to Brazil and Mexico, where low operating costs were more important.

> GPT Response: The introduction of the Volkswagen Golf impacted the production and sales of the Beetle by gradually decreasing demand for the Beetle and shifting focus towards the Golf.

It seems that the GPT responses matches the expected but it was graded as incorrect. But it seems to me the GPT answer is correct.

In fact a couple of the other answers are marked incorrectly:

> Question: What was the Volkswagen Beetle's engine layout? > Expected Answer: Rear-engine, rear-wheel-drive layout > GPT Response: The Volkswagen Beetle had a rear-engine layout.

was marked as incorrect.

maxrmk · on Jan 18, 2024

Will take a look, thanks!

logiduck · on Jan 18, 2024

Also, just a random thing that I thought of playing around with it is a few days ago a guy posted about an AI quiz generator for education.

If you ever need to pivot, it seems like this would do reasonably well in the education space also.

maxrmk · on Jan 18, 2024

Yeah, someone is going to build this. We considered quizzing the user on the topic instead of chatgpt for our demo. It's a lot of fun to test your knowledge on any topic, but it was a worse demo because it was way less related to our current product.

I think that one of the obvious next big spaces for LLMs is education. I already find chatgpt useful when learning myself. That being said, I'm terrified of trying to sell things to schools.

andy99 · on Jan 18, 2024

The first thing that popped into my head is what do you do with the test results? Specifically, how do they feed back into model improvement in a way that avoids overfitting? Do you think having some kind of classical "holdout" question set is enough? Especially with RAG, I'd wonder with the levers that are available (prompt, chunking strategy, ...) if you define a bunch of test questions do you end up overfitting to them, or to the current data set. How can findings be extrapolated to new situations?

maxrmk · on Jan 18, 2024

Ah! Think of this more like software testing that goes in CI/CD rather than an ML test or validation set. We're providing this testing for applications built on top of language models.

For example if you're a SWE working on bing chat, you can make a change to how retrieval works and quickly know how it affected accuracy on a range of different test scenarios. This kind of evaluation is done by contractors today, and they are slow and inaccurate.

nicolewhite · on Jan 19, 2024

Pretty neat!

I have a question about how you intend to deal with LLM applications where the output is more creative, e.g. an app where the user input is something like "write me a story about X" and the LLM app is using a higher temperature to get more creative responses. In these cases I don't think it's possible to represent the ideal output as a single string -- it would need to be a more complicated schema, like a list of constraints for the output, e.g. that it contains certain substrings.

TIA!

robrenaud · on Jan 19, 2024

The TinyStories[1] paper has an interesting solution for how to evaluate stories. They ask GPT-4 to grade them on grammar, consistency, and creativity.

This seems like it would be extremely hard to figure out how to do automatically though.

[1] https://arxiv.org/pdf/2305.07759.pdf

maxrmk · on Jan 19, 2024

Good question! We aren't really focusing on this area, but I'm willing to speculate.

I'd expect broaded constraints than just substring matching. For example, if the user requests that a certain plot point in the story occur before another, we should actually be able to (1) generate a test for that behavior and (2) use a model to check if the request was followed.

I'd expect other tests might be useful too -- checking for things like "no generation of violent content, even if the user requests it".

moinism · on Jan 18, 2024

Congrats on the launch! Just tried the demo and it looks impressive. Good luck.

Are you by any chance hiring global-remote, full-stack/front-end devs? Would love to work with you guys.

maxrmk · on Jan 18, 2024

Thanks! We aren't hiring right now, but if you shoot me an email at max@talc.ai I'll follow up in a few months.

bestai · on Jan 18, 2024

I think for you idea to have traction (1) the questions should be selected by their importance and (2) the questions should be chained to allow new results. Just for inspiration or example you could create a quiz for solving a puzzle and at the same time solving the puzzle by answering the questions. >The big idea is using your tool to enhance step by step rationing in LLM.

I think you could use a text area for the user to indicate if the quiz is about getting the main idea or if it is about testing the details.

And for big clients, the system could be tailored so that the questions and structure reflect user intentions.

julesvr · on Jan 18, 2024

Congrats on the launch!

On your pricing model, as it's usage based, don't you incentivize your customers to use your product as little as possible? Wouldn't it be better to have limited tiers with fixed annual/monthly recurring rates? Also, do you sell to enterprise? I assume these would like this setup even more as the rates are predefined and they have a budget they have to deal with.

I'm currently developing my own pricing model and these are some issues I'm struggling with, so curious what you think.

twosdai · on Jan 19, 2024

Im currently building a billing platform, also a yc company, I've talked with a lot of different people about price strategies. If you want to talk more about in detail feel free to email me. (In my profile)

Not really a sales thing. (Im the cto, awful at sales) I just enjoy talking about pricing models with people.

matt_lee · on Jan 18, 2024

Not a pricing expert by any means, but here were some of the considerations we thought of when we made that decision:

1. There's plenty of usage based infrastructure/dev tools (ie. AWS, databricks), so I don't think we incentivize minimal usage.

2. The value we're providing feels directly tied to how much testing we're running, so when we tried to construct tiers they didn't feel helpful.

3. In our experience, enterprises have been fine with usage based pricing. They're already paying for human QA / labelling on usage based terms (even if it's part of a larger fixed contract), so our pricing isn't a deviation for them.

Open to thoughts if anyone has them!

twosdai · on Jan 19, 2024

Some quick thoughts,

Usage based pricing with some sort of free tier ideally is typically is _best_ for individual developers and smaller businesses who are price sensitive but want to get off the ground quickly. Larger businesses can benefit from a usage based model but typically enterprises will want a more guaranteed price and stability quarter to quarter for planning purposes. This doesn't mean necessarily that usage based pricing doesn't work for enterprise businesses however it does mean that they typically are keen on having alerting and limits setup so that they don't accidentally go massively over budget.

One common strategy to continue with usage based pricing for enterprises but have more stability is to have them buy the usage "upfront". As an example, to use this product you may charge them "tokens" which correlate to dollars, and they can buy 1000 tokens upfront which get burned down over their time using the platform.

Another reason why enterprise plans are not typically usage based, or at least why its less common, is that many enterprises want to self host the product in their own systems in order to meet their internal security and compliance standards for their data. If that ends up being the case, self hosted products typically are not usage based because the act of sending consistent usage data back out of the enterprise's VPC adds a lot of additional complexity for the product and both business parties. Typically an annual license based structure is the model most businesses go with for self hosted.

Just to disclose a bit, I help companies do billing and figure out their best pricing strategy. Also a YC company. Congrats on the launch, the demo is technically impressive, and the value prop to me is really clear.

If you want to chat at all about it, happy to just talk for a few minutes. My email is in my profile.

tommykins · on Jan 18, 2024

As someone who uses Machine Learning to predict the presence of Talc I approve of this, even if I have no use case for it whatsoever.

tikkun · on Jan 18, 2024

I like the chevy tahoe callback - I'm assuming that's a reference to the chevy dealership that used an LLM and had people doing prompt tricks to get the chatbot to offer them a chevy tahoe for $1.

The specificity in your writing above "to make this more concrete" about how it works was also helpful for understanding the product.

maxrmk · on Jan 18, 2024

You're exactly right about the chevy tahoe reference. I wasn't sure if anyone would get it. I liked that post a lot, because as much as I think LLMs are going to be useful they have limitations that we haven't solved yet.

pchunduri6 · on Jan 18, 2024

I just tried the demo, and it looks great! Congrats on the launch!

I have a couple of questions:

1) How often do you find that the LLM fails to generate the correct question-answer pairs? The biggest challenge I'm facing with LLM-based evaluation is the variability in LLM performance. I've found that the same prompt results in different LLM responses over multiple runs. Do you have any insights on this issue and how to address it?

2) Sometimes, the domain expert generating the test set might not be well-equipped to grade the answers. Consider a customer-facing chatbot application. The RAG app might be focused on very specific user information that might be hard to verify or attest by the test set creator. Do you think there are ways to make this grading process easier?

matt_lee · on Jan 18, 2024

Thanks! Max's cofounder chiming in here.

1) There's an interesting subtlety in the phrase "the correct question-answer pairs". While we don't often find factually incorrect pairs because of how we're running the pipeline, the bigger question is whether or not the pairs we generated are "the" correct ones -- if they are relevant and helpful. This takes some manual tweaking at the moment.

Inconsistent outputs over different runs are definitely an issue, but most teams we've worked with barely even have the CI/CD practice to be able to measure that rigorously. As we mature we'll aim to tackle flakiness of tests (and models) over time, but a bigger challenge has been getting regular tests like these set up in the first place.

2) In this scenario, we go to the documents powering a RAG application to both generate and grade answers. For example, the knowledge base might know that (1) product A is being recalled, and (2) customer #4 is asking for a warranty claim on product A. Using those two bits of information, we might generate a scenario that tests whether or not customer #4 gets the claim fulfilled. In other words, specific user information is simulated/used during the test set creation.

thruway516 · on Jan 19, 2024

I think this could be more useful to most people as a prompt/RAG testing service rather than an llm testing service. If I ran a test and found out the llm I was using is 60% accurate on some topic what would I do with this knowledge - build a more accurate llm? Switch to another? On the other hand if a service offered me suggestions to improve accuracy by providing a score for various prompt or RAG inputs, I think this would be very useful to many people. It could even uncover a general prompting strategy depending on the underlying Llm or inputs available which would be really useful

sherlock_h · on Jan 18, 2024

Looks interesting. How do you rate the correctness? Some complex LLM answers seemed to be correct but not in as much detail as the expected answer.

How do you generate the answers? Does the model have access to the original source of truth (like in RAG apps)?

And in your examples what model do you actually use?

maxrmk · on Jan 18, 2024

Great questions here!

> How do you rate the correctness? Some complex LLM answers seemed to be correct but not in as much detail as the expected answer.

We support two different modes: a strict pass/fail where an answer has to have all of the information we expect, and a rubric based mode where answers are bucketed into things like "partially correct" or "wrong but helpful".

To be honest, we also get the grading wrong sometimes. If you see anything egregious please email me the topic you used at max@talc.ai

> How do you generate the answers? Does the model have access to the original source of truth (like in RAG apps)?

You guessed right here - we connect to the knowledge base like a RAG app. We also use this to generate the questions -- think of it like reading questions out of a textbook to quiz someong.

> And in your examples what model do you actually use?

We use multiple models for the question generation, and are still evaluating what works best. For the demo, we are "quizzing" openai's 3.5 turbo model.

maho · on Jan 19, 2024

Is there a way I can give feedback on wrong labels? The easy questions seem to be correct most (all?) of the time, but I noticed a few errors in the labelling of the complex question/answers. I would love to see this improve even further!

maxrmk · on Jan 19, 2024

Any feedback like this helps -- shoot me an email at max@talc.ai with the name of the topic you saw incorrect labels on.

We didn't expect this much traction on the demo, or I'd have built this functionality in!

dkindler · on Jan 18, 2024

Here's an example where the GPT response was correct, but was marked as incorrect: https://ibb.co/tMGxcf3

maxrmk · on Jan 18, 2024

Thanks for flagging!

Robotenomics · on Jan 18, 2024

Very, very impressive.. I ran a couple of tests and on the complex it received 80% although I would say it was harsh as the answer could be said to be correct- although I found the questions generated rather simple not complex.

The 2nd test it was 100% incorrect for the complex questions! However when I checked directly with gpt-4 based upon the questions rendered it answered 100% correct. Could that be due to my custom settings in gpt4? Will run it with university students. Fascinating work

matt_lee · on Jan 18, 2024

Thanks for giving it a whirl!

I agree that the current grading is a bit harsh -- the rubric we're using in this demo is fairly rudimentary. What we've seen be more helpful is a range of grades along the lines of correct / correct but unhelpful / correct but incomplete / incorrect. This somewhat depends on individual use cases though.

Let me know what questions generated you thought could be more complex! We're always working on improving our ability to explore the knowledge space for challenging questions.

ore0s · on Jan 19, 2024

Congrats on the launch! How does this compare to https://www.patronus.ai/ ? They seem to offer a very similar solution for getting on top of unpredictable LLM output

maxrmk · on Jan 19, 2024

We're in a similar space, and their work on FinanceBench is great for the whole community, so I appreciate that. Otherwise there's not much out their about their product so I can't directly compare.

quadcore · on Jan 18, 2024

Now someone has to test talc AI. I can do it.

Impressive demo and business idea, congrats, good luck!

maxrmk · on Jan 18, 2024

who tests the testing tool?

Thanks though -- and let us know if you hit any issues while playing around with the demo!

bestai · on Jan 18, 2024

I think allowing other languages besides English could be a good idea.

maxrmk · on Jan 18, 2024

Good idea! There's no limitation in the generation or grading, but we didn't set up the search to support this. I'll see if it's possible to enable this in the wikipedia search component.

Departed7405 · on Jan 19, 2024

Maybe it was already said, but I have a weird bug where system said last answer is incorrect but in the overview it still says "100% correct".