For the Chevy Tahoe example, you are referencing the dealership, but that wasn't a case of the implementation failing a positive test for fact extraction; it was a failure to test the guardrails.
Aren't the guardrail tests much harder, since they are open-ended and have to guard against unknown prompt injections, while the fact tests are much simpler?
I think a test suite that guards against the infinite surface area is more valuable than testing whether a question matches a reference answer.
Interested in how you view testing against giving a wrong answer outside the predefined scope, as opposed to testing that all the test questions match a reference.
Totally - certain types of failures are much harder to test than others.
We have a couple of different test generation strategies. As you can see in the demo and examples, the most basic one is "ask about a fact".
Two of our other strategies are closer to what you're asking for:
1. tests that try to deliberately induce hallucination by implying some fact that isn't in the knowledge base. For example, "do I need a pilot's license to activate the flight mode on the new Chevy Tahoe?" implies the existence of a feature that doesn't exist (yet). This was really hard to get right, and we have some coverage here but are still improving it (there's a rough sketch of the idea after this list).
2. actively malicious interactions that try to override facts in the knowledge base. These are easy to generate.
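To make strategy 1 concrete, here's a minimal sketch of one way to generate that kind of hallucination-inducing question. This is not our implementation; the model name, prompt wording, and the `hallucination_probe` helper are all illustrative assumptions.

```python
# Hypothetical sketch: generate a question that presupposes a fact
# NOT in the knowledge base. Not the product's actual pipeline.
from openai import OpenAI

client = OpenAI()


def hallucination_probe(kb_facts: list[str], topic: str) -> str:
    """Ask a model to write a question implying a feature that is
    absent from the supplied knowledge-base facts."""
    prompt = (
        f"Here are the known facts about {topic}:\n"
        + "\n".join(f"- {fact}" for fact in kb_facts)
        + "\n\nWrite one user question that presupposes a feature or fact "
        "NOT covered above, phrased as if it were real. "
        "Return only the question."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


if __name__ == "__main__":
    facts = [
        "The 2024 Chevy Tahoe offers a 5.3L V8 engine.",
        "It seats up to eight passengers.",
    ]
    # The expected chatbot behaviour for a question like this is a
    # refusal or correction, not a reference answer.
    print(hallucination_probe(facts, "the 2024 Chevy Tahoe"))
```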
Just as some feedback: I did the demo with the "VW Beetle" topic, and one of the test cases was:
> Question: How did the introduction of the Volkswagen Golf impact the production and sales of the Beetle?
> Expected: The introduction of the Volkswagen Golf, a front-wheel drive hatchback, marked a shift in consumer preference towards more modern car designs. The Golf eventually became Volkswagen's most successful model since the Beetle, leading to a decline in Beetle production and sales. Beetle production continued in smaller numbers at other German factories until it shifted to Brazil and Mexico, where low operating costs were more important.
> GPT Response: The introduction of the Volkswagen Golf impacted the production and sales of the Beetle by gradually decreasing demand for the Beetle and shifting focus towards the Golf.
It seems that the GPT response matches the expected answer, but it was graded as incorrect. To me the GPT answer is correct.
In fact, a couple of the other answers are marked incorrect as well:
> Question: What was the Volkswagen Beetle's engine layout?
> Expected Answer: Rear-engine, rear-wheel-drive layout
> GPT Response: The Volkswagen Beetle had a rear-engine layout.
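For illustration, a more lenient grading rubric that checks for contradiction rather than full coverage would accept these responses. This is just a hypothetical LLM-as-judge sketch, not the demo's actual grader; the model name and prompt are assumptions.

```python
# Hypothetical lenient grader: pass if the candidate answer does not
# contradict the reference, even when it covers fewer details.
from openai import OpenAI

client = OpenAI()


def grade(question: str, reference: str, response: str) -> bool:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {response}\n\n"
        "Does the candidate answer contradict the reference answer? "
        "Answer only YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("NO")


print(grade(
    "What was the Volkswagen Beetle's engine layout?",
    "Rear-engine, rear-wheel-drive layout",
    "The Volkswagen Beetle had a rear-engine layout.",
))  # a partial but non-contradictory answer would pass
```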
Yeah, someone is going to build this. We considered quizzing the user on the topic instead of chatgpt for our demo. It's a lot of fun to test your knowledge on any topic, but it was a worse demo because it was way less related to our current product.
I think that one of the obvious next big spaces for LLMs is education. I already find chatgpt useful when learning myself. That being said, I'm terrified of trying to sell things to schools.