Hacker News | jeffreyip's comments

You sure can! A few lines of code is all it takes, plus a few simple rules to follow, as shown here: https://docs.confident-ai.com/guides/guides-building-custom-...

If you're using DSPy, you can also include it directly in the custom metric from the link above. It's hard for me to say with 100% certainty that there are advantages to doing this within DeepEval, but 8/10 times running evals in our ecosystem brings you more benefits than drawbacks. Let me know if you have trouble setting it up!
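
Roughly what that looks like (a sketch following the BaseMetric pattern in the guide above; the exact-match scoring here is just a stand-in for whatever logic you want to plug in, DSPy calls included):

    from deepeval.metrics import BaseMetric
    from deepeval.test_case import LLMTestCase

    class ExactMatchMetric(BaseMetric):
        # illustrative scoring only: swap this out for your own logic,
        # e.g. running a DSPy module and scoring its output
        def __init__(self, threshold: float = 0.5):
            self.threshold = threshold

        def measure(self, test_case: LLMTestCase) -> float:
            self.score = 1.0 if test_case.actual_output == test_case.expected_output else 0.0
            self.success = self.score >= self.threshold
            return self.score

        async def a_measure(self, test_case: LLMTestCase) -> float:
            return self.measure(test_case)

        def is_successful(self) -> bool:
            return self.success

        @property
        def __name__(self):
            return "Exact Match"

Once defined, it should plug into evaluate() / assert_test like any built-in metric.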


Definitely, feel free to join our discord for any questions on it: https://discord.com/invite/a3K9c8GRGt


Do check it out, the early feedback has been great: https://docs.confident-ai.com/docs/metrics-dag


Interesting, how are you remixing the order of questions? If we're talking about an academic benchmark like MMLU, the questions are independent of one another. Unless you're generating multiple answers in one go?

We do support synthetic data generation for custom application use cases such as RAG, summarization, text-to-SQL, etc. We call this module the "synthesizer", and you can customize your data generation pipeline however you want (I think, let me know otherwise!).

Docs for the synthesizer are here: https://docs.confident-ai.com/docs/synthesizer-introduction. There's a nice "how does it work" section at the bottom explaining it in more depth.
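
For a quick feel of the API, usage is roughly like this (a minimal sketch; the file paths are made up, and the docs cover the knobs for customizing the generation pipeline):

    from deepeval.synthesizer import Synthesizer

    synthesizer = Synthesizer()
    # generate goldens (input + expected output pairs) straight from your own documents
    synthesizer.generate_goldens_from_docs(
        document_paths=["handbook.pdf", "faq.md"],  # made-up paths for illustration
    )
    print(synthesizer.synthetic_goldens)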


>Interesting, how are you remixing the order of questions? If we're talking about an academic benchmark like MMLU, the questions are independent of one another. Unless you're generating multiple answers in one go?

Short version: if a model can answer a very high proportion of questions from a benchmark accurately, then the next step is to ask it two or more questions at a time. On some models the quality of the answers varies dramatically with which question is asked first.
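
Roughly the idea (a sketch, not my actual harness):

    # take questions the model answers correctly in isolation and ask them
    # in pairs, in both orders, to see whether answer quality shifts
    from itertools import permutations

    def paired_prompts(questions):
        for q1, q2 in permutations(questions, 2):
            yield (
                "Answer both questions, keeping them in order.\n"
                f"Question 1: {q1}\n"
                f"Question 2: {q2}"
            )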

>Docs for synthesizer's here: https://docs.confident-ai.com/docs/synthesizer-introduction, there's a nice "how does it work" section at the bottom explaining it more.

Very good start, but the statistics of the generated text matter a lot.

As an example, on a dumb-as-bricks benchmark I've designed, I can saturate the reasoning capabilities of all non-reasoning models just by varying the names of objects in the questions. A model that could get a normalized score of 14 with standard object strings could score as high as 18 with one-letter strings standing in for objects, and as low as zero with arbitrary UTF-8 character strings, which turned out to matter a lot since all the data was polluted with international text coming from the stock exchanges.
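
To make the kind of substitution concrete (illustrative only, not the real benchmark):

    import random

    # same question template, three "statistics" of object names:
    # plain words, single letters, and arbitrary non-ASCII strings
    TEMPLATE = "A crate holds a {a} and a {b}. The {a} is removed. What is left in the crate?"

    def make_variant(style: str) -> str:
        if style == "words":
            a, b = "hammer", "wrench"
        elif style == "letters":
            a, b = "X", "Y"
        else:  # arbitrary strings built from CJK code points
            a = "".join(chr(random.randint(0x4E00, 0x9FFF)) for _ in range(4))
            b = "".join(chr(random.randint(0x4E00, 0x9FFF)) for _ in range(4))
        return TEMPLATE.format(a=a, b=b)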

Feel free to drop me a line if you're interested in a more in-depth conversation. LLMs are _ridiculously_ under-tested for how many places they show up in.


Hey yes would definitely love to, my contact info is in my bio, please drop me an email :)


It's actually langfuse.com! Our quickstart walks you through the whole process: https://docs.confident-ai.com/confident-ai/confident-ai-intr...


That's great! Hope you enjoyed it :)


Thanks, and great question! There are a ton of eval tools out there, but only a few actually focus on evals. The quality of LLM evaluation depends on the quality of the dataset and the quality of the metrics, so tools that focus more on the platform side of things (observability/tracing) tend to fall short on accurate and reliable benchmarking. What tends to happen with those tools is that users reach for them for one-off debugging, but when errors only happen 1% of the time, there's no capability for regression testing.
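
To make the regression-testing point concrete, here's roughly what it looks like on the DeepEval side (a sketch with a made-up test case; in practice the test cases come from your dataset and the actual outputs from your LLM app):

    import pytest
    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    # placeholder case for illustration; build these from your own dataset
    test_cases = [
        LLMTestCase(
            input="What does the refund policy cover?",
            actual_output="Refunds are available within 30 days of purchase.",
        ),
    ]

    @pytest.mark.parametrize("test_case", test_cases)
    def test_llm_app(test_case: LLMTestCase):
        assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

Run it with `deepeval test run test_llm_app.py` and every case becomes a pass/fail unit test you can gate CI on.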

Since we own the metrics and the algorithms that we've spent the last year iterating on with our users, we can balance giving engineers the ability to customize our metric algorithms and evaluation techniques with letting them bring everything to the cloud for their organization when they're ready.

This brings me to the tools that do have their own metrics and evals. Including us, there are only three companies out there that do this to a good extent (excuse me for this one), and we're the only one with a self-serve platform, so any open-source user can get the benefits of Confident AI as well.

That's not the whole difference, because if you compare DeepEval's metrics on more nuanced details (which I think is very important), we provide the most customizable metrics out there. This includes the research-backed, SOTA LLM-as-a-judge G-Eval for any criteria, and the recently released DAG metric, which is decision-based and virtually deterministic despite being LLM-evaluated. This means that as users' use cases get more and more specific, they can stick with our metrics and benefit from DeepEval's ecosystem as well (metric caching, cost tracking, parallelization, Pytest integration for CI/CD, Confident AI, etc.).
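
To give a flavor of the G-Eval side, usage is roughly like this (a sketch with made-up inputs; the criteria string can be anything your use case needs):

    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams

    correctness = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually consistent with the expected output.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
    )

    test_case = LLMTestCase(
        input="When was the company founded?",  # made-up example
        actual_output="It was founded in 2021.",
        expected_output="The company was founded in 2020.",
    )
    correctness.measure(test_case)
    print(correctness.score, correctness.reason)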

There's so much more, such as generating synthetic data so you can start testing even without a prepared test set, and red-teaming for safety testing (so not just testing for functionality), but I'm going to stop here for now.


I see. Although most users come to us for evaluating LLM applications, you're correct that academic benchmarking of foundational models is also offered in DeepEval, which I'm assuming is what you're talking about.

We actually designed it to work easily off any API. You just create a wrapper around your API and you're good to go. We take care of the async/concurrent handling of the benchmarking, so evaluation speed is really only limited by the rate limit of your LLM API.

This link shows what a wrapper looks like: https://docs.confident-ai.com/guides/guides-using-custom-llm...

And once you have your model wrapper set up, you can use any benchmark we provide.
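
End to end, it's roughly this (a sketch; the endpoint and model name are placeholders, and the linked guide has the full wrapper requirements):

    import requests
    from deepeval.models import DeepEvalBaseLLM
    from deepeval.benchmarks import MMLU

    class MyAPIModel(DeepEvalBaseLLM):
        """Wrapper around a (hypothetical) HTTP completion endpoint."""

        def load_model(self):
            return None  # nothing to load locally for a remote API

        def generate(self, prompt: str) -> str:
            # placeholder endpoint; point this at your own API
            resp = requests.post("https://api.example.com/complete", json={"prompt": prompt})
            return resp.json()["text"]

        async def a_generate(self, prompt: str) -> str:
            return self.generate(prompt)

        def get_model_name(self):
            return "my-api-model"

    benchmark = MMLU()
    benchmark.evaluate(model=MyAPIModel())
    print(benchmark.overall_score)

The same wrapper should work for the other benchmarks we ship (HellaSwag, BigBenchHard, and so on).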

