Yeah, I'm coming to believe that this is a much, much, much harder problem than it looks. Getting it running is pretty easy, but actually tuning the results to make them better is tricky, especially if you're not a domain expert in the area you're working on.
Evals seem like a solution, but they're very tied to specific examples, so building a good set might be most of the work in getting this right: with good evals you can actually measure performance and test out different approaches.
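To make that concrete, the eval set doesn't have to be anything fancy to be useful. A minimal sketch of what I mean, where `run_pipeline` is a stand-in for whatever retrieval + generation stack is under test and the golden cases are made-up examples, not real data:

```python
# Minimal eval-harness sketch: a golden set of expert-written questions plus
# facts an acceptable answer must mention; run each question through the RAG
# pipeline and report the fraction of cases where all required facts appear.

GOLDEN_SET = [
    {
        "question": "What is the notice period for ending a fixed-term contract?",
        "must_mention": ["30 days", "written notice"],  # made-up example facts
    },
    # ... more expert-written cases
]

def run_pipeline(question: str) -> str:
    """Placeholder for the pipeline under test (retrieve, then generate)."""
    raise NotImplementedError

def accuracy(golden_set) -> float:
    passed = 0
    for case in golden_set:
        answer = run_pipeline(case["question"]).lower()
        if all(fact.lower() in answer for fact in case["must_mention"]):
            passed += 1
    return passed / len(golden_set)

if __name__ == "__main__":
    print(f"accuracy: {accuracy(GOLDEN_SET):.0%}")
```

Crude substring matching obviously misses paraphrases, but even a rubric this blunt gives you a number you can watch move as you change chunking, retrieval, or prompts.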
Embeddings also seem to be a bit of a dark art: every tutorial uses some small toy example, and I haven't seen much work comparing the performance of particular embedding models on domain-specific tasks.
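That said, nothing stops you from measuring it on your own data. A rough sketch of a recall@k comparison, assuming sentence-transformers and a small set of (query, relevant chunk) pairs labeled with help from domain experts; the model names are just examples:

```python
# Embedding comparison sketch: for each candidate model, embed the corpus and
# the queries, then check how often the known-relevant chunk shows up in the
# top-k results by cosine similarity (recall@k).
import numpy as np
from sentence_transformers import SentenceTransformer

CANDIDATE_MODELS = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]  # example model names

def recall_at_k(model_name, chunks, labeled_queries, k=5):
    """chunks: list of text chunks; labeled_queries: list of (query, index of relevant chunk)."""
    model = SentenceTransformer(model_name)
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    hits = 0
    for query, relevant_idx in labeled_queries:
        q_vec = model.encode([query], normalize_embeddings=True)[0]
        top_k = np.argsort(chunk_vecs @ q_vec)[-k:]  # cosine similarity; vectors are normalized
        if relevant_idx in top_k:
            hits += 1
    return hits / len(labeled_queries)

# usage: run it over your own corpus and labeled queries
# for name in CANDIDATE_MODELS:
#     print(name, recall_at_k(name, chunks, labeled_queries))
```

Even a few dozen labeled pairs can be enough to see whether a bigger or domain-tuned embedding model actually buys you anything on your corpus.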
This was our experience trying to deploy a knowledge base for our Org.
You have to get the domain experts to help you build evals, and you need a good pipeline for testing the LLM against them as you make changes. We were never able to get there before the project was killed. Our use case was potentially giving career-altering legal advice, and we only made it to roughly 80% accuracy on our very informal eval. The domain experts wanted nothing to do with actually helping build the tool. Their idea of "testing" was asking 3 softball questions and saying "yeah, it's good to go".
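For what it's worth, the "good pipeline for testing" part doesn't need to be exotic. A sketch of the kind of regression check I mean (pytest-style; `run_eval` and the 0.80 floor are placeholders, not recommendations):

```python
# Regression-check sketch (pytest): run the expert-written eval set on every
# change and fail if accuracy drops below an agreed floor. run_eval() is a
# placeholder for whatever scoring you settle on; 0.80 is an example threshold,
# not a recommendation.

ACCURACY_FLOOR = 0.80

def run_eval() -> float:
    """Placeholder: run each golden question through the pipeline, return accuracy."""
    raise NotImplementedError

def test_accuracy_does_not_regress():
    assert run_eval() >= ACCURACY_FLOOR
```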
I think on a personal level you could probably get a usable tool that works well enough most of the time. But for anything going to production where people actually depend on it, this isn't an easy problem to solve. That said, I do think it's doable.
So everybody is using roughly the same method with some tweaks here and there, and thus getting similar-quality results.