Yeah, I'm coming to believe that this is a much, much, much harder problem than it looks. Getting it running is pretty easy, but actually tuning the results to make them better is tricky, especially if you're not a domain expert in the area you're working on.
Evals seem like a solution, but they're very tied to specific examples, so building a good set might be most of the work in getting this right: with good evals you can actually measure performance and test out different approaches.
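To make that concrete, the eval set doesn't have to be anything fancy to be useful. A minimal sketch of what I mean, where `run_pipeline` is a stand-in for whatever retrieval + generation stack is under test and the golden cases are made-up examples, not real data:

```python
# Minimal eval-harness sketch: a golden set of expert-written questions plus
# facts an acceptable answer must mention; run each question through the RAG
# pipeline and report the fraction of cases where all required facts appear.

GOLDEN_SET = [
    {
        "question": "What is the notice period for ending a fixed-term contract?",
        "must_mention": ["30 days", "written notice"],  # made-up example facts
    },
    # ... more expert-written cases
]

def run_pipeline(question: str) -> str:
    """Placeholder for the pipeline under test (retrieve, then generate)."""
    raise NotImplementedError

def accuracy(golden_set) -> float:
    passed = 0
    for case in golden_set:
        answer = run_pipeline(case["question"]).lower()
        if all(fact.lower() in answer for fact in case["must_mention"]):
            passed += 1
    return passed / len(golden_set)

if __name__ == "__main__":
    print(f"accuracy: {accuracy(GOLDEN_SET):.0%}")
```

Crude substring matching obviously misses paraphrases, but even a rubric this blunt gives you a number you can watch move as you change chunking, retrieval, or prompts.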
Embeddings also seem to be a bit of a dark art: every tutorial uses some small toy example, and I haven't seen much work comparing the performance of particular embedding models on domain-specific tasks.
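That said, nothing stops you from measuring it on your own data. A rough sketch of a recall@k comparison, assuming sentence-transformers and a small set of (query, relevant chunk) pairs labeled with help from domain experts; the model names are just examples:

```python
# Embedding comparison sketch: for each candidate model, embed the corpus and
# the queries, then check how often the known-relevant chunk shows up in the
# top-k results by cosine similarity (recall@k).
import numpy as np
from sentence_transformers import SentenceTransformer

CANDIDATE_MODELS = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]  # example model names

def recall_at_k(model_name, chunks, labeled_queries, k=5):
    """chunks: list of text chunks; labeled_queries: list of (query, index of relevant chunk)."""
    model = SentenceTransformer(model_name)
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    hits = 0
    for query, relevant_idx in labeled_queries:
        q_vec = model.encode([query], normalize_embeddings=True)[0]
        top_k = np.argsort(chunk_vecs @ q_vec)[-k:]  # cosine similarity; vectors are normalized
        if relevant_idx in top_k:
            hits += 1
    return hits / len(labeled_queries)

# usage: run it over your own corpus and labeled queries
# for name in CANDIDATE_MODELS:
#     print(name, recall_at_k(name, chunks, labeled_queries))
```

Even a few dozen labeled pairs can be enough to see whether a bigger or domain-tuned embedding model actually buys you anything on your corpus.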
This was our experience trying to deploy a knowledge base for our Org.
You have to get the domain experts to help you build evals, and you need a good pipeline for testing the LLM against them as you make changes. We were never able to get there before the project was killed. Our use case was potentially giving career-altering legal advice, and we only made it to roughly 80% accuracy on our very informal eval. The domain experts wanted nothing to do with actually helping build the tool. Their idea of "testing" was asking 3 softball questions and saying "yeah, it's good to go".
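For what it's worth, the "good pipeline for testing" part doesn't need to be exotic. A sketch of the kind of regression check I mean (pytest-style; `run_eval` and the 0.80 floor are placeholders, not recommendations):

```python
# Regression-check sketch (pytest): run the expert-written eval set on every
# change and fail if accuracy drops below an agreed floor. run_eval() is a
# placeholder for whatever scoring you settle on; 0.80 is an example threshold,
# not a recommendation.

ACCURACY_FLOOR = 0.80

def run_eval() -> float:
    """Placeholder: run each golden question through the pipeline, return accuracy."""
    raise NotImplementedError

def test_accuracy_does_not_regress():
    assert run_eval() >= ACCURACY_FLOOR
```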
I think on a personal level you could probably get a usable tool that works well enough most of the time. But for anything going to production where people actually depend on it, this isn't an easy problem to solve. That said, I do think it's doable.
So everybody is using roughly the same method with some tweaks here and there, and thus getting similar-quality results.