Mildly surprised to see no mention of my top 2 LLM fails:
1) you’re sampling a distribution; if you only sample once, your sample is not representative of the distribution.
For evaluating prompts and running in production, your hallucination rate is inversely proportional to the number of times you sample.
Sample many times and vote is a highly effective (but slow) strategy.
There is almost zero value in evaluating a prompt by only running it once (a minimal sample-and-vote sketch in code follows at the end of this comment).
2) Sequences are generated in order.
Asking an LLM to make a decision and justify its decision in that order is literally meaningless.
Once the “decision” tokens are generated, the justification does not influence them. It’s not like they happen “all at once”; there is a specific sequence to generating output, and later output cannot magically influence output that has already been generated.
This is true for sequential outputs from an LLM (obviously), but it is also true inside single outputs. The sequence of tokens in the output is a sequence.
If you’re generating structured output (eg json, xml) which is not sequenced, and your output is something like {decision: …, reason:…} it literally does nothing.
…but, it is valuable to “show the working out” when, as above, you then evaluate multiple solutions to a single request and pick the best one(s).
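For point 1, here is a minimal sketch of the sample-and-vote idea; `call_llm` stands in for whatever function wraps your actual model call, it's not any particular API:

```python
from collections import Counter
from typing import Callable

def sample_and_vote(call_llm: Callable[[str], str], prompt: str, n: int = 10) -> str:
    """Sample the same prompt n times and keep the most common answer.

    Slower and pricier than a single call, but a one-off hallucination
    gets outvoted as long as the model answers correctly more often
    than it answers incorrectly."""
    answers = [call_llm(prompt).strip() for _ in range(n)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```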
> Sample many times and vote is a highly effective (but slow) strategy.
Beam search[1] has long been a great way to sample from language models, even before transformers. Essentially you keep track of the top N most promising partial sequences ("beams") and sample continuations from those.
OpenAI doesn't offer beam search yet, just temperature and top_p, but I hope they add support for it because it's far more efficient than just starting over each time.
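For open models run locally, the Hugging Face transformers `generate` API does expose beam search; a rough sketch (gpt2 is just a stand-in model here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    num_beams=5,             # keep the 5 most promising partial sequences
    num_return_sequences=5,  # return every surviving beam, not just the best
    early_stopping=True,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```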
There are a few subtle misconceptions being spread here:
1) Hallucination rate is not inversely proportional to the number of samples unless you assume statistical independence. As you’re sampling from the same generative process each time, any inherent bias of the LLM could affect every sample (e.g. see Golden Gate Claude). Naively calculating the hallucination rate as P^N is going to be a massive underestimate of the true error rate for many tasks requiring factual accuracy.
2) You’re right that output tokens are generated autoregressively, but you are thinking like a human. Transformer attention layers are permutation invariant. The ordering of output (e.g. decision first, then justification later) is inconsequential; either can be derived from the input context and hidden state wherever there is no causal masking of attention.
Justification before decision still works out better in practice, though, because of chain of thought [1]. You'll tend to get more accurate and better-justified decisions.
With decision before justification, you tend to have a greater risk of the output being a wrong decision followed by convincing BS justifying it.
(edit: Another way you could think of it is, LLMs still can't violate causality. Attention heads' ability to look in both directions with respect to a particular token's position in the sequence does not enable them to see into the future and observe tokens that don't exist yet.)
I totally agree; that's what I had to do with my patchbot that evaluates haproxy patches to be backported ( https://github.com/haproxy/haproxy/tree/master/dev/patchbot/ ). Originally it would just provide a verdict and then justify it, and it worked extremely poorly, often with a justification that directly contradicted the verdict. I swapped that around, asking for the analysis first and the final verdict last, and now the success rate is totally amazing (particularly with Mistral, which remains unbeatable at this task because it follows instructions extremely well).
I either don't correctly understand your second point, or it seems to fly in the face of a lot of proven techniques. Chain-of-thought, ReAct, and decision transformers all show that the order of an LLM's output matters, because the tokens the LLM emits before the "answer" can nudge the model to sample from a higher-quality part of the distribution for the remainder of its output.
> There is almost zero value in evaluating a prompt by only running it once.
To the user
But these tools are marketed as if you only need to run them once to get a good result; the companies behind them would really prefer you stop hammering the button that deletes their money.
As an aside:
> For evaluating prompts and running in production; your hallucination rate is inversely proportional to the number of times you sample.
This isn't really true, and it requires you to fuzz the prompt itself for best effect, making the "spam the LLM with requests" problem much worse.
You don't need to hit a LLM multiple times to get multiple distributions, just provide a list of perspectives and ask the model to answer the question from each of them in turn, then combine the results right there in the prompt. I have tested this approach a bunch, it works.
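A minimal sketch of that single-call, multi-perspective prompt; the personas and the `call_llm` helper are purely illustrative:

```python
perspectives = ["a skeptical security reviewer", "a performance engineer", "a brand-new user"]
question = "Should we cache API responses on the client?"

prompt = f"Question: {question}\n\n"
for p in perspectives:
    prompt += f"Answer the question from the point of view of {p}.\n"
prompt += "\nFinally, combine the perspectives above into a single recommendation."

# answer = call_llm(prompt)  # one request, several 'viewpoints' sampled inside it
```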
> You don't need to hit a LLM multiple times to get multiple distributions
This isn't correct.
You're just sampling a different distribution.
You can adjust the shape of the distribution with your prompt; certainly... and if you make a good prompt, perhaps, you can narrow the 'solution space' that you sample into.
...but, you're still sampling randomly from a distribution, and the N'th token relies on the (N-1)'th token as an input; that means a random deviation toward a bad solution early on incrementally steers the rest of the output toward a bad solution, regardless of your prompt.
...
Consider the prompt "Your name is Pete. What is your name?"
Seems like a fairly narrow distribution right?
However, there's a small chance that the first generated token is 'D'; it's small, but non-zero. That means it happens from time to time. The higher the temperature, the higher the randomization of the output tokens.
How do you imagine that completion runs when it happens? Doug? Dane? Daniel? Dave? Don't know? I'll tell you what it is not: it's not Pete.
That's the issue here; when you sample, the solution space is wide, and any single sample has a P chance of being a stupid hallucination.
When you sample multiple times, the chance that every sample produces that hallucination is P * P * P * P, and so on, once for each time you sample.
You can therefore control your error rate this way, because you can estimate the chance that all samples fail as P^N (assuming the samples are roughly independent).
Yes, obviously, if your P(good answer) < P(bad answer) it has the opposite effect.
...but no, sampling once does not save you from this problem, no matter what your prompt is or how good it is.
Furthermore, when you're evaluating prompts, sampling only once means you have no way of knowing whether it was a good prompt or not. Whereas if you sample, say, 10 times, you can see plainly from the outputs (eg. Pete, Pete, Pete, Pete, Potato, Pete, Pete <--- ) what the prompt is doing.
You can measure the error rate in your prompts this way.
If you don't, honestly, you really have no idea if your prompts are any good at all. You're just guessing.
People who run a prompt, tweak it, run it, tweak it, run it, tweak it, etc. are observing random noise, not doing prompt engineering.
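As a sketch, measuring a prompt this way can be as simple as the following; `call_llm` is a hypothetical wrapper around your model, `is_correct` is whatever check fits your task, and the P^N arithmetic treats the samples as roughly independent:

```python
def empirical_error_rate(call_llm, prompt: str, is_correct, n: int = 10) -> float:
    """Run the same prompt n times and count the failures.

    One run tells you almost nothing; n runs give a rough estimate of
    P(bad answer), which is what the Pete/Potato example above is doing."""
    failures = sum(0 if is_correct(call_llm(prompt)) else 1 for _ in range(n))
    return failures / n

# e.g. empirical_error_rate(call_llm,
#                           "Your name is Pete. What is your name?",
#                           lambda out: "Pete" in out)
```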
I suggest you spend 20 hours evaluating the results of 10 prompts vs 1 prompt with multiple perspectives to learn the truth about the matter, rather than playing armchair expert.
Edit in response to your wall of text: I have *extensively* tested the results of multi-shot prompting vs repeated single shot prompting, and the differences between them are not material to the outcome of "averaging" results, or selecting the best result. You can theorize all you want, but the real world would like a word.
That's an early step that matters more when you're hitting a chat endpoint with a hidden temperature setting. Once you get a prompt dialed in, you usually want to lower the temperature to the minimum value that still produces the desired results.
I will say though, using temperature 0 without understanding it (or worse, testing at temp > 0 and then setting temp to 0 for production, which I literally had to stop someone I know and respect as a developer from doing), and not understanding what top_k and top_p do (but using them anyway), is my #3 for LLM fails.
/shrug
...but yes, as you say, in a trivial case like binary decision making, a zero or very low temperature can reduce the need to sample multiple times; and as you say, when it's deterministic, sampling multiple times doesn't help at all.
What are some good metrics to evaluate LLM output performance in general? Or is it too hard to quantify at this stage (or not understood well enough)? Perhaps the latter, or else those metrics could be in the loss function itself.
We allude to point 2 when talking about putting explanations first, but I totally agree. One minor comment: explanations generated after the answer can sometimes be useful for understanding how the model came to a particular generation during post-hoc evals.
Point 1 is also a good callout. I added something on this for llm judge but it’s relevant more broadly.
If you set temperature to zero, the output will always be the same, not a sample from a distribution. If instead you increase the temperature, the LLM will sometimes choose tokens other than the one with the highest score, but the output won’t be that much different.
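A toy numpy sketch of what temperature does to the token scores (the logits are made up; a real model produces them over a whole vocabulary):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float) -> int:
    """At temperature 0 this collapses to argmax, so decoding is effectively
    deterministic; higher temperatures flatten the distribution and let
    lower-scored tokens through more often."""
    if temperature == 0:
        return int(np.argmax(logits))            # greedy: always the same token
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

logits = np.array([4.0, 3.5, 1.0])               # made-up scores for 3 tokens
print([sample_token(logits, 0) for _ in range(5)])    # [0, 0, 0, 0, 0]
print([sample_token(logits, 1.0) for _ in range(5)])  # mostly 0, sometimes 1
```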
>> If you’re generating structured output (eg json, xml) which is not sequenced, and your output is something like {decision: …, reason:…} it literally does nothing.
The core issue the parent is talking about is that the decision tokens should build on the reasoning tokens, rather than the reasoning tokens being generated to fit already-emitted decision tokens. RAG just provides the context the LLM should reason about.
Pretty good. Despite my high scepticism of the technology I have spent the last year working with LLMs myself. I would add a few things.
The LLM is like another user. And it can surprise you just like a user can. All the things you've done over the years to sanitize user input apply to LLM responses.
There is power beyond the conversational aspects of LLMs. Always ask, do you need to pass the actual text back to your user or can you leverage the LLM and constrain what you return?
LLMs are the best tool we've ever had for understanding user intent. They obsolete the hierarchies of decision trees and spaghetti logic we've written for years to classify user input into discrete tasks (realizing this and throwing away so much code has been the joy of the last year of my work).
Being concise is key and these things suck at it.
If you leave a user alone with the LLM, some users will break it. No matter what you do.
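A minimal sketch of the "treat the model like untrusted input, constrain what you return" idea; the intent list, templates, and `call_llm` helper are assumptions for illustration:

```python
ALLOWED_INTENTS = {"check_order_status", "cancel_order", "talk_to_human"}

TEMPLATES = {
    "check_order_status": "Here is the latest status of your order: ...",
    "cancel_order": "I've started the cancellation process for you.",
    "talk_to_human": "Connecting you with a support agent now.",
}

def route(user_message: str, call_llm) -> str:
    prompt = (
        "Classify the user's intent as exactly one of: "
        + ", ".join(sorted(ALLOWED_INTENTS))
        + f"\nUser message: {user_message}\nIntent:"
    )
    intent = call_llm(prompt).strip()
    # Validate the model's output like you would user input before acting on it.
    if intent not in ALLOWED_INTENTS:
        intent = "talk_to_human"
    # Return a template, not the raw model text.
    return TEMPLATES[intent]
```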
This has been a really interesting read. Agreed that if you leave a user alone with the LLM, someone will break it. Hence we chose to use a large number of templates wherever suitable, rather than giving the LLM free rein to respond.
In my opinion, using templates can help keep responses reliable. But it can also make interactions feel robotic, diminishing the "wow" factor of LLMs. There might be better options out there that we haven't found yet.
Absolutely. This is a huge trade-off. The constraints you place on the model output are all about how much your app and user experience can tolerate bad LLM behavior.
>The LLM is like another user. And it can surprise you just like a user can. All the things you've done over the years to sanitize user input apply to LLM responses.
I really like this analogy! That sums up my experiences with LLMs as well.
Hello, this is Hamel, one of the authors (among a list of other amazing authors). Happy to answer any questions, and to tag in any of my colleagues as well!
(Note: this is only Part 1 of 3 of a series that has already been written and the other 2 parts will be released shortly)
I would like to know your opinion about GraphRAG and ontologies. Knowledge graphs (KGs) are a game changer for companies with a lot of unstructured data when applied together with LLMs.
This is a fantastic article, and I've already shared it with a few colleagues. I'm wondering if you prefer to provide few-shot examples in a single "message" or in a simulated back-and-forth conversation between the user and the assistant?
It was a great read, aligned with many of the thought processes from our own tinkering in breaking down tasks for LLMs. I eagerly look forward to the next 2 parts; this one has been educational.
I feel like an insane person every time I look at the LLM development space and see what the state of the art is.
If I'm understanding this correctly, the standard way to get structured output seems to be to retry the query until the stochastic language model produces expected output. RAG also seems like a hilariously thin wrapper over traditional search systems, and it still might hallucinate in that tiny distance between the search result and the user. Like we're talking about writing sentences and coaching what amounts to an auto complete system to magically give us something we want. How is this industry getting hundreds of billions of dollars in investment?
Also the error rate is about 5-10% according to this article. That's pretty bad!
> [...] the standard way to get structured output seems to be to retry the query until the stochastic language model produces expected output.
No, that would be very inefficient. At each token-generation step, the LLM provides a likelihood for every token in the vocabulary based on the past context. The structured output is defined by a grammar, which determines the legal tokens for the next step. You can then take the intersection of the two (ignore any token not allowed by the grammar) and select among the authorized tokens based on the LLM's likelihoods in the usual way. So it's a direct constraint, and it's efficient.
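As a sketch of that masking step (real constrained-decoding libraries track grammar state across steps; this shows a single step with made-up inputs):

```python
import numpy as np

def constrained_next_token(logits: np.ndarray, allowed_token_ids: set) -> int:
    """Mask out every token the grammar forbids, then sample among the
    legal tokens using the model's own likelihoods; no retry loop needed."""
    masked = np.full_like(logits, -np.inf)
    idx = list(allowed_token_ids)  # e.g. only '{' is legal at the start of a JSON object
    masked[idx] = logits[idx]
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))
```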
> Also the error rate is about 5-10% according to this article. That's pretty bad!
Having 90-95% success rate on something that was previously impossible is acceptable. Without LLMs the success rate would be 0% for the things I'm doing.
I think the problem here is that that is often still not all that acceptable. Let's imagine a system with, say, 100 million users making 25 queries a day, just to give us some contrived numbers to examine. At a 10% error rate that's 250 million mistakes a day, or 75 million if we're generous and assume a 3% error rate. Then you have to think about your application: how easily you can detect issues, how much money you're willing to pay your ops staff (and how big you want to expand it), the cost of the mistakes themselves, and the legal and reputational costs of having an unreliable system. Take those costs, add them to the cost to run this system (probably considerable), and you're coming up on a heuristic for figuring out whether "possible" equates to "worth doing". 75 million times any dollar amount (plus 2.5 billion total queries you need to run the infrastructure for) is still a lot of capital. If each mistake costs you $0.20 (I made this number up), then maybe $5.5b a year is worth the cost? I'm not sure.
It's probable that Google is in the middle of doing this napkin math, given all the embarrassing stuff we saw last week. So it's cool that we're closer to solving these really hard problems, but whether the results are acceptable is a more complicated question than just "it used to not be possible." Maybe that math works out in your favor for your application.
Google is so terrified that someone is threatening their market position, the one in which they have over $100b in cash and get something like $20b in profit quarterly, that they're willing to shove this technology into some of the most important infrastructure on the internet so they can get fucksmith to tell everyone to put glue in their pizza sauce. I'll never understand how a company in maybe one of the most secure financial situations in all of human history has leadership that is this afraid.
Via APIs, yes. But if you have direct access to the model you can use libraries like https://github.com/guidance-ai/guidance to manipulate the output structure directly.
I've been building out an AI/LLM-based feature at work for a while now and, yeah, from my POV it's completely useless bullshit that only exists because our CTO is hyped by the technology and our investors need to see "AI" plastered somewhere on our marketing page, regardless of how useful it is in real use. Likewise with any of the other LLM products I've seen out in the wild as well, it's all just a hypewave being pushed by corps and clueless C-suites who hear other C-suites fawning over the tech.
It's so painful. We have funders come to us saying they love what we do, they want us to do more of it, they have $X million to invest, but only if we use "AI." Investors have their new favorite hammer and, by gosh, you better use it, even if you're trying to weld a pipe.
This is my fear regarding AI: it doesn't have to be as good as humans, it just has to be cheaper, and it will get implemented in business processes. Overall quality of service will degrade while profit margins increase.
The point was that for many tasks, AI has similar failure rates compared to humans while being significantly cheaper. The ability for human error rates to be reduced by spending even more money just isn't all that relevant.
Even if you had to implement checks and balances for AI systems, you'd still come away having spent way less money.
Upon loading the site, a chat bubble pops up and auto-plays a loud ding. Is the innovation of LLMs really a regression to 2000s spam sites? Can’t say I’m excited.
Surely step one is to carefully consider whether LLMs are the solution to your problem? That, to me, is the part where this is likely to go wrong for most people.
Do you mean as an additional step to determine whether the content the RAG wants to pull is actually relevant? Or as a filter of sorts as to what projects to work on?
> Thus, you may expect that effective prompting for Text-to-SQL should include structured schema definitions; indeed.
I found that the simpler the better when testing lots of different SQL schema formats on https://www.sqlai.ai/. CSV (table name, column name, data type) outperformed both a JSON-formatted schema and a raw SQL schema dump, not to mention it consumed fewer tokens.
If you need the database schema in a consistent format (e.g. CSV), just have the LLM extract the data and convert whatever the user provides into CSV. It shines at this.
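A small sketch of that kind of CSV schema serialization for the prompt; the input structure and the prompt wording are just assumptions:

```python
import csv
import io

def schema_as_csv(tables: dict) -> str:
    """Serialize {table: [(column, type), ...]} as compact CSV rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["table", "column", "type"])
    for table, columns in tables.items():
        for name, dtype in columns:
            writer.writerow([table, name, dtype])
    return buf.getvalue()

schema = {"orders": [("id", "integer"), ("user_id", "integer"), ("total", "numeric")],
          "users": [("id", "integer"), ("email", "text")]}
prompt = f"Schema (CSV):\n{schema_as_csv(schema)}\nWrite a SQL query that ..."
```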
Interesting, thanks for sharing! I was wondering about this. When starting out with feeding tabular data, I instinctively went with CSV, but always worried: what if there is a better choice? What if, with longer tables, the LLM forgets the column order?
That's exactly what happened to me when I tried to get the open-source models to extract a CSV from textual data with a lot of yes/no fields; i.e., the model forgot the column order and started confusing the cell values. I found I had to use more powerful models like Mistral Large or ChatGPT. So I think that is a valid thing to worry about with smaller models, but maybe less of a concern with larger ones.
One thing I am getting from this is that you need to be able to write prompts using well-structured English. That may be a challenge to a significant percentage of the population.
I am curious whether the authors tried building LLM applications in languages other than English, and what they learned while doing so.
An excellent post reminding me of the best O'Reilly articles from the past. Looking forward to parts 2 and 3.
Well, it's early in the morning and I have not had my coffee, yet. English being my second language doesn't help either :-) Probably best for me to wait until I wake up before writing a prompt.
One thing that wasn't mentioned that works pretty well: if you have a RAG process running async rather than in a REPL loop, you can retrieve documents and then perform a pass with another LLM to do summarization/extraction first. This saves input token costs for expensive LLMs and lets you cram more information into the context; you just have to deal with additional latency.
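A rough sketch of that extra pass; `retrieve`, `cheap_llm`, and `expensive_llm` are stand-ins for your own search function and model clients:

```python
def answer_with_condensed_context(question: str, retrieve, cheap_llm, expensive_llm) -> str:
    docs = retrieve(question, k=20)
    # The extra pass (and extra latency): compress each document down to the
    # parts relevant to this question before spending expensive input tokens.
    summaries = [
        cheap_llm(f"Extract only the facts relevant to: {question}\n\n{doc}")
        for doc in docs
    ]
    context = "\n---\n".join(summaries)
    return expensive_llm(f"Using only this context:\n{context}\n\nAnswer: {question}")
```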
This is excellent and matches with my experience, especially the part about prioritizing deterministic outputs. They are not as sexy as agentic chain of thought, but they actually work.
Comprehensive and practical write-up that aligns with most of my experiences.
One controversial point that has led to discussions in my team is this:
> A common anti-pattern/code smell in software is the “God Object,” where we have a single class or function that does everything. The same applies to prompts too.
In theory, a monolithic agent/prompt with infinite context size, a large toolset, and perfect attention would be ideal.
Multi-agent systems will always be less effective and more error-prone than monolithic systems on a given problem, because each agent has less context on the overall problem. Individual agents work best when they have entirely different functionalities.
As an OpenAI employee who has worked with dozens of API customers, I mostly agree with the article's tip to break up tasks into smaller, more reliable subtasks.
If each step of your task requires knowledge of the big picture, then yeah it ought to help to put all your context into a single API call.
But if you can decompose your task into relatively independent subtasks, then it helps to use a custom prompt/custom model for each of those steps. Extraneous context and complexity are just opportunities for the model to make mistakes, and the more you can strip those out, the better. 3 steps with 99% reliability are better than 1 step with 90% reliability.
Of course, it all depends on what you're trying to do.
I'd say single, big API calls are better when:
- Much of the information/substeps are interrelated
- You want immediate output for a user-facing app, without having to wait for intermediate steps
Multiple, sequenced API calls are better when:
- You can decompose the task into smaller steps, each of which do not require full context
- There's a tree or graph of steps, and you want to prune irrelevant branches as you proceed from the root
- You want to have some 100% reliable logic live outside of the LLM in parsing/routing code
- You want to customize the prompts based on results from previous steps
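A toy sketch of that kind of decomposition, with the routing living in plain code between two small calls; `call_llm` and the prompts are purely illustrative:

```python
def handle_ticket(ticket_text: str, call_llm) -> str:
    # Step 1: a small, focused classification prompt.
    category = call_llm(
        "Classify this support ticket as exactly one of: billing, bug, other.\n"
        f"Ticket: {ticket_text}\nCategory:"
    ).strip().lower()

    # Deterministic routing lives in code, not in the model.
    if category == "billing":
        prompt = f"Draft a reply about a billing issue. Ticket: {ticket_text}"
    elif category == "bug":
        prompt = f"Summarize this bug report for the engineering queue: {ticket_text}"
    else:
        return "Routed to a human agent."

    # Step 2: a second call that only sees the context this step needs.
    return call_llm(prompt)
```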
Smaller tasks also let you choose smaller models to work with, instead of waiting for a large model to respond (really not usable for customer-facing work).
I wonder how helpful “prompt unit tests” would be here?
Write the initial prompt, and write some tests to validate its output. Then, as the prompt grows, you can observe any decline in performance on the initial task and decide whether the new, larger prompt is worth that decline.
It might not work for all tasks, but a good candidate would be writing SQL queries from natural language.
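A rough sketch of what such a test could look like for the SQL example; `generate_sql` is a hypothetical wrapper around your prompt and model call, and each case runs several times because a single run tells you very little:

```python
import re

CASES = [
    ("How many users signed up last week?", r"SELECT\s+COUNT"),
    ("List the ten most recent orders.", r"ORDER\s+BY[\s\S]+LIMIT\s+10"),
]

def check_prompt(generate_sql, runs_per_case: int = 5) -> dict:
    """Report the pass rate per case so regressions show up as the prompt grows."""
    rates = {}
    for question, pattern in CASES:
        outputs = [generate_sql(question) for _ in range(runs_per_case)]
        hits = sum(bool(re.search(pattern, out, re.IGNORECASE)) for out in outputs)
        rates[question] = hits / runs_per_case
    return rates
```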
hey there, Hugo here and big fan of this work. Such a fan I'm actually doing a livestream podcast recording with all the authors here, if you're interested in hearing more from them: https://lu.ma/e8huz3s6?utm_source=hn
Can anyone recommend resources, preferably books, on this whole topic of building applications around LLMs? It feels like running after an accelerating train to hop on.
Thanks for sharing, I've followed these authors for a while and they're great.
Some notes from my own experience on LLMs for NLP problems:
1) The output schema is usually more impactful than the text part of a prompt.
a) Field order matters a lot. At inference, the earlier tokens generated influence the next tokens.
b) Just have the CoT as a field in the schema too.
c) PotentialField and ActualField allow the LLM to create some broad options and then select the best. This somewhat mitigates the fact that they can't backtrack. If you have human evaluation in your process, this also makes it easier for reviewers to correct mistakes.
d) Most well-defined problems should be possible zero-shot on a frontier model. Before rushing off to add examples, really check that you're solving the correct problem in the most ideal way.
2) Defining the schema as TypeScript types is flexible and reliable and takes up minimal tokens (a rough sketch follows this list). The output JSON structure is pretty much always correct (as long as it fits in the context window); the only issue is that the language model can pick values outside the schema, but that's easy to validate in post.
3) "Evaluating LLMs can be a minefield." yeah it's a pain in the ass.
4) Adding too many examples increases the token costs per item a lot. I've found that it's possible to process several items in one prompt and, despite it being seemingly silly and inefficient, it works reliably and cheaply.
5) Example selection is not trivial and can cause very subtle errors.
6) Structuring your inputs with XML is very good. Even if you're trying to get JSON output, XML input seems to work better. (Haven't extensively tested this because eval is hard).
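A rough sketch pulling 1b, 1c, and 2 together: the schema is passed as a TypeScript type inside the prompt text, the chain-of-thought and candidate options come before the final answer, and out-of-schema values are caught in post. `call_llm` and all the field names are illustrative:

```python
import json

OUTPUT_SCHEMA = """
type Result = {
  chainOfThought: string;       // reason first...
  potentialLabels: string[];    // ...broad options next...
  label: "positive" | "negative" | "neutral";  // ...final pick last
};
"""

def classify(text: str, call_llm) -> dict:
    prompt = (
        "Classify the sentiment of the text. "
        f"Respond with JSON matching this TypeScript type:\n{OUTPUT_SCHEMA}\n"
        f"Text: {text}"
    )
    result = json.loads(call_llm(prompt))
    # The model can still pick values outside the schema, so validate in post.
    assert result["label"] in {"positive", "negative", "neutral"}
    return result
```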
Love the idea of adding CoT as a field in the expected structured output as it also makes it easier from a UX perspective to show/hide internal vs external outputs.
> Structuring your inputs with XML is very good. Even if you're trying to get JSON output, XML input seems to work better. (Haven't extensively tested this because eval is hard).
Would be neat to see LLM-specific adapters that can be used to swap out different formats within the prompt.
"Ready to -dive- delve in?" is an amazingly hilarious reference. For those who don't know, LLMs (especially ChatGPT) use the word delve significantly more often than human created content. It's a primary tell-tale sign that someone used an LLM to write the text. Keep an eye out for delving, and you'll see it everywhere.
Fantastic advice. While reading the article I kept running across advice I had seen before or figured out myself, then forgot about. I am going to summarize this article and add the summary to my own Apple Notes (there are better tools, but I just use Apple Notes to act as a pile-of-text for research notes).