Isn’t it basically traditional search (keyword based, vector based, or a combination of both; embeddings have been around for years), where you take the top N results (usually not even full docs, but chunks, due to context length limitations) and pass them to an LLM to regurgitate a response (hopefully without hallucinations), instead of simply listing the results right away? I think some implementations also ask the LLM to rewrite the user query to “capture the user intent”. What am I missing here? What makes it so useful?
One example is in finance: you have a lot of 45-page PDFs lying around and you're pretty sure one of them has the reg, or the info, you need. You aren't sure which, so you open them one by one and search for a word, then jump through a bunch of those results and decide it's not this PDF. You do that till you find the "one". There is a non-trivial number of executive-level jobs that pretty much do this for half of their work week.
This is true for traditional full-text document search as well.
When most people mention RAG, they’re using a vector store to surface results that are semantically similar to the user’s query (the retrieval part). They then pass these results to an LLM for summary (the generation part).
In practice, the problems with RAG are similar to the traditional problems of search: indices, latency, and correctness.
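For concreteness, here's a minimal sketch of that retrieve-then-generate loop. It assumes sentence-transformers for the embeddings; `call_llm` is a placeholder for whatever completion API you use, and the chunking is left to you:

```python
# Minimal RAG sketch: embed chunks, retrieve by cosine similarity, generate.
import numpy as np
from sentence_transformers import SentenceTransformer  # any embedding model works

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks):
    """Embed every chunk once, up front."""
    return np.asarray(model.encode(chunks, normalize_embeddings=True))

def retrieve(query, chunks, vecs, k=5):
    """Top-k chunks by cosine similarity (vectors are already normalized)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(-(vecs @ q))[:k]
    return [chunks[i] for i in top]

def answer(query, chunks, vecs, call_llm):
    """The 'generation' part: stuff the retrieved chunks into the prompt."""
    context = "\n\n".join(retrieve(query, chunks, vecs))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)  # placeholder: your LLM call of choice
```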
Doesn't vector search solve a lot of these problems? These AI vector spaces seem like a really easy win here, and they're reasonably lightweight compared to a full LLM.
* Latency
I don't want to call this a solved problem, but it is one that scales horizontally very easily and that a lot of existing tech can take advantage of.
* Correctness
The LLM tooling doesn't necessarily need to make things worse here, although if poorly designed it definitely could. AI can do a first pass at fact checking, even though I suspect we'll need humans in the loop for a long while.
---
I think that vector spaces at least bring some big advantages for indexing here, making it possible to search for more abstract concepts.
> Doesn't vector search solve a lot of these problems? These AI vector spaces seem like a really easy win here, and they're reasonably lightweight compared to a full LLM.
Yes and no. What do you vectorize? The whole document? The whole page? The whole paragraph? How you split your data, and then index into it, is still problem-space dependent.
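To make that concrete, here are two naive splitting strategies as a sketch; which one works better is still entirely dependent on your documents:

```python
# Two naive chunking strategies; the right granularity is problem-space dependent.

def chunk_by_paragraph(text):
    """One chunk per paragraph -- works when paragraphs are self-contained."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_fixed_window(text, size=800, overlap=200):
    """Fixed-size character windows with overlap -- works when structure is messy."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```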
* Latency
> I don't want to call this a solved problem, but it is one that scales horizontally very easily and that a lot of existing tech can take advantage of.
Any time you add steps, you increase latency. This is similar to traditional search, where you e.g. need to fetch relevant data and then score it based on some user-specific metric. Every lookup adds latency (rough timing sketch after these bullets). The same is true for RAG.
* Correctness
> The LLM tooling doesn't necessarily need to make things worse here, although if poorly designed it definitely could. AI can do a first pass at fact checking, even though I suspect we'll need humans in the loop for a long while.
Again, this comes back to how you index your data and what results are returned, similar to traditional search. This is problem-space dependent. Plus, we haven't solved LLM hallucinations -- there are strategies to mitigate them, but no clear-cut solution.
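On the latency point above, a rough way to see where the time goes is to instrument each stage separately; `embed`, `search`, `build_prompt`, and `call_llm` below are placeholders for whatever your pipeline actually uses:

```python
# Time each stage of the pipeline; every added step makes its own contribution.
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record wall-clock time per pipeline stage."""
    t0 = time.perf_counter()
    yield
    timings[label] = time.perf_counter() - t0

def answer_with_timings(query, embed, search, build_prompt, call_llm):
    timings = {}
    with timed("embed_query", timings):
        qvec = embed(query)                          # placeholder embedding call
    with timed("vector_search", timings):
        hits = search(qvec, k=5)                     # placeholder index lookup
    with timed("llm_generate", timings):
        reply = call_llm(build_prompt(query, hits))  # placeholder LLM call
    return reply, timings                            # total latency = sum of stages
```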
Any tips on effectively getting financial data out of PDFs into a RAG system (especially data contained in tables)? And locally, not via proprietary cloud PDF parsing thingy. That's the current nut I'm trying to crack.
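One local approach worth trying (a sketch, assuming pdfplumber, which runs entirely offline): pull the tables out page by page and flatten each row into a text chunk you can index alongside the prose. Table detection is the fragile part, so expect to tweak settings per document family.

```python
# Local PDF text + table extraction sketch using pdfplumber (no cloud service).
import pdfplumber

def pdf_to_chunks(path):
    chunks = []
    with pdfplumber.open(path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            # Plain page text.
            text = page.extract_text() or ""
            if text.strip():
                chunks.append(f"[{path} p.{page_no}] {text}")
            # Tables come back as lists of rows; flatten rows to pipe-separated
            # lines so the column structure survives into the index.
            for table in page.extract_tables():
                rows = [" | ".join(cell or "" for cell in row) for row in table]
                chunks.append(f"[{path} p.{page_no} table]\n" + "\n".join(rows))
    return chunks
```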
RAG is not just traditional search. It's any external data that can be fed to the LLM to augment its generation.
The most useful and verifiable RAG setup I've seen is hooking up an RDBMS and an LLM, and asking questions in English to retrieve table data. You can do it in several steps (a code sketch follows the list).
1. Extract the metadata of the tables, e.g. table names, the columns of each table, relationships between tables, indexed columns, etc. This is your RAG data.
2. Build the RAG context with the metadata, i.e. listing each table, its columns, relationships, etc.
3. Feed the RAG context and the user's question to the LLM. Tell the LLM to generate a SQL query for the question given the RAG context.
4. Run the SQL query on the database.
It's uncannily good. And it can be easily verified given the SQL.
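A minimal sketch of those steps against SQLite (the catalog query is SQLite-specific, `call_llm` is a placeholder, and `finance.db` is just a hypothetical file):

```python
# Schema-as-context ("English question -> SQL") sketch against SQLite.
import sqlite3

def schema_context(conn):
    """Steps 1-2: pull table DDL out of the catalog; this is the RAG context."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n\n".join(r[0] for r in rows)

def ask(conn, question, call_llm):
    """Steps 3-4: have the LLM write SQL for the question, then run it."""
    prompt = (
        "Given this database schema:\n"
        f"{schema_context(conn)}\n\n"
        f"Write a single SQL query (no explanation) that answers: {question}"
    )
    sql = call_llm(prompt)    # placeholder LLM call
    print(sql)                # the generated SQL is what you verify
    return conn.execute(sql).fetchall()

conn = sqlite3.connect("finance.db")  # hypothetical database
```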
Is that RAG though? Perhaps I’m missing something, but I don’t see where the retrieval step is. Extracting the metadata and passing it to the LLM in the context sounds like a non-RAG LLM application. Or are you saying that the DB schema is so big (and/or the LLM context too small) that not all the metadata can be passed in one go, and there’s some search step to prune the number of tables?
RAG is augmenting the LLM's generation with external data. How the external data is retrieved is irrelevant. A search is not necessary.
Of course, you can search the tables related to the question to narrow down the table list and help the LLM come up with the correct answer.
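If the schema is too big for one prompt, that narrowing step can be a plain similarity ranking over the table metadata; `embed` here stands in for whatever embedding function you already have:

```python
# Keep only the tables whose metadata looks most similar to the question.
import numpy as np

def relevant_tables(question, table_descriptions, embed, k=5):
    """table_descriptions: {table_name: "columns, relationships, ..."}"""
    names = list(table_descriptions)
    vecs = np.asarray([embed(table_descriptions[n]) for n in names])
    q = np.asarray(embed(question))
    scores = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return [names[i] for i in np.argsort(-scores)[:k]]  # only these go in the prompt
```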
That's exactly what it is, and it's useful because, when it works, it means you can ask a question and get an answer to your question, rather than having to read the documents and then answer that question yourself.
It also lets a language model answer questions while citing a source, something it fundamentally cannot do on its own.
Everyone talks about "reducing hallucinations", but from a system perspective, everything an LLM emits is equally hallucinated.
Putting the relevant data in context gets around this and provides actual provenance of information, something that is absolutely required for real "knowledge" and which we often take for granted in practice.
Of course, the ability to do so is entirely reliant on the retrieval's search quality. Tradeoffs abound. But with enough clever tricks it does seem possible to take advantage of both the LLM's broad but unsubstantiated content and specific, sourced fact claims.
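One common trick for that provenance point: number the retrieved chunks in the prompt and ask the model to cite them, so each claim maps back to a source. A sketch, again with `call_llm` as a placeholder:

```python
def answer_with_citations(question, hits, call_llm):
    """hits: list of (source_id, text) pairs from the retrieval step."""
    numbered = "\n\n".join(
        f"[{i + 1}] ({src}) {text}" for i, (src, text) in enumerate(hits)
    )
    prompt = (
        "Answer using only the sources below. After every claim, cite the source "
        "number in brackets, e.g. [2]. Say 'not found' if the sources don't cover "
        "the question.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )
    return call_llm(prompt)  # citations map back to (source_id, text) pairs
```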
You just described RAG: augmenting an LLM with external memory. Perhaps the part you are skipping is that the LLM synthesizes the retrieved information with its own knowledge into one coherent whole.
It's abstractive (new) versus extractive (old) summarization.
What makes it useful is that it does the work of synthesizing the information. Imagine you ask a question that involves bits and pieces of numerous articles. In the past you had to read them all and mentally synthesize them.
I've used something like RAG for finding solutions to questions in Slack. I take the question, break it into searchable terms, search Slack, and get a haystack of results. Then I use an LLM to figure out whether the results are relevant. Finally, I take the top 10 results, summarize them, and link back to the Slack discussion.
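Roughly, that pipeline looks like the sketch below; `search_slack` and `call_llm` are placeholders for the Slack search call and the model call:

```python
# Question -> search terms -> Slack haystack -> relevance filter -> summary.

def answer_from_slack(question, search_slack, call_llm, top_n=10):
    # 1. Ask the model for search terms instead of searching the raw question.
    terms = call_llm(f"Give three short Slack search queries for: {question}")
    # 2. Gather a haystack of candidate messages: (permalink, text) pairs.
    hits = []
    for term in terms.splitlines():
        if term.strip():
            hits.extend(search_slack(term.strip()))
    # 3. Cheap relevance pass: keep messages the model says are on-topic.
    relevant = [
        (link, text) for link, text in hits
        if call_llm(f"Is this relevant to '{question}'? Answer yes or no.\n{text}")
           .strip().lower().startswith("yes")
    ]
    # 4. Summarize the top hits and keep the permalinks so you can link back.
    context = "\n\n".join(text for _, text in relevant[:top_n])
    summary = call_llm(f"Summarize what these messages say about: {question}\n\n{context}")
    return summary, [link for link, _ in relevant[:top_n]]
```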
The intent is usually not to simply regurgitate the results, but to augment the prompt with them to enable a better, focussed answer to the user question than either search or an LLM alone would provide.
The buzz is because it is really one of the most widely used new AI techniques, easily applicable to millions of businesses. Everyone has some large store of unstructured data they want to search through and ask questions about: legal docs, candidates, books, articles... At the same time it’s relatively straightforward to implement, so there are already tens or hundreds of startups / products pushing the RAG agenda (all those “it seems easy but it’s not!” pitches). Hopefully soon it will be added as a built-in LLM feature: the ability to upload your own data for the LLM to use. It also made many more developers aware of embeddings and vector search, which is great.
I'm still building my understanding in this space, but so far I've seen its value when using chains and graphs of agents.
The overall system suggests degrees of freedom in search that might not have been available before. This comes from having a knowledge store in a format (vectors) primed for search, then making it accessible, in full or in partitions, to agents working on one or more concurrent flows around a query.
I also see value in having a full circuit of native-format components that can be pieced together to make higher-order constructs. Agents are just the most recent one to emerge, and I can easily see a mixture of fine-tuned experts alongside stores of relevant material.
To me it feels like people are waking up to the fact that, with current access to software and hardware, you can now make your own search engine and answering tool based on the data you own.