Hacker News

Ultimately, the quality of OCR on PDFs is where we are bottlenecked as an industry. And not just in recognizing text characters, but in understanding and feeding the LLM the structured object relationships we see in tables and graphs. Intuitive for a human, very error-prone for RAG.
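One common workaround, once an OCR or HTML-extraction step has recovered a table's cell grid, is to serialize it as Markdown rather than flattening it to raw text, so the row/column relationships survive into the prompt. A minimal sketch (the table data and function name are invented for illustration):

```python
def table_to_markdown(header, rows):
    """Render an extracted table as a Markdown table so the LLM
    still sees which value belongs to which row and column."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)

# Invented example data standing in for OCR output:
md = table_to_markdown(["Region", "Q1", "Q2"],
                       [["EMEA", 1.2, 1.4], ["APAC", 0.9, 1.1]])
print(md)
```

The hard part, of course, is the step this sketch assumes away: getting a clean cell grid out of the PDF in the first place.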



That's a real issue, but it's masking some of the issues further downstream, like chunking and other context-related problems. There are some clever proposals to make this work, including some of the stuff from Anthropic and Jina. But as far as I can tell, these haven't been tested thoroughly, because everyone is hung up at the OCR step (as you identified).
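The Anthropic proposal ("contextual retrieval") boils down to prepending document-level context to every chunk before embedding it, so a chunk still makes sense when retrieved on its own. A minimal sketch, with the simplifying assumption that a single static summary is reused for every chunk (in the actual proposal, an LLM writes a chunk-specific context string):

```python
def chunk_with_context(doc_title, doc_summary, text, chunk_size=500):
    """Split text into fixed-size chunks and prepend document context
    to each one, so an isolated chunk still carries its provenance.
    Assumption: one static summary per document, not per-chunk LLM output."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [f"[Document: {doc_title}] {doc_summary}\n\n{c}" for c in chunks]

# Invented example document:
chunks = chunk_with_context("Annual Report 2023",
                            "Covers revenue, headcount, and risk factors.",
                            "Revenue grew 12% year over year. " * 40)
print(len(chunks))
```

Fixed-size character chunking is itself a naive choice here; sentence- or section-aware splitting is usually what the cleverer proposals layer on top.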


For my purposes, all of the data was also available in HTML format, so the OCR wasn't a problem. I think the issue is that the RAG pipeline doesn't take the entire corpus of knowledge into its context when making a response; instead it uses an index to find one or more documents that it believes are relevant, then uses that small subset as part of the input.

I'm not sure there's a way to get what a lot of people want RAG to be without actually training the model on all of your data, so you can "chat with it" similar to how you can ask ChatGPT about random facts from almost any publicly available information. But I'm not an expert.
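Mechanically, the retrieve-then-read step described above looks something like the sketch below, with a toy bag-of-words index standing in for a real embedding model (all names and documents are made up):

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline uses a neural model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented corpus; in practice this is your whole document store.
docs = {
    "policy.html": "employees accrue vacation days monthly",
    "menu.html": "the cafeteria serves lunch at noon",
}
index = {name: embed(text) for name, text in docs.items()}

def retrieve(query, k=1):
    """Rank documents by similarity and return only the top-k subset --
    this subset, not the whole corpus, is what reaches the model's context."""
    q = embed(query)
    ranked = sorted(index, key=lambda n: cosine(q, index[n]), reverse=True)
    return ranked[:k]

top = retrieve("how many vacation days do employees get")
prompt = f"Context: {docs[top[0]]}\n\nQuestion: how many vacation days?"
print(top)
```

Which is exactly why the "chat with all of it" expectation breaks down: anything the index fails to surface never reaches the model at all.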


I've also observed this issue, and I wonder where the industry is on it. There are a lot of claims that a given approach will work here, but not many demonstrably working use cases.



