Hacker News

Ultimately, the quality of OCR on PDFs is where we are bottlenecked as an industry. And not just in recognizing text characters, but in understanding and feeding the LLM the structured object relationships we see in tables and graphs. Intuitive for a human, very error-prone for RAG.
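One common workaround, once an OCR or HTML-extraction step has recovered a table's cell grid, is to serialize it as Markdown rather than flattening it to raw text, so the row/column relationships survive into the prompt. A minimal sketch (the table data and function name are invented for illustration):

```python
def table_to_markdown(header, rows):
    """Render an extracted table as a Markdown table so the LLM
    still sees which value belongs to which row and column."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(lines)

# Invented example data standing in for OCR output:
md = table_to_markdown(["Region", "Q1", "Q2"],
                       [["EMEA", 1.2, 1.4], ["APAC", 0.9, 1.1]])
print(md)
```

The hard part, of course, is the step this sketch assumes away: getting a clean cell grid out of the PDF in the first place.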



That's a real issue, but it's masking some of the issues further downstream, like chunking and other context-related problems. There are some clever proposals to make this work, including some of the stuff from Anthropic and Jina. But as far as I can tell, these haven't been tested thoroughly, because everyone is hung up at the OCR step (as you identified).
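The Anthropic proposal ("contextual retrieval") boils down to prepending document-level context to every chunk before embedding it, so a chunk still makes sense when retrieved on its own. A minimal sketch, with the simplifying assumption that a single static summary is reused for every chunk (in the actual proposal, an LLM writes a chunk-specific context string):

```python
def chunk_with_context(doc_title, doc_summary, text, chunk_size=500):
    """Split text into fixed-size chunks and prepend document context
    to each one, so an isolated chunk still carries its provenance.
    Assumption: one static summary per document, not per-chunk LLM output."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [f"[Document: {doc_title}] {doc_summary}\n\n{c}" for c in chunks]

# Invented example document:
chunks = chunk_with_context("Annual Report 2023",
                            "Covers revenue, headcount, and risk factors.",
                            "Revenue grew 12% year over year. " * 40)
print(len(chunks))
```

Fixed-size character chunking is itself a naive choice here; sentence- or section-aware splitting is usually what the cleverer proposals layer on top.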


For my purposes, all of the data was also available in HTML format, so the OCR wasn't a problem. I think the issue is that the RAG pipeline doesn't take the entire corpus of knowledge into its context when making a response; instead it uses an index to find one or more documents that it believes are relevant, then uses that small subset as part of the input.

I'm not sure there's a way to get what a lot of people want RAG to be without actually training the model on all of your data, so you can "chat with it" similar to how you can ask ChatGPT about random facts from almost any publicly available information. But I'm not an expert.
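Mechanically, the retrieve-then-read step described above looks something like the sketch below, with a toy bag-of-words index standing in for a real embedding model (all names and documents are made up):

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline uses a neural model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented corpus; in practice this is your whole document store.
docs = {
    "policy.html": "employees accrue vacation days monthly",
    "menu.html": "the cafeteria serves lunch at noon",
}
index = {name: embed(text) for name, text in docs.items()}

def retrieve(query, k=1):
    """Rank documents by similarity and return only the top-k subset --
    this subset, not the whole corpus, is what reaches the model's context."""
    q = embed(query)
    ranked = sorted(index, key=lambda n: cosine(q, index[n]), reverse=True)
    return ranked[:k]

top = retrieve("how many vacation days do employees get")
prompt = f"Context: {docs[top[0]]}\n\nQuestion: how many vacation days?"
print(top)
```

Which is exactly why the "chat with all of it" expectation breaks down: anything the index fails to surface never reaches the model at all.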


I've also observed this issue, and I wonder where the industry is on it. There are a lot of claims that a given approach will work here, but not many demonstrably working use cases.



