Show HN: Open-source Rule-based PDF parser for RAG

dmezzetti · on Jan 24, 2024

One additional library to add, if you're working with scientific papers: https://github.com/kermitt2/grobid. I use this with paperetl (https://github.com/neuml/paperetl).

dmezzetti · on Jan 24, 2024

Nice project! I've long used Tika for document parsing given its maturity and the wide range of formats it supports. The XHTML output helps with chunking documents for RAG.

Here are a couple of examples:

- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai

- https://neuml.hashnode.dev/extract-text-from-documents

Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).
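
To illustrate why XHTML output helps with chunking, here is a minimal sketch that turns each <p> element into one chunk. It uses only the Python standard library and assumes the parser output marks paragraphs with <p> tags; the sample markup and the names ParagraphChunker and chunk_xhtml are made up for illustration and are not part of Tika or txtai.

```python
from html.parser import HTMLParser


class ParagraphChunker(HTMLParser):
    """Collect the text of each <p> element as one chunk (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._buf = None  # None means we are currently outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # Collapse internal whitespace and skip empty paragraphs.
            text = " ".join("".join(self._buf).split())
            if text:
                self.chunks.append(text)
            self._buf = None


def chunk_xhtml(xhtml):
    parser = ParagraphChunker()
    parser.feed(xhtml)
    parser.close()
    return parser.chunks


# Made-up sample in the general shape of XHTML parser output.
sample = """<html><body><div class="page">
<p>First paragraph
spans two lines.</p>
<p>   </p>
<p>Second paragraph.</p>
</div></body></html>"""

chunks = chunk_xhtml(sample)
for chunk in chunks:
    print(chunk)
```

Real documents usually need more than this (headings, lists, tables), but paragraph boundaries are a reasonable starting point for chunk boundaries.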

mpeg · on Jan 24, 2024

Off-topic, but do you know how Tika compares to other PDF parsing libraries? I was very unimpressed by pdfminer.six (what unstructured uses), as the layout detection seems pretty basic: it fails to parse multi-column text, whereas MuPDF does it perfectly.

Currently I'm using a mix of MuPDF + AWS Textract (for tables, mostly) but I'd love to understand what other people are doing

jahewson · on Jan 24, 2024

Tika uses PDFBox under the hood, using its built-in text extractor (which is "ok"). If you're looking for table extraction specifically, check out Tabula (https://tabula.technology) which is also built on top of PDFBox and has some contributions from the same maintainers. PDFBox actually exposes a lower-level API for text extraction (I wrote it!) than the one Tabula uses, allowing you to roll your own extractor - but that's where dragons live, trust me :)

dmezzetti · on Jan 24, 2024

I don't have scientific metrics but I've found the quality much better than most. It does a pretty good job of pulling data from text and tables.

epaga · on Jan 24, 2024

This looks like it could be very helpful. The company I work for has a PDF comparison tool called "PDFC" which can read PDFs and runs comparisons of semantic differences. https://www.inetsoftware.de/products/pdf-content-comparer

Parsing PDFs can be quite the headache because the format is so complex. We support most of these features already but there are always so many edge cases that additional angles can be very helpful.

muzamil-ali · on Jan 25, 2024

You're absolutely right; parsing PDFs can be a real headache due to their inherent complexity. The format itself can vary in structure, layout, and embedded components, making it difficult to extract and compare information consistently. Even with robust tools like PDFC, edge cases can always emerge, requiring further refinements.

lmeyerov · on Jan 24, 2024

Tesseract OCR fallback sounds great!

There are now a lot of file loaders for RAG (LangChain, LlamaIndex, unstructured, ...). Are there any reasons, like a leading benchmark score, to prefer this one?

mpeg · on Jan 24, 2024

I couldn't try this tool as it doesn't build on apple silicon (and there's no ARM docker image)

However, I have a PDF parsing use-case that I tried those RAG tools for, but the output they give me is pretty low quality – it kinda works for RAG as the LLM can work around the issues but if you want to get higher quality responses with proper references and such I think the best way is to write your own rule-based parser which is what I ended up doing (based on MuPDF though, not Tika).

Maybe that's what the authors of this tool were thinking too.

asukla · on Jan 24, 2024

To run the docker image on apple silicon, you can use the following command to pull - it will be slower but works: docker pull --platform linux/x86_64 ghcr.io/nlmatics/nlm-ingestor:latest

mpeg · on Jan 24, 2024

Thanks, I always forget I can do that! I've given it a go and it's really impressive – the default chunker is very smart and manages to keep most of the chunk context together

The table parser in particular is really good. Is the trick that you draw some guide lines and rectangles around tables? I'm trying to understand the GraphicsStreamProcessor class as I'm not familiar with Tika, how does it know where to draw in the first place?

ramoz · on Jan 24, 2024

For me, PyMuPDF/fitz has been the best way to retain natural reading order and set dynamic enough rules to extract text in complex layouts.

None of the mentioned tools did this out of the box, none seemed easy to configure, and all are definitely hyped and marketed way beyond fitz though.
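
As a rough idea of what a reading-order rule can look like, here is a sketch for a simple two-column page. It assumes you already have text blocks with bounding boxes as (x0, y0, x1, y1, text) tuples; the coordinates below are invented, and the function reading_order is a hypothetical helper, not a fitz API. Full-width headings and three-column pages would need extra rules.

```python
def reading_order(blocks, page_width):
    """Order blocks left column first, then right column, top to bottom.

    blocks: iterable of (x0, y0, x1, y1, text) tuples (assumed input shape).
    """
    mid = page_width / 2

    def sort_key(block):
        x0, y0, x1, _y1, _text = block
        center = (x0 + x1) / 2
        column = 0 if center < mid else 1  # which half of the page the block sits in
        return (column, y0)

    return [block[4] for block in sorted(blocks, key=sort_key)]


# Invented blocks, deliberately out of order.
blocks = [
    (310, 50, 560, 100, "right top"),
    (40, 120, 290, 170, "left bottom"),
    (40, 50, 290, 100, "left top"),
    (310, 120, 560, 170, "right bottom"),
]

ordered = reading_order(blocks, page_width=600)
print(ordered)
```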

mpeg · on Jan 24, 2024

Same here, fitz is great, it does well enough out of the box that I can apply some simple heuristics for things like joining/splitting paragraphs where it makes a mistake and extract drawings and such and get pretty close to 100% accuracy on the output.

The only thing it doesn't do is table detection (neither does pdfminer.six), but there are plenty of other ways to handle them.
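
A simple heuristic for joining paragraphs might look something like the sketch below. This is not my actual code, just an illustration under stated assumptions: the input is a list of extracted text lines, a line that starts lowercase continues the previous one unless that one ended a sentence, and a trailing hyphen means a word was split across lines. The names join_lines and _continues are made up.

```python
SENTENCE_END = (".", "!", "?", ":")


def _continues(prev, line):
    """Guess whether `line` continues the paragraph that ends with `prev`."""
    if prev.endswith("-"):
        return True  # word hyphenated across the line break
    return not prev.endswith(SENTENCE_END) and line[0].islower()


def join_lines(lines):
    paragraphs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if paragraphs and _continues(paragraphs[-1], line):
            prev = paragraphs[-1]
            if prev.endswith("-"):
                paragraphs[-1] = prev[:-1] + line  # rejoin the split word
            else:
                paragraphs[-1] = prev + " " + line
        else:
            paragraphs.append(line)
    return paragraphs


lines = [
    "Parsing PDFs is hard be-",
    "cause the format is",
    "complex.",
    "Tables are harder.",
]
paragraphs = join_lines(lines)
print(paragraphs)
```

Note that this wrongly merges genuinely hyphenated compounds that happen to break at the hyphen, which is the kind of edge case you end up tuning per document set.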

rmsaksida · on Jan 24, 2024

Last time I tried Langchain (admittedly, that was ~6 months ago) the implementations for content extraction from PDFs and HTML files were very basic. Enough to get a prototype RAG solution going, but not enough to build anything reliable. This looks like a much more battle-tested implementation.

mistrial9 · on Jan 24, 2024

great effort and very interesting. However, I go to Github and I see "This organization has no public members" .. I do not know who you are at all, or what else might be part of this without disclosure.

Overall, I believe there has to be some middle ground for identification and trust building over time, between "hidden group with no names on $CORP secure site" and other traditional means of introduction and trust building.

thanks for posting this interesting and relevant work

asukla · on Jan 24, 2024

Thanks for the post. Please use this server with the llmsherpa LayoutPDFReader to get optimal chunks for your LLM/RAG project: https://github.com/nlmatics/llmsherpa. See examples and notebook in the repo.

firtoz · on Jan 24, 2024

Thank you for sharing. Are there some example input output pairs somewhere?

asukla · on Jan 24, 2024

You can use the library in conjunction with llmsherpa LayoutPDFReader.

Some examples are here with notebook: https://github.com/nlmatics/llmsherpa Here's another notebook from the repo with examples: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...

huqedato · on Jan 24, 2024

I tried to parse a few hundred PDFs with it. The results are pretty decent. If this was developed in Julia, it would be ten times faster (at least).

guidedlight · on Jan 24, 2024

How does this differ from Azure Document Intelligence, or are they effectively the same thing?

asukla · on Jan 24, 2024

No, we are not doing the same thing. Most cloud parsers use a vision model, so they are a lot slower and more expensive, and you need to write code on top of them to extract good chunks.

You can use the llmsherpa library (https://github.com/nlmatics/llmsherpa) with this server to get nice layout-friendly chunks for your LLM/RAG project.

ramoz · on Jan 24, 2024

There’s no ocr or ai involved here (other than the standard fallback).

What this library, and something like fitz/pymupdf, allow you to do is extract the text straight from the PDF, using rules about how to parse & structure it. (With most modern PDFs you can extract text without OCR.)

It’s obviously much cheaper, but it doesn’t scale well across dynamic layouts, so you’re likely using this when you can configure around a standard structure. That said, I have found rule-based text extraction to work fairly dynamically for things like scientific PDFs.

StrauXX · on Jan 24, 2024

Last I used it, Azure Document Intelligence wasn't all that smart about choosing split points. This seems to implement better heuristics.

asukla · on Jan 24, 2024

I wrote about split points and the need for including section hierarchy in this post: https://ambikasukla.substack.com/p/efficient-rag-with-docume...

All this is automated in the llmsherpa parser https://github.com/nlmatics/llmsherpa which you can use as an API over this library.

infecto · on Jan 24, 2024

What is a split point? I use Textract a lot and, in my testing, it always beats the open source tooling at extracting information. That could also be highly dependent on the document format.

batch12 · on Jan 24, 2024

I think it is a reference to the place a larger document is split into chunks for calculating embeddings and storage.
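
As a small illustration of what choosing split points can mean, here is a sketch that only splits at paragraph boundaries and prefixes each chunk with its section path. This is not how llmsherpa or Azure does it; the input shape (a list of (header_path, paragraphs) pairs), the max_chars limit, and the name chunk_sections are all assumptions made for the example.

```python
def chunk_sections(sections, max_chars=200):
    """Split each section at paragraph boundaries, keeping the header path.

    sections: list of (header_path, paragraphs), where header_path is a list
    of section titles from outermost to innermost (assumed input shape).
    """
    chunks = []
    for path, paragraphs in sections:
        prefix = " > ".join(path)
        current, size = [], 0
        for para in paragraphs:
            # A split point is only ever placed between two paragraphs.
            if current and size + len(para) > max_chars:
                chunks.append(prefix + "\n" + "\n".join(current))
                current, size = [], 0
            current.append(para)
            size += len(para)
        if current:
            chunks.append(prefix + "\n" + "\n".join(current))
    return chunks


sections = [
    (["Methods", "Data"], ["a" * 120, "b" * 120]),
    (["Results"], ["short"]),
]
chunks = chunk_sections(sections, max_chars=200)
for chunk in chunks:
    print(repr(chunk[:40]))
```

A naive splitter that cuts every N characters would break mid-sentence and lose the section context, which is the difference people tend to notice in retrieval quality.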

cdolan · on Jan 24, 2024

I am also curious about this. ADI is reliable but does have edge case issues on malformed PDFs.

I fear tesseract OCR is a potential limitation though. I’ve seen it make so many mistakes

jvdvegt · on Jan 24, 2024

Do you have any examples? There doesn't seem to be a single PDF file in the repo.

asukla · on Jan 24, 2024

You can see examples in the llmsherpa project: https://github.com/nlmatics/llmsherpa. This project, nlm-ingestor, provides the backend to work with llmsherpa. The llmsherpa library is very convenient to use for extracting nice chunks for your LLM/RAG project.

xfalcox · on Jan 24, 2024

We've been looking for something exactly like this, thanks for sharing!

ilaksh · on Jan 24, 2024

How does this compare to PaddleOCR?

Looks like Apache 2 license which is nice.

genewitch · on Jan 25, 2024

"Retrieval Augmented Generation"