I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->structured-text extractors (I built several in the past, including https://github.com/axa-group/Parsr).
For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), and it uses a mixture of heuristics and machine learning models to reconstruct the document.
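For anyone who wants to see the difference between the two modes, here is a rough sketch of calling the parser via the llama_parse client, based on the public examples (parameter names may differ slightly by client version, and the file path is a placeholder):

```python
# Rough sketch based on the public llama_parse examples; parameter names
# may differ by client version. "report.pdf" is a placeholder.
from llama_parse import LlamaParse

# "text" mode: raw character extraction, before document reconstruction.
raw_docs = LlamaParse(result_type="text").load_data("report.pdf")

# "markdown" mode: the reconstructed document (headings, tables, lists).
md_docs = LlamaParse(result_type="markdown").load_data("report.pdf")

print(raw_docs[0].text[:500])  # raw extraction
print(md_docs[0].text[:500])   # reconstructed markdown
```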
For context: I’m an engineering manager for the production systems of one of the biggest mortgage companies. Think millions upon millions of PDFs of all kinds.
1. The comparison with the open source pdf libraries is rather strange. They are self contained libraries. Not ML augmented services. Does this mean you plan to open source the underlying technology and offer it under the same licenses as pypdf or pymupdf?
2. How does this compare to AWS Kendra? I have one of the bigger deployments out there and am looking for alternatives.
3. How fast is the pdf extraction? In terms of microseconds given we pay for execution time.
Making something appear as if it might be an open source library, but is actually just a wrapper to a paid hosted service is endemic in the Python world in general and the LLM space in particular.
Feel free to try https://github.com/nlmatics/llmsherpa. It is fully open source - both client and server - and it is not ML augmented, so it's very fast and cheap to run.
It’s too expensive for the features it provides. The person who estimated the cost got it wrong and we're paying 3x. But I can’t just rip it out because it powers a very important piece of software.
If you're serious about this, I'm working on a new startup that is 100% focused on using different techniques to help with this. I'd love to talk to you about what your needs are, as it would help me have more data points as I take this to market. My email is in my profile if you want to shoot me a message.
(1) The "baseline" comparison was to PyPDF + Naive RAG. For the LlamaParse evaluation, you appear to have used a different RAG pipeline, called "recursive retrieval." Why not use the same pipeline to demonstrate the improvement from LlamaParse? Can you share the code to your evaluation for LlamaParse?
(2) I ran the benchmark for the PyPDF + Naive RAG solution, directly copying the code on the linked LlamaIndex repo [1]
I got very different numbers:
mean_correctness_score 3.941
mean_relevancy_score 0.826
mean_faithfulness_score 0.980
You reported:
mean_correctness_score 3.874
mean_relevancy_score 0.844
mean_faithfulness_score 0.667
Notably, the faithfulness score I measured for the baseline solution was actually higher than that reported for your proprietary LlamaParse based solution.
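For anyone who hasn't looked at it, the "naive" side of that baseline is essentially raw text extraction plus fixed-size chunking. A minimal sketch of what that looks like (the file name and chunk size are placeholders; the embedding and eval steps are omitted):

```python
# Minimal sketch of a PyPDF + naive chunking baseline.
# "report.pdf" and the chunk size are placeholders; embedding/eval omitted.
from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Fixed-size chunking: no awareness of tables, sections, or layout,
# so a table row can easily be split across two chunks.
chunk_size = 1024
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```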
Thanks for running through the benchmark! Just to clarify some things:
(1) The idea is that LlamaParse's markdown representation lends itself to the rest of LlamaIndex's advanced indexing/retrieval abstractions. Recursive retrieval is a fancy retrieval method designed to model documents with embedded objects, but it depends on good PDF parsing. Naive PyPDF parsing can't be used with recursive retrieval. Our goal is to demonstrate the e2e RAG capabilities of LlamaParse + advanced retrieval vs. what you can build with a naive PDF parser (rough sketch of the setup after this comment).
(2) Since we use LLM-based evals, your correctness and relevancy metrics look to be consistent and within margin of error (and lower than our LlamaParse metrics). The faithfulness score seems way off though, and quite high on your side, so I'm not sure what's going on there. Maybe hop into our Discord and share the results in our channel?
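Rough sketch of that LlamaParse + recursive retrieval setup, following the patterns in the public llama_parse / LlamaIndex examples (import paths and class names may vary by version; the file name and question are placeholders):

```python
# Sketch of LlamaParse markdown output feeding recursive retrieval, following
# the public llama_parse / LlamaIndex examples; import paths and class names
# may vary by version. "report.pdf" is a placeholder.
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser

# 1. Parse the PDF into markdown; tables come back as embedded elements.
documents = LlamaParse(result_type="markdown").load_data("report.pdf")

# 2. Split into text nodes plus "object" nodes (e.g. table summaries that
#    point back to the underlying table).
node_parser = MarkdownElementNodeParser(num_workers=4)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# 3. Index both; a query can recurse from a table summary into the full table.
index = VectorStoreIndex(nodes=base_nodes + objects)
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What was the total revenue in 2022?"))
```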
One of the things I've been helping a team with is dealing with mountains of ppt decks, converted to PDF, and then parsed/chunked/embedded into vector storage. It doesn't work that well because a ppt is not a document. What are your thoughts on dealing with other formats that are first converted to PDF?
I tried LlamaParse and was impressed by the result on a document with a complex layout. None of the open-source parsers gave me results that were even close. Can you please share how much time it took your team to build this parser?
Yes, however we will soon support other filetypes natively, and this will lead to better results (when converting from one format to another, there is often some information loss)
You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked against tesseract, but it outperforms it by a lot (benchmarks in repo). Happy to discuss.
How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.
PaddleOCR works pretty well, how are you planning to integrate it in your workflow? I found huge differences in throughput between python serving and frameworks (i.e. NVIDIA Triton Inference Server).
Performance depends on the language / type of docs. The main reason for contemplating switching is that EasyOCR seems to not be maintained anymore (no commits in the repo in the last 5 months).
> This is where LlamaParse comes in. We’ve developed a proprietary parsing service that is incredibly good at parsing PDFs with complex tables into a well-structured markdown format.
This is my problem with projects that start off as open source and become famous because of their community contributions and attention, then the project leaders get that sweet VC money (or not) and make something proprietary.
We've seen it with Langchain and several other "fake open source" projects.
I don't disagree but I think it's not a question of "why" but "how".
It could still be licensed in a restricted way, but keeping secret how it works is unfortunate - it breaks the chain of learning that is happening across the open ecosystem and, if the technique is any good, all it does is force open models to build an actually open equivalent so that further progress can be made (and if it's not really any good then it's snake oil, which is worse). Even if it's great it essentially becomes a dead end for the people who actually need and want an open model ecosystem.
We are planning to move our blog off of Medium (we've been busy!), but this post is public so you can actually just click through the nag screen if you see one.
> Is it really that much of a burden to host a mini blog?
The problem is not that it's a hassle to setup and maintain, but that discovery and network effects are a lot smaller when self-hosting. Being on Twitter and Medium gives you a lot more readers from the get-go, just because of the built-in discovery.
I still self-host my own stuff, but I would lie if I said I didn't understand why people use Twitter and Medium for hosting their content.
I would also like to know how it compares to any of the commercial offerings from Azure/AWS/GCP. They all have document parsing tools that I have found better than tools like Unstructured. Sure, you don't have some of the "magic" of segmenting text for vectorization and RAG, but IMO that's the easy part. The hard part is pulling data, forms, tables, and text out of the PDF, which I find the cloud tools do a superior job of.
Nothing beats Google for text extraction tasks in my testing, especially for East Asian languages. I wish something else worked better, because their services are fairly expensive.
I mostly have used Textract but have found all 3 to be fairly similar in accuracy, with differences in structure and compatible languages. I think my call-out here is that I don't think any of these libraries, LlamaIndex or Unstructured, can compete in this area. I would rather use GCP/Azure/AWS to define the structure from a PDF and these for the RAG portion, if anything.
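To make the comparison concrete, the cloud route to forms/tables is roughly one API call plus walking the response. A minimal Textract sketch with boto3 (bucket/object names are placeholders; multi-page PDFs go through the async start_document_analysis API instead):

```python
# Minimal sketch of extracting tables/forms with AWS Textract via boto3.
# Bucket/object names are placeholders. analyze_document handles single-page
# documents; multi-page PDFs use the async start_document_analysis API.
import boto3

textract = boto3.client("textract")
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-docs-bucket", "Name": "report.pdf"}},
    FeatureTypes=["TABLES", "FORMS"],
)

# The response is a flat list of blocks that must be walked to rebuild
# tables (CELL blocks) and key-value pairs.
cells = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
print(f"Detected {len(cells)} table cells")
```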
Hmm, I have to say I'm pretty unimpressed with my initial experience here.
1. The sign up with email just endlessly redirected, click link in email, ask to sign up with email, put in email, click link in email, etc.
2. Fine, I'll sign in with Google.
3. A PDF parser? Seriously that's what all this fuss is about? There are so many options already out there, PDFBox, iText, Unstructured, PyPDF, PDF.js, PdfMiner not to mention extraction services available from the hyperscalers. Super confused why anyone needs this.
LlamaIndex is way more than a PDF parser. It's the most widely used RAG tool chain, and their cloud looks to be a managed version of that.
Specific to the parser, they do show where tools like those you mentioned fail and their LLM based parser captures the full data the aforementioned miss.
Yeah, but their platform is basically a janky PDF parser which is why I don't understand what the hype is about.
It's easy to cherry pick a PDF for marketing purposes and claim you're better. I didn't miss it, I just don't believe marketing announcements at face value. I tried their parser on a PDF with a bit of complex formatting like multiple columns, tables and a couple images and it choked, spitting out one big markdown header with jumbled text. Not impressed.
To get good RAG performance you will need a good chunking strategy. Simply getting all the text is not good enough; knowing the boundaries of tables, lists, paragraphs, sections, etc. is helpful.
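As a tiny illustration of why boundaries matter, here is a sketch that splits a markdown-style document on heading boundaries so each section (and any table inside it) stays in one chunk, instead of being cut across by fixed-size chunks:

```python
# Sketch: structure-aware chunking of markdown on heading boundaries,
# so a section and the table inside it stay together in one chunk.
import re

def chunk_by_heading(markdown: str) -> list[str]:
    # Split right before lines that start a markdown heading ("#", "##", ...).
    parts = re.split(r"\n(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = """# Revenue
| Quarter | Revenue |
|---------|---------|
| Q1      | $1.2M   |
| Q2      | $1.5M   |

## Notes
Q2 includes a one-time licensing deal.
"""

for chunk in chunk_by_heading(doc):
    print("---")
    print(chunk)
```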
I think LlamaParse is trying to solve a hard problem. Many enterprise customers I know have strong need to parse PDF files and extract data accurately.
I found the interface a bit confusing. From your blog post, LlamaParse can extract numbers in tables, but it appears that the output isn't provided in tabular format. Instead, access to these numbers is only available through question answering. Is this accurate?
Question: why build this when you can use LLMs to extract the data in the most appropriate format to begin with? Isn't this a bit redundant? Perhaps it makes sense in the short term due to cost, but in the long run this problem can be solved generically with LLMs.
Sorry for the off-topic question: are there any LLM services I can use in the cloud, similar to OpenAI? I do not have a good enough MacBook to run different models locally.
Not OP, but in my case OpenAI does not want my money; they only accept credit cards. Netflix, for example, wants my money, so they offer more payment options.
Also, I would like to pay for an equivalent alternative that is less censored. For example, ChatGPT had a bug one day where it refused to tell me how to force a type cast in TypeScript and showed me a moderation error. So I want an AI that is targeted at adults, not at children in some religious school in the USA.
There is a somewhat unfiltered GPT-4 on Azure, but they really don't want anybody's money (AFAIK only "trusted" corporate entities can access it).
At this time, your only option is local models. If you don't have the hardware to run them yourself, there are plenty of hosts - Poe/Perplexity/Together etc.
Llama 3 is (hopefully) coming soon, and if it has improved as much as Llama 2 improved over Llama 1, and provides at least a 16k baseline context size, it will be in between GPT-3.5 and GPT-4 in terms of quality, which is mostly enough.
Yes, there are two issues for me: my hardware is not powerful enough (only 8 GB VRAM), and the models are still not intelligent enough. At the moment I keep tabs open for different websites, and when I have a question I compare their answers to see where things stand. I would like a model that says "I do not know" more often instead of responding with the wrong thing. I would also like it to follow instructions: right now I ask them to "rewrite the previous response but without X" and they respond "sure, here is the response without X", and then they still don't follow the instruction, as if they are "hard-coded" to do X. An example of X is "do not add a summary or conclusion".
It's the trick where a user asks you a question: "Who worked on the billing UI refresh last year?" - and you turn that question into a search against a bunch of private documents, find the top matches, copy them into a big prompt to an LLM and ask it to use that data to answer the user's question.
There's a HUGE amount of depth to building this well - it's one of the most actively explored parts of LLM/generative-AI at the moment, because being able to ask human-language questions of large private datasets is incredibly useful.
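At its simplest, that flow is "embed the question, find the closest chunks, stuff them into the prompt". A toy sketch where embed() and llm() stand in for whatever embedding model and LLM you use (assuming embed() returns normalized vectors):

```python
# Toy retrieval-augmented generation loop. embed() and llm() are stand-ins
# for whatever embedding model / LLM you use; a real system would use a
# vector database rather than scoring a plain list of chunks.
import numpy as np

def retrieve(question, chunks, embed, top_k=3):
    q = embed(question)
    # Dot product as similarity, assuming embed() returns normalized vectors.
    scored = sorted(chunks, key=lambda c: -float(np.dot(q, embed(c))))
    return scored[:top_k]

def answer(question, chunks, embed, llm):
    context = "\n\n".join(retrieve(question, chunks, embed))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)
```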
Retrieval-Augmented Generation, where you ask an LLM to answer a question by giving it some context information that you have retrieved from your own data rather than just the data it was trained on.
Explained by GPT itself as if you were a teddy bear.
----
Okay little teddybears, let me explain what retrieval augmented generation is in a way you can understand!
You see, sometimes when big AI models like Claude want to talk about something, they may not know all the facts. But they have a friend named the knowledge base who knows lots of information!
When Claude wants to talk about something new, he first asks the knowledge base "What do you know about X?". The knowledge base looks through all its facts and finds the most helpful ones. Then it shares them with Claude so he has more context before talking.
This process of Claude asking the knowledge base for facts is called retrieval augmented generation. It helps Claude sound smarter and avoid mistakes, because he has extra information from his knowledgeable friend the knowledge base.
The next time Claude wants to chat with you teddybears, he will be even better prepared with facts from the knowledge base to have an interesting conversation!
Spend the seed-round investment on building solid software but not on building an income stream that can satisfy investors, thus receive no new funding and let the company die.
There's gotta be somewhere in the middle. Vercel's movements feel a lot like the "Embrace, extend, and extinguish" playbook.
Maybe there is a class of developer out there that doesn't get spooked by that but it definitely has created an adversarial place for Vercel in my mind. I feel like I need to be careful when touching anything Vercel have touched so that I don't fall into a trap.
If you have any concrete feedback on what we should improve, I’m all ears. We heard feedback from the community that they wanted better documentation and guidance on self-hosting and we shipped it last month[1]. Curious what you’d like to see improved.
I just stopped using their Next.js project because you can no longer self-host the middleware; they now only support the edge runtime, and several libraries don't work with it.
I'm calling this situation Fauxpen Source. The recent moves definitely feel anticompetitive or at least trying to force you into using their products
I'm migrating to vite+vike (next/nuxt like experience for any framework)
Middleware does work with self-hosting[1]. It’s a more limited runtime that’s based on web standard APIs, which creates optionality for running it in high performance / resource constrained scenarios.
It _can_ work, but _won't_ for most real world workloads
Beyond the runtime limitations, it is poorly designed and requires you to effectively write a router when the rest of the system has automatic routing assembly
> PDFs are specifically a problem: I have complex docs with lots of messy formatting. How do I represent this in the right way so the LLM can understand it?
40 years after PostScript and this is still a problem that one needs to throw AI at. I feel the software development and human-computer interaction took a wrong turn along the way. What happened to the semantic web?
It turns out that it takes thought and effort to semantically tag/classify everything consistently and completely, so rather than make the decisions, it's easier to just not do it.
Once LlamaParse's output is plugged into a recursive retrieval strategy, you can get SotA results on question answering over complex text (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
AMA