I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->structured-text extractors (I built several in the past, including https://github.com/axa-group/Parsr).
For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), and it uses a mixture of heuristics and machine learning models to reconstruct the document.
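For anyone who wants to see the difference between the two modes, here is a rough sketch of calling the parser via the llama_parse client, based on the public examples (parameter names may differ slightly by client version, and the file path is a placeholder):

```python
# Rough sketch based on the public llama_parse examples; parameter names
# may differ by client version. "report.pdf" is a placeholder.
from llama_parse import LlamaParse

# "text" mode: raw character extraction, before document reconstruction.
raw_docs = LlamaParse(result_type="text").load_data("report.pdf")

# "markdown" mode: the reconstructed document (headings, tables, lists).
md_docs = LlamaParse(result_type="markdown").load_data("report.pdf")

print(raw_docs[0].text[:500])  # raw extraction
print(md_docs[0].text[:500])   # reconstructed markdown
```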
For context: I’m an engineering manager for the production systems of one of the biggest mortgage companies. Think millions upon millions of PDFs of all kinds.
1. The comparison with the open source pdf libraries is rather strange. They are self contained libraries. Not ML augmented services. Does this mean you plan to open source the underlying technology and offer it under the same licenses as pypdf or pymupdf?
2. How does this compare to AWS Kendra? I have one of the bigger deployments out there and am looking for alternatives.
3. How fast is the pdf extraction? In terms of microseconds given we pay for execution time.
Making something appear as if it might be an open source library, but is actually just a wrapper to a paid hosted service is endemic in the Python world in general and the LLM space in particular.
Feel free to try https://github.com/nlmatics/llmsherpa. It is fully open source - both client and server - and it is not ML augmented, so it's very fast and cheap to run.
It’s too expensive for the features it provides. The person who estimated the cost got it wrong and we're paying 3x. But I can’t just rip it out because it powers a very important piece of software.
If you're serious about this, I'm working on a new startup that is 100% focused on using different techniques to help with this. I'd love to talk to you about what your needs are, as it would help me have more data points as I take this to market. My email is in my profile if you want to shoot me a message.
(1) The "baseline" comparison was to PyPDF + Naive RAG. For the LlamaParse evaluation, you appear to have used a different RAG pipeline, called "recursive retrieval." Why not use the same pipeline to demonstrate the improvement from LlamaParse? Can you share the code to your evaluation for LlamaParse?
(2) I ran the benchmark for the PyPDF + Naive RAG solution, directly copying the code on the linked LlamaIndex repo [1]
I got very different numbers:
mean_correctness_score 3.941
mean_relevancy_score 0.826
mean_faithfulness_score 0.980
You reported:
mean_correctness_score 3.874
mean_relevancy_score 0.844
mean_faithfulness_score 0.667
Notably, the faithfulness score I measured for the baseline solution was actually higher than that reported for your proprietary LlamaParse based solution.
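For anyone who hasn't looked at it, the "naive" side of that baseline is essentially raw text extraction plus fixed-size chunking. A minimal sketch of what that looks like (the file name and chunk size are placeholders; the embedding and eval steps are omitted):

```python
# Minimal sketch of a PyPDF + naive chunking baseline.
# "report.pdf" and the chunk size are placeholders; embedding/eval omitted.
from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Fixed-size chunking: no awareness of tables, sections, or layout,
# so a table row can easily be split across two chunks.
chunk_size = 1024
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```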
Thanks for running through the benchmark! Just to clarify some things:
(1) The idea is that LlamaParse's markdown representation lends itself to the rest of LlamaIndex's advanced indexing/retrieval abstractions. Recursive retrieval is a fancy retrieval method designed to model documents with embedded objects, but it depends on good PDF parsing. Naive PyPDF parsing can't be used with recursive retrieval. Our goal is to demonstrate the e2e RAG capabilities of LlamaParse + advanced retrieval vs. what you can build with a naive PDF parser (rough sketch of the setup after this comment).
(2) Since we use LLM-based evals, your correctness and relevancy metrics look to be consistent and within margin of error (and lower than our LlamaParse metrics). The faithfulness score seems way off though, and quite high on your side, so I'm not sure what's going on there. Maybe hop into our Discord and share the results in our channel?
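Rough sketch of that LlamaParse + recursive retrieval setup, following the patterns in the public llama_parse / LlamaIndex examples (import paths and class names may vary by version; the file name and question are placeholders):

```python
# Sketch of LlamaParse markdown output feeding recursive retrieval, following
# the public llama_parse / LlamaIndex examples; import paths and class names
# may vary by version. "report.pdf" is a placeholder.
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser

# 1. Parse the PDF into markdown; tables come back as embedded elements.
documents = LlamaParse(result_type="markdown").load_data("report.pdf")

# 2. Split into text nodes plus "object" nodes (e.g. table summaries that
#    point back to the underlying table).
node_parser = MarkdownElementNodeParser(num_workers=4)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# 3. Index both; a query can recurse from a table summary into the full table.
index = VectorStoreIndex(nodes=base_nodes + objects)
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What was the total revenue in 2022?"))
```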
One of the things I've been helping a team with is dealing with mountains of ppt decks, converted to PDF, and then parsed/chunked/embedded into vector storage. It doesn't work that well because a ppt is not a document. What are your thoughts on dealing with other formats that are first converted to PDF?
I tried LlamaParse and was impressed by the result on a document with a complex layout. None of the open-source parsers gave me results that were even close. Can you please share how much time it took your team to build this parser?
Yes, however we will soon support other filetypes natively, and this will lead to better results (when converting from one format to another, there is often some information loss)
You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked against tesseract, but it outperforms it by a lot (benchmarks in repo). Happy to discuss.
How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.
PaddleOCR works pretty well, how are you planning to integrate it in your workflow? I found huge differences in throughput between python serving and frameworks (i.e. NVIDIA Triton Inference Server).
Performance depends on the language / type of docs. The main reason for contemplating switching is that EasyOCR seems to not be maintained anymore (no commits in the repo in the last 5 months).
> This is where LlamaParse comes in. We’ve developed a proprietary parsing service that is incredibly good at parsing PDFs with complex tables into a well-structured markdown format.
This is my problem with projects that start off as open source and become famous because of their community contributions and attention, then the project leaders get that sweet VC money (or not) and make something proprietary.
We've seen it with Langchain and several other "fake open source" projects.
I don't disagree but I think it's not a question of "why" but "how".
It could still be licensed in a restricted way, but keeping secret how it works is unfortunate - it breaks the chain of learning that is happening across the open ecosystem and, if the technique is any good, all it does is force open models to build an actually open equivalent so that further progress can be made (and if it's not really any good then it's snake oil, which is worse). Even if it's great it essentially becomes a dead end for the people who actually need and want an open model ecosystem.
We are planning to move our blog off of Medium (we've been busy!), but this post is public so you can actually just click through the nag screen if you see one.
> Is it really that much of a burden to host a mini blog?
The problem is not that it's a hassle to setup and maintain, but that discovery and network effects are a lot smaller when self-hosting. Being on Twitter and Medium gives you a lot more readers from the get-go, just because of the built-in discovery.
I still self-host my own stuff, but I would lie if I said I didn't understand why people use Twitter and Medium for hosting their content.
I would also like to know how it compares to any of the commercial offerings from Azure/AWS/GCP. They all have document parsing tools that I have found better than tools like Unstructured. Sure, you don't have some of the "magic" of segmenting text for vectorization and RAG, but IMO that's the easy part. The hard part is pulling data, forms, tables, and text out of the PDF, which I find the cloud tools do a superior job of.
Nothing beats Google for text extraction tasks in my testing, especially for East Asian languages. I wish something else worked better, because their services are fairly expensive.
I mostly have used Textract but have found all 3 to be fairly similar in accuracy, with differences in structure and compatible languages. I think my call-out here is that I don't think any of these libraries, LlamaIndex or Unstructured, can compete in this area. I would rather use GCP/Azure/AWS to define the structure from a PDF and these for the RAG portion, if anything.
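To make the comparison concrete, the cloud route to forms/tables is roughly one API call plus walking the response. A minimal Textract sketch with boto3 (bucket/object names are placeholders; multi-page PDFs go through the async start_document_analysis API instead):

```python
# Minimal sketch of extracting tables/forms with AWS Textract via boto3.
# Bucket/object names are placeholders. analyze_document handles single-page
# documents; multi-page PDFs use the async start_document_analysis API.
import boto3

textract = boto3.client("textract")
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-docs-bucket", "Name": "report.pdf"}},
    FeatureTypes=["TABLES", "FORMS"],
)

# The response is a flat list of blocks that must be walked to rebuild
# tables (CELL blocks) and key-value pairs.
cells = [b for b in response["Blocks"] if b["BlockType"] == "CELL"]
print(f"Detected {len(cells)} table cells")
```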
Hmm, I have to say I'm pretty unimpressed with my initial experience here.
1. The sign up with email just endlessly redirected, click link in email, ask to sign up with email, put in email, click link in email, etc.
2. Fine, I'll sign in with Google.
3. A PDF parser? Seriously that's what all this fuss is about? There are so many options already out there, PDFBox, iText, Unstructured, PyPDF, PDF.js, PdfMiner not to mention extraction services available from the hyperscalers. Super confused why anyone needs this.
LlamaIndex is way more than a PDF parser. It's the most widely used RAG tool chain, and their cloud looks to be a managed version of that.
Specific to the parser, they do show where tools like those you mentioned fail and their LLM based parser captures the full data the aforementioned miss.
Yeah, but their platform is basically a janky PDF parser which is why I don't understand what the hype is about.
It's easy to cherry pick a PDF for marketing purposes and claim you're better. I didn't miss it, I just don't believe marketing announcements at face value. I tried their parser on a PDF with a bit of complex formatting like multiple columns, tables and a couple images and it choked, spitting out one big markdown header with jumbled text. Not impressed.
To get good RAG performance you will need a good chunking strategy. Simply getting all the text is not good enough; knowing the boundaries of tables, lists, paragraphs, sections, etc. is helpful.
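As a tiny illustration of why boundaries matter, here is a sketch that splits a markdown-style document on heading boundaries so each section (and any table inside it) stays in one chunk, instead of being cut across by fixed-size chunks:

```python
# Sketch: structure-aware chunking of markdown on heading boundaries,
# so a section and the table inside it stay together in one chunk.
import re

def chunk_by_heading(markdown: str) -> list[str]:
    # Split right before lines that start a markdown heading ("#", "##", ...).
    parts = re.split(r"\n(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = """# Revenue
| Quarter | Revenue |
|---------|---------|
| Q1      | $1.2M   |
| Q2      | $1.5M   |

## Notes
Q2 includes a one-time licensing deal.
"""

for chunk in chunk_by_heading(doc):
    print("---")
    print(chunk)
```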
I think LlamaParse is trying to solve a hard problem. Many enterprise customers I know have strong need to parse PDF files and extract data accurately.
I found the interface a bit confusing. From your blog post, LlamaParse can extract numbers in tables, but it appears that the output isn't provided in tabular format. Instead, access to these numbers is only available through question answering. Is this accurate?
Question: why build this when you can use LLMs to extract the data in the most appropriate format to begin with? Isn't this a bit redundant? Perhaps it makes sense in the short term due to cost, but in the long run this problem can be solved generically with LLMs.
Sorry for the off-topic question: are there any LLM services I can use in the cloud, similar to OpenAI? I do not have a good enough MacBook to run different models locally.
Not OP, but in my case OpenAI does not want my money; they only accept credit cards. Netflix, for example, wants my money, so they offer more payment options.
Also, I would like to pay for an equivalent alternative that is less censored. For example, ChatGPT had a bug one day where it refused to tell me how to force a type cast in TypeScript and showed me a moderation error. So I want an AI that is targeted at adults, not at children in some religious school in the USA.
There is a somewhat unfiltered GPT-4 on Azure, but they really don't want anybody's money (AFAIK only "trusted" corporate entities can access it).
At this time, your only option is local models. If you don't have the hardware to run them yourself, there are plenty of hosts - Poe/Perplexity/Together etc.
Llama 3 is (hopefully) coming soon, and if it has improved as much as Llama 2 improved over Llama 1, and provides at least a 16k baseline context size, it will be in between GPT-3.5 and GPT-4 in terms of quality, which is mostly enough.
Yes, there are two issues for me: my hardware is not powerful enough (only 8 GB VRAM), and the models are still not intelligent enough. At the moment I keep tabs open for different websites, and when I have a question I compare their answers to see where things stand. I would like a model that says "I do not know" more often instead of responding with the wrong thing. I would also like it to follow instructions: right now I ask them to "rewrite the previous response but without X" and they respond "sure, here is the response without X", and then they still don't follow the instruction, as if they are "hard-coded" to do X. An example of X is "do not add a summary or conclusion".
It's the trick where a user asks you a question: "Who worked on the billing UI refresh last year?" - and you turn that question into a search against a bunch of private documents, find the top matches, copy them into a big prompt to an LLM and ask it to use that data to answer the user's question.
There's a HUGE amount of depth to building this well - it's one of the most actively explored parts of LLM/generative-AI at the moment, because being able to ask human-language questions of large private datasets is incredibly useful.
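At its simplest, that flow is "embed the question, find the closest chunks, stuff them into the prompt". A toy sketch where embed() and llm() stand in for whatever embedding model and LLM you use (assuming embed() returns normalized vectors):

```python
# Toy retrieval-augmented generation loop. embed() and llm() are stand-ins
# for whatever embedding model / LLM you use; a real system would use a
# vector database rather than scoring a plain list of chunks.
import numpy as np

def retrieve(question, chunks, embed, top_k=3):
    q = embed(question)
    # Dot product as similarity, assuming embed() returns normalized vectors.
    scored = sorted(chunks, key=lambda c: -float(np.dot(q, embed(c))))
    return scored[:top_k]

def answer(question, chunks, embed, llm):
    context = "\n\n".join(retrieve(question, chunks, embed))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)
```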
Retrieval-Augmented Generation, where you ask an LLM to answer a question by giving it some context information that you have retrieved from your own data rather than just the data it was trained on.
Explained by GPT itself as if you were a teddy bear.
----
Okay little teddybears, let me explain what retrieval augmented generation is in a way you can understand!
You see, sometimes when big AI models like Claude want to talk about something, they may not know all the facts. But they have a friend named the knowledge base who knows lots of information!
When Claude wants to talk about something new, he first asks the knowledge base "What do you know about X?". The knowledge base looks through all its facts and finds the most helpful ones. Then it shares them with Claude so he has more context before talking.
This process of Claude asking the knowledge base for facts is called retrieval augmented generation. It helps Claude sound smarter and avoid mistakes, because he has extra information from his knowledgeable friend the knowledge base.
The next time Claude wants to chat with you teddybears, he will be even better prepared with facts from the knowledge base to have an interesting conversation!
Spend the seed-round investment on building solid software but not on building an income stream that can satisfy investors, thus receive no new funding and let the company die.
There's gotta be somewhere in the middle. Vercel's movements feel a lot like the "Embrace, extend, and extinguish" playbook.
Maybe there is a class of developer out there that doesn't get spooked by that but it definitely has created an adversarial place for Vercel in my mind. I feel like I need to be careful when touching anything Vercel have touched so that I don't fall into a trap.
If you have any concrete feedback on what we should improve, I’m all ears. We heard feedback from the community that they wanted better documentation and guidance on self-hosting and we shipped it last month[1]. Curious what you’d like to see improved.
I just stopped using their Next.js project because you can no longer self-host the middleware; they now only support the edge runtime, and several libraries don't work with it.
I'm calling this situation Fauxpen Source. The recent moves definitely feel anticompetitive or at least trying to force you into using their products
I'm migrating to vite+vike (next/nuxt like experience for any framework)
Middleware does work with self-hosting[1]. It’s a more limited runtime that’s based on web standard APIs, which creates optionality for running it in high performance / resource constrained scenarios.
It _can_ work, but _won't_ for most real world workloads
Beyond the runtime limitations, it is poorly designed and requires you to effectively write a router when the rest of the system has automatic routing assembly
> PDFs are specifically a problem: I have complex docs with lots of messy formatting. How do I represent this in the right way so the LLM can understand it?
40 years after PostScript and this is still a problem that one needs to throw AI at. I feel the software development and human-computer interaction took a wrong turn along the way. What happened to the semantic web?
It turns out that it takes thought and effort to semantically tag/classify everything consistently and completely, so rather than make the decisions, it's easier to just not do it.
Once LlamaParse's output is plugged into a recursive retrieval strategy, you can get SotA results on question answering over complex text (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
AMA