LlamaCloud and LlamaParse (llamaindex.ai)
195 points by eferreira_ on Feb 20, 2024 | 82 comments



I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->structured-text extractors (I built several in the past, including https://github.com/axa-group/Parsr).

For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction), then uses a mixture of heuristics and machine learning models to reconstruct the document.
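For illustration, a minimal sketch of the two result modes using the llama-parse Python client (the API key and file path are placeholders):

    from llama_parse import LlamaParse

    # "text" returns the raw extracted document before reconstruction;
    # "markdown" returns the reconstructed, structured version.
    raw_parser = LlamaParse(api_key="llx-...", result_type="text")
    md_parser = LlamaParse(api_key="llx-...", result_type="markdown")

    raw_docs = raw_parser.load_data("./report.pdf")
    md_docs = md_parser.load_data("./report.pdf")

    print(raw_docs[0].text[:500])  # raw characters, pre-reconstruction
    print(md_docs[0].text[:500])   # reconstructed markdown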

Once plugged into a recursive retrieval strategy, it allows you to get SOTA results on question answering over complex text (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
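Roughly, the setup in the linked notebook looks like the sketch below (exact imports depend on your llama-index version; this assumes 0.10.x, and the file path and question are placeholders):

    from llama_parse import LlamaParse
    from llama_index.core import VectorStoreIndex
    from llama_index.core.node_parser import MarkdownElementNodeParser

    documents = LlamaParse(result_type="markdown").load_data("./report.pdf")

    # Split the markdown into text nodes plus "object" nodes for embedded
    # tables, so retrieval can recurse from a table summary into its contents
    # (the parser uses an LLM to summarize tables by default).
    node_parser = MarkdownElementNodeParser(num_workers=8)
    nodes = node_parser.get_nodes_from_documents(documents)
    base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

    index = VectorStoreIndex(nodes=base_nodes + objects)
    query_engine = index.as_query_engine(similarity_top_k=5)
    print(query_engine.query("What was revenue in 2021?"))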

AMA


For context: I’m an engineering manager for the production systems of one of the biggest mortgage companies. Think millions and millions of PDFs of all kinds.

1. The comparison with the open source PDF libraries is rather strange. They are self-contained libraries, not ML-augmented services. Does this mean you plan to open source the underlying technology and offer it under the same licenses as pypdf or pymupdf?

2. How does this compare to AWS Kendra? I have one of the bigger deployments out there and am looking for alternatives.

3. How fast is the PDF extraction, in terms of microseconds, given that we pay for execution time?


Making something appear as if it might be an open source library when it's actually just a wrapper around a paid hosted service is endemic in the Python world in general and the LLM space in particular.


Feel free to try https://github.com/nlmatics/llmsherpa. It is fully open source (both client and server) and it is not ML-augmented, so it's very fast and cheap to run.


Thank you for sharing.


Out of interest: why are you looking to move from Kendra? I haven't heard much good about it, but I'm keen to understand what your issues with it are.


It’s too expensive for the features it provides. The person who estimated the cost got it wrong and we’re paying 3x. But I can’t just rip it out, because it powers a very important piece of software.


If you're serious about this: I'm working on a new startup that is 100% focused on using different techniques to help with this. I'd love to talk to you about what your needs are, as it would give me more data points as I take this to market. My email is in my profile if you want to shoot me a message.


Sorry, but I’m in the process of building my own and can’t provide any info due to regulations. Best of luck.


We have had great success with Apache PDFBox.


I am confused by the benchmarks you provided.

(1) The "baseline" comparison was to PyPDF + Naive RAG. For the LlamaParse evaluation, you appear to have used a different RAG pipeline, called "recursive retrieval." Why not use the same pipeline to demonstrate the improvement from LlamaParse? Can you share the code to your evaluation for LlamaParse?

(2) I ran the benchmark for the PyPDF + Naive RAG solution, directly copying the code on the linked LlamaIndex repo [1]

I got very different numbers: mean_correctness_score 3.941, mean_relevancy_score 0.826, mean_faithfulness_score 0.980

You reported: mean_correctness_score 3.874, mean_relevancy_score 0.844, mean_faithfulness_score 0.667

Notably, the faithfulness score I measured for the baseline solution was actually higher than that reported for your proprietary LlamaParse based solution.

[1] https://github.com/run-llama/llama-hub/tree/main/llama_hub/l...


(jerry here)

Thanks for running through the benchmark! Just to clarify some things: (1) The idea is that LlamaParse's markdown representation lends itself to the rest of LlamaIndex's advanced indexing/retrieval abstractions. Recursive retrieval is a fancy retrieval method designed to model documents with embedded objects, but it depends on good PDF parsing; naive PyPDF parsing can't be used with it. Our goal is to demonstrate the e2e RAG capabilities of LlamaParse + advanced retrieval vs. what you can build with a naive PDF parser.

(2) Since we use LLM-based evals, your correctness and relevancy metrics look consistent and within the margin of error (and lower than our LlamaParse metrics). The faithfulness score seems way off, though, and quite high from your side, so I'm not sure what's going on there. Maybe hop into our Discord and share the results in our channel?


One of the things I've been helping a team with is dealing with mountains of PPT decks, converted to PDF and then parsed/chunked/embedded into vector storage. It doesn't work that well, because a PPT is not a document. What are your thoughts on dealing with other formats first converted to PDF?


For PPT, chunking 'per page' often works quite well. With LlamaParse this means splitting on the "\n---\n" page separator token.
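A minimal sketch, assuming `result` holds the parsed output text:

    # Split the parsed output on LlamaParse's page separator,
    # dropping any empty pages.
    pages = result.split("\n---\n")
    chunks = [page.strip() for page in pages if page.strip()]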


I tried LlamaParse and was impressed by the result on a document with a complex layout. None of the open-source parsers gave me results that were even close. Can you please share how much time it took your team to build this parser?


Does it work with other filetypes converted into PDFs? For example docx, ppt, png, etc.


Yes. However, we will soon support other filetypes natively, which will lead to better results (when converting from one format to another, there is often some information loss).


Cool! Which OCR engine/model do you use?


EasyOCR; we may switch to PaddleOCR in the future.
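For reference, EasyOCR's basic API looks like this (just the library's public interface, not LlamaParse's internal pipeline; the image path is a placeholder):

    import easyocr

    reader = easyocr.Reader(["en"])        # loads detection + recognition models
    results = reader.readtext("page.png")  # list of (bbox, text, confidence)
    for bbox, text, confidence in results:
        print(f"{confidence:.2f}  {text}")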


You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked against tesseract, but it outperforms it by a lot (benchmarks in repo). Happy to discuss.

You could also try https://github.com/VikParuchuri/marker for general PDF parsing (I'm also the author) - it seems like you're more focused on tables.


How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.


I unfortunately haven't had time to benchmark against more than tesseract.


That’s my experience as well. I am still looking for alternatives, but Textract is now the baseline.


Thanks for sharing.


PaddleOCR works pretty well; how are you planning to integrate it into your workflow? I found huge differences in throughput between Python serving and dedicated frameworks (e.g. NVIDIA Triton Inference Server).


Grateful for your insight! Could you explain the reason for the switch? Is there any benchmark data available to share?


Performance depends on the language / type of docs. The main reason for contemplating a switch is that EasyOCR seems to no longer be maintained (no commits in the repo in the last 5 months).


Can it detect and strip out advertisements?


> This is where LlamaParse comes in. We’ve developed a proprietary parsing service that is incredibly good at parsing PDFs with complex tables into a well-structured markdown format.

This is my problem with projects that start off as open source and become famous because of their community contributions and attention, then the project leaders get that sweet VC money (or not) and make something proprietary.

We've seen it with Langchain and several other "fake open source" projects.


LlamaParse is proprietary, but the main LI package isn't, and you don't need the former to use the latter.

Why shouldn’t they make money? LI is a fantastic way to do RAG.


I don't disagree but I think it's not a question of "why" but "how".

It could still be licensed in a restricted way, but keeping how it works secret is unfortunate: it breaks the chain of learning that is happening across the open ecosystem, and, if the technique is any good, all it does is force the open ecosystem to build an actually open equivalent so that further progress can be made (and if it's not really any good, then it's snake oil, which is worse). Even if it's great, it essentially becomes a dead end for the people who actually need and want an open model ecosystem.


I don't understand why you'd post this on Medium. Medium doesn't even let me read anymore. You should have your own blog to post it on, so your audience can reach you.


We are planning to move our blog off of Medium (we've been busy!), but this post is public so you can actually just click through the nag screen if you see one.


Also, here's an X (formerly Twitter) thread:

https://x.com/llama_index/status/1759987390435996120?s=20


Which also isn't really available to unregistered users; you can only see the first tweet: https://i.imgur.com/SJA2Gzs.png


Right! Both Medium and Twitter are hostile if you don’t join them. Is it really that much of a burden to host a mini blog?


> Is it really that much of a burden to host a mini blog?

The problem is not that it's a hassle to set up and maintain, but that discovery and network effects are a lot smaller when self-hosting. Being on Twitter and Medium gives you a lot more readers from the get-go, just because of the built-in discovery.

I still self-host my own stuff, but I would be lying if I said I didn't understand why people use Twitter and Medium for hosting their content.


I wonder how LlamaParse compares head to head with https://unstructured.io


I would also like to know how it compares to the commercial offerings from Azure/AWS/GCP. They all have document parsing tools that I have found better than tools like Unstructured. Sure, you don't get some of the "magic" of segmenting text for vectorization and RAG, but IMO that's the easy part. The hard part is pulling data, forms, tables, and text out of the PDF, which I find the cloud tools do a superior job of.


Nothing beats Google at text extraction tasks in my testing, especially for East Asian languages. I wish something else worked better, because their services are fairly expensive.


I have mostly used Textract, but have found all 3 to be fairly similar in accuracy, with differences in structure and supported languages. My call-out here is that I don't think any of these libraries, LlamaIndex or Unstructured, can compete in this area. I would rather use GCP/Azure/AWS to extract the structure from a PDF and these tools for the RAG portion, if anything.



Not clear to me why this got downvoted. Sensible question.


Hmm, I have to say I'm pretty unimpressed with my initial experience here.

1. The sign-up with email just endlessly redirected: click the link in the email, get asked to sign up with email, put in the email, click the link in the email, etc.

2. Fine, I'll sign in with Google.

3. A PDF parser? Seriously, that's what all this fuss is about? There are so many options already out there: PDFBox, iText, Unstructured, PyPDF, PDF.js, PdfMiner, not to mention the extraction services available from the hyperscalers. Super confused why anyone needs this.


LlamaIndex is way more than a PDF parser. It's the most widely used RAG toolchain, and their cloud looks to be a managed version of that.

Specific to the parser, they do show where tools like those you mentioned fail and where their LLM-based parser captures the full data the aforementioned miss.


Yeah, but their platform is basically a janky PDF parser, which is why I don't understand what the hype is about.

It's easy to cherry-pick a PDF for marketing purposes and claim you're better. I didn't miss it; I just don't take marketing announcements at face value. I tried their parser on a PDF with a bit of complex formatting (multiple columns, tables, and a couple of images) and it choked, spitting out one big markdown header with jumbled text. Not impressed.


There is a PDF parser, LlamaParse (which is open to everyone), and a managed ingestion/retrieval service, which is currently invite-only.

Planning broader releases in the future for sure.


To get good RAG performance you need a good chunking strategy. Simply getting all the text is not good enough; knowing the boundaries of tables, lists, paragraphs, sections, etc. is helpful.

Great work by the LlamaIndex team. Also feel free to try https://github.com/nlmatics/llmsherpa, which takes into account some of the things I mentioned.
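Based on the project's README, basic usage looks roughly like this (the PDF path is a placeholder):

    from llmsherpa.readers import LayoutPDFReader

    llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
    pdf_reader = LayoutPDFReader(llmsherpa_api_url)
    doc = pdf_reader.read_pdf("path/to/file.pdf")

    # Chunks respect layout boundaries (sections, paragraphs, lists, tables).
    for chunk in doc.chunks():
        print(chunk.to_context_text())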


I think LlamaParse is trying to solve a hard problem. Many enterprise customers I know have a strong need to parse PDF files and extract data accurately. I found the interface a bit confusing, though. From your blog post, LlamaParse can extract numbers in tables, but it appears that the output isn't provided in tabular format; instead, access to these numbers is only available through question answering. Is this accurate?


The output is either text or markdown, and from there you can handle it however you need.

In LlamaIndex, for example, there are a few markdown-specific classes that work well with this.

You can find an example over in the repo -- https://github.com/run-llama/llama_parse/blob/main/examples/...


I was hoping to get structured data. For example, parsing an invoice would give results like {"title": ..., "line_items": [...], "date": ...}


Question: why build this when you can use LLMs to extract the data in the most appropriate format to begin with? Isn't this a bit redundant? Perhaps it makes sense in the short term due to cost, but in the long run this problem can be solved generically with LLMs.


What will the pricing be like?


LlamaParse solves exactly the problem I've encountered over and over with RAG. Getting structured info from unstructured data is a pain.


Isn't that similar to what AWS Textract does? They have this functionality of parsing and querying info from tables and forms.

I'm sure that for LI, having it as part of a workflow, with history to retrieve for RAG, makes it easier for users, but why reinvent the wheel?

Unless I missed something


Sorry for the offtopic question: are there any LLM services that I can use in the cloud, similar to OpenAI? I do not have a good enough MacBook to run different models locally.


AI21 Labs provides access to its Jurassic-2 LLM (and derived models for summarization, grammatical error correction etc.) via a web UI [1] and an API.

Disclaimer: I'm a software engineer at AI21.

[1] https://www.ai21.com/studio


If you use the LLM Chatbot arena[1], you can get two bots to compete to solve your prompts!

[1]: https://chat.lmsys.org/?arena


Hmmm... OpenAI itself?

Did you intend to rule out OpenAI from consideration?

You mentioned hardware being a constraint, but that doesn't tell me why you specifically wanted to find an alternative to OpenAI.


Not OP, but in my case OpenAI does not want my money: they only accept credit cards. Netflix, for example, wants my money, so they offer more choices.

Also, I would like to pay for an equivalent alternative that is less censored. For example, ChatGPT had a bug one day where it refused to tell me how to force a type cast in TypeScript; it showed me a moderation error. So I want an AI that is targeted at adults, not at children in some religious school in the USA.


There is a somewhat unfiltered GPT-4 at Azure, but they really don't want anybody's money (AFAIK only "trusted" corporate entities can access it).

At this time, your only option is local models. If you don't have the hardware to run them yourself, there are plenty of hosts: Poe/Perplexity/Together etc.

Llama 3 is (hopefully) coming soon, and if it has improved as much as Llama 2 improved over Llama 1, and provides at least a 16k baseline context size, it will be in between GPT-3.5 and GPT-4 in terms of quality, which is mostly enough.


Yes, there are two issues for me: my hardware is not powerful enough (only 8 GB of VRAM), and the models are still not intelligent enough. At the moment I have tabs open for different websites, and when I have a question I compare their answers. I would like a model that says it does not know more often, rather than responding with the wrong thing. I would also like it to follow instructions: right now I ask them to "rewrite the previous response but without X" and they respond "sure, here is the response without X" but then don't follow the instruction, as if they are "hard-coded" to do X. An example of X is "do not add a summary or conclusion".


I have used OpenAI, but I want to try several other LLMs as well.


TogetherAI: Unaffiliated, but I've found it good.

Others include Runpod, Replicate, probably others.


I think Anthropic and Mistral offer this but you have to join their waiting lists first.


LlamaParse looks nice. Is there a way to also return page numbers with the markdown? This is important for our use case.


What's a RAG application?


RAG stands for Retrieval Augmented Generation.

It's the trick where a user asks you a question: "Who worked on the billing UI refresh last year?" - and you turn that question into a search against a bunch of private documents, find the top matches, copy them into a big prompt to an LLM and ask it to use that data to answer the user's question.

There's a HUGE amount of depth to building this well - it's one of the most actively explored parts of LLM/generative-AI at the moment, because being able to ask human-language questions of large private datasets is incredibly useful.
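A minimal sketch of that flow; `embed`, `vector_search`, and `llm` here are hypothetical stand-ins for your embedding model, vector store, and LLM client:

    def answer(question: str, llm, embed, vector_search, k: int = 5) -> str:
        # 1. Turn the user's question into a search against private documents.
        query_vector = embed(question)
        top_matches = vector_search(query_vector, k=k)  # best-matching chunks

        # 2. Copy the matches into a big prompt and ask the LLM to answer with them.
        context = "\n\n".join(chunk.text for chunk in top_matches)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}"
        )
        return llm(prompt)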


Retrieval-Augmented Generation, where you ask an LLM to answer a question by giving it some context information that you have retrieved from your own data rather than just the data it was trained on.


what is RAG?


Retrieval augmented generation.

Explained by GPT itself, as if you were a teddy bear.

----

Okay little teddybears, let me explain what retrieval augmented generation is in a way you can understand!

You see, sometimes when big AI models like Claude want to talk about something, they may not know all the facts. But they have a friend named the knowledge base who knows lots of information!

When Claude wants to talk about something new, he first asks the knowledge base "What do you know about X?". The knowledge base looks through all its facts and finds the most helpful ones. Then it shares them with Claude so he has more context before talking.

This process of Claude asking the knowledge base for facts is called retrieval augmented generation. It helps Claude sound smarter and avoid mistakes, because he has extra information from his knowledgeable friend the knowledge base.

The next time Claude wants to chat with you teddybears, he will be even better prepared with facts from the knowledge base to have an interesting conversation!


is this related to vector dbs?


Modern playbook:

1. Build janky open source code base

2. Sell compute to run it

3. Build features that create compute lock-in (Vercel is a master at this)


Here's an alternative:

Spend your seed-round investment on building solid software but not on building an income stream that can satisfy investors, thus receive no new funding, and let the company die.


There's gotta be somewhere in the middle. Vercel's movements feel a lot like the "Embrace, extend, and extinguish" playbook.

Maybe there is a class of developer out there that doesn't get spooked by that, but it has definitely created an adversarial place for Vercel in my mind. I feel like I need to be careful when touching anything Vercel has touched so that I don't fall into a trap.


If you have any concrete feedback on what we should improve, I’m all ears. We heard feedback from the community that they wanted better documentation and guidance on self-hosting and we shipped it last month[1]. Curious what you’d like to see improved.

[1] https://nextjs.org/blog/next-14-1#improved-self-hosting


I just stopped using their NextJS project because you can no longer self-host the middleware; they now only support the edge runtime, and several libraries don't work with it.

I'm calling this situation Fauxpen Source. The recent moves definitely feel anticompetitive, or at least like they're trying to force you into using their products.

I'm migrating to vite+vike (a Next/Nuxt-like experience for any framework).


Middleware does work with self-hosting[1]. It's a more limited runtime that's based on web-standard APIs, which creates optionality for running it in high-performance / resource-constrained scenarios.

[1] https://nextjs.org/docs/app/building-your-application/deploy...


It _can_ work, but _won't_ for most real-world workloads.

Beyond the runtime limitations, it is poorly designed and requires you to effectively write a router, when the rest of the system has automatic routing assembly.


> PDFs are specifically a problem: I have complex docs with lots of messy formatting. How do I represent this in the right way so the LLM can understand it?

40 years after PostScript, and this is still a problem that one needs to throw AI at. I feel like software development and human-computer interaction took a wrong turn along the way. What happened to the semantic web?


It turns out that it takes thought and effort to semantically tag/classify everything consistently and completely, so rather than make the decisions, it's easier to just not do it.


What?

We still have 'the web'. PDFs are something different and separate.



