PgPDF: Pdf Type and Functions for Postgres (github.com/florents-tselai)
97 points by fforflo 10 months ago | 55 comments


Readers may also enjoy Steampipe [1], an open source tool to live query 140+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes, etc). It uses Postgres Foreign Data Wrappers under the hood and supports joins etc with other tables. (Disclaimer - I'm a lead on the project.)

1 - https://github.com/turbot/steampipe
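For a flavor of the SQL, here's a hedged sketch joining two AWS plugin tables; the table and column names follow the plugin's conventions as I understand them and may differ in your Steampipe version:

    -- List running EC2 instances with the CIDR block of their VPC.
    SELECT i.instance_id, i.instance_type, v.cidr_block
    FROM aws_ec2_instance AS i
    JOIN aws_vpc AS v ON v.vpc_id = i.vpc_id
    WHERE i.instance_state = 'running';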


That's a great piece of software indeed. I use it to join and analyze data from multiple APIs with materialized views, as most data analysis software only connects to databases, not APIs.
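A sketch of that materialized-view pattern, assuming a Steampipe GitHub plugin table; the table and column names here are illustrative and may not match the actual plugin schema:

    -- Snapshot API-backed data so analysis tools hit the cached copy, not the API.
    CREATE MATERIALIZED VIEW repo_snapshot AS
    SELECT name, fork_count, stargazer_count
    FROM github_my_repository;

    -- Refresh on whatever schedule the analysis needs.
    REFRESH MATERIALIZED VIEW repo_snapshot;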

It's also pretty easy to write custom plugins once you understand how it's done.


Thanks for leaning into Steampipe!


Interesting! Can you tell me the typical use-case for this? Is it more for dev-ops stuff? Like to query all your servers to do some dashboards of your infra? Or can that be used also for ML / datascience etc?


Steampipe is most commonly used for DevSecOps. It works with any BI tool for dashboards (it's just Postgres), but is really great when paired with Powerpipe [1] for security benchmarks, visualizing infrastructure and much more [2]. It can definitely be used for a wide range of data scenarios, e.g. there is an OpenAI plugin [3].

1 - https://github.com/turbot/powerpipe 2 - https://hub.powerpipe.io 3 - https://github.com/turbot/steampipe-plugin-openai


We've been using Steampipe every once in a while and it works as advertised, though perhaps SQL isn't always the abstraction of choice mentally. Congrats on hitting v1!


thank you! We believe that SQL makes the simple stuff easy and the hard stuff doable :). Hopefully our repos & hubs have good enough examples to get you started most of the time!


This is super cool.

Last time I searched only https://fdw.dev came up (from Supabase).


This is super cool! Do you have any integrations with duckdb? It seems like these two could work really well together


Beyond the Steampipe CLI (which has an embedded Postgres), you can use the steampipe plugins as native Postgres FDWs [1], SQLite extensions [2] or command line export tools [3]. We're definitely interested in bringing the plugins to duckdb as well, but haven't had a chance to focus there yet!

1 - https://steampipe.io/blog/2023-12-postgres-extensions 2 - https://steampipe.io/blog/2023-12-sqlite-extensions 3 - https://steampipe.io/blog/2023-12-steampipe-export


the plugins link from your github repo goes to 404: https://hub.powerpipe.io/plugins :)


doh! thanks for the heads up :-(. Fixed to be https://hub.steampipe.io/plugins


I think you mean disclosure.

Why would you disclaim something cool?


Ooo love this rabbit hole. I’ve always used “disclaimer” as “warning”, and would definitely use it for a conflict of interest, but technically the legal field uses it specifically for asserting a lack of liability (ish).

Certainly interesting, but I think I’m gonna stubbornly stand my descriptivist ground here. We conquered “literally”, and “sneak peak” is soon to fall — I’ll add this to the list!


Fair point. I guess that I'm disclosing that I lead the project and disclaiming the self-interest in posting? :)


Looks pretty cool. Will try it out.


Following the links, I find...

    pgPDF: The actual PDF parsing is done by poppler.
    Poppler is a PDF rendering library based on the xpdf-3.0 code base.
    Xpdf is based on XpdfWidget/Qt™, by Glyph & Cog.
    XpdfWidget is based on the same proven code used in Glyph & Cog's XpdfViewer library.
    The XpdfViewer® library / ActiveX control provides a PDF file viewer component for use in Windows applications.
Quite the rabbit hole!

Any licensing complications? Is it cross-platform? XpdfViewer seems to be proprietary and Windows-only.


Besides licensing issues, I'm not sure the solution should be to link Poppler (which has multiple CVEs every year on average) into the database server, especially if you process untrusted data. It seems like a great way to gain access to the database server.

Functionally it looks useful, but if these kinds of 'helpers' catch on, there really should be a way to sandbox the 'parser' processes.


Totally agree, this data should be supplied by a "page server" (analogous to a frame server in video production) over http using pdf.js so it can run in a browser based sandbox.

The risks of running this code are just way too high without an org level security policy about what access this compromised machine would have.


I keep going back and forth trying to figure out if this is sarcasm or not. Firstly it sounds sensible, then you're talking about PDF.js in a browser sandbox?!


Not sarcasm. When parsing untrusted, complex input with untrusted code, one should use multiple isolation domains.

    pgsql -> http -> [ firejail [ deno pdf.js (file:///untrusted.pdf) ] ]


OK, so you're using Deno with pdf.js instead of Poppler. While JavaScript is mostly 'memory safe', using all of Deno, pdf.js and firejail makes your attack surface huge and difficult to review or constrain, and will probably tank performance on a big dataset because you have to initialize the whole stack per request. All three of those tools have had significant CVEs too, so adding more layers increases the number of CVEs you have to deal with. I also don't see what firejail buys you when you constrain Deno (or another parser) to a properly secured container or VM.


Good catch, the whole PDF parsing ecosystem is kinda grey on those things. I tried to be extra-careful and I relied on poppler's official statement:

"Note that Poppler is licensed under the GPL, not the LGPL, so programs which call Poppler must be licensed under the GPL as well. See the section History and GPL licensing for more information."

See https://gitlab.freedesktop.org/poppler/poppler/-/blob/master...


Yeah, but the xpdf library part of XpdfViewer is open source (GPLv2).

https://www.xpdfreader.com/download.html

Note that the source code is not on GitHub; each version is just dumped as a tar.

https://www.xpdfreader.com/old-versions.html

I once needed it for some PDF experiments and put it on GitHub (this is the newest version; I did _not_ go through old versions one by one; I just dumped the 2 newest versions).

https://github.com/karelbilek/xpdf-mirror


This is fun. It would be interesting to add the ability to query references inside the page, like images. That could be modeled as a foreign key relationship to the page. I'm using some Python libraries to do that, and everything is wrapped in try/except blocks because PDFs are a mess. I wonder how poppler handles those kinds of files.


Fascinating, you wouldn't happen to have the code up would you? What kind of stuff are you doing, and what libraries did you use?


Sorry it's not my code to publish. I'm happy to answer general questions over email (on profile page).

PyPDF2 and Pillow to process images at a high level.


Interesting: I posted this a few days back, and certainly not "10 hours ago". Who was kind enough to re-surface this? Thanks :)

Some clarifications on a few comments I see downstream:

The motivating example was to easily support Full-Text Search (FTS) on PDFs with SQL only (see blog post https://tselai.com/full-text-search-pdf-postgres ). You can treat `pdf` as an alias for `text` and do everything possible.

On the next iteration, I made `pdf` a type (typical varlena object of bytes) to avoid hitting disk all the time. The file is loaded from the disk only once (if it's a valid pdf). One can store the `pdf` type (blob of bytes) as a standard Postgres type and use it for subsequent calls. Postgres will do its magic as usual. There is a potential next step of storing the parsed document just to save some time from re-parsing the bytes, but I deemed it a premature optimization.
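To make the FTS idea concrete, a minimal sketch using the casts described here; table, column, and path names are made up, and the exact cast behavior should be checked against the pgPDF docs:

    CREATE TABLE docs (id serial PRIMARY KEY, doc pdf);

    -- Cast a server-side path (text) or a bytea blob to the pdf type.
    INSERT INTO docs (doc) VALUES ('/var/data/report.pdf'::pdf);

    -- Treat the pdf as text and use regular Postgres full-text search.
    SELECT id
    FROM docs
    WHERE to_tsvector('english', doc::text) @@ plainto_tsquery('invoice total');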


> Interesting: I posted this a few days back, and certainly not "10 hours ago". Who was kind enough to re-surface this? Thanks :)

It's HN's second-chance pool: https://news.ycombinator.com/item?id=11662380


For one minute I thought - what a stupid idea, wrong level of abstraction. Now I think I might actually use this in an analysis setting for convenience. I guess I'll quickly find out what kinds of timeouts I'll run into once I ask for the titles of 10k documents.


I looked at it and initially thought it was a stupid idea to store PDF files in your database, and I thought it was even worse when I saw it uses files on the file system. If you want this, index your PDF files into a table when they are uploaded.

The only functions here all take a filesystem path which your database should definitely not have access to - why would you upload files/store PDFs on a database server!!?

These functions for getting the title or modification time of a PDF are also just not that useful.


> why would you upload files/store PDFs on a database server!!?

I don't really understand this question. You can put data where you like. There are no "database servers". These aren't whole toys, take them or leave them. They're made of Lego bricks, and so you can change them.


Yes, but clients shouldn't be encouraged to refer to server paths. It's unnecessary exposure of details and makes it harder to move things around. Also I wonder if this requires extra permissions. I know COPY from server files requires extra permissions:

> COPY naming a file or command is only allowed to database superusers or users who are granted one of the roles pg_read_server_files, pg_write_server_files, or pg_execute_server_program, since it allows reading or writing any file or running a program that the server has privileges to access.
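For the COPY case quoted above, granting one of those predefined roles looks like this (the user name is made up); whether pgPDF's path-taking functions perform an equivalent check isn't stated in this thread:

    GRANT pg_read_server_files TO etl_user;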


What do you mean by clients in this case?


It is perfectly legitimate (and in fact preferable in many cases) to store files in database BLOBs. Backups, consistency, security and (yes!) scalability come to mind.

Some early BLOB implementations had performance problems, which is where this notion that "you shouldn't store files in BLOBs" seems to come from. But modern DBMSes have fast BLOBs, especially if used properly through their streaming API (don't materialize the entire file in memory!).

We have a system in production that stores millions of files in BLOBs, some of them reaching multi-GB sizes, being accessed across the globe (a big enterprise company) by thousands of engineers, and never had performance problems with BLOBs.
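As a hedged sketch of the streaming route in Postgres specifically, large objects let you read slices instead of materializing the whole file; paths and names are illustrative:

    CREATE TABLE attachments (id serial PRIMARY KEY, file_oid oid);

    -- Server-side import creates a large object and returns its OID
    -- (requires appropriate privileges).
    INSERT INTO attachments (file_oid)
    VALUES (lo_import('/var/data/report.pdf'));

    -- Read an 8 KB slice rather than the whole file.
    SELECT lo_get(file_oid, 0, 8192) FROM attachments WHERE id = 1;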


> The only functions here all take a filesystem path

From the documentation:

> Creating a pdf type, by casting either text path or bytea blob.

With the example: SELECT ''::bytea::pdf;

So it's convenient to use the path to test quickly, but you can use anything in PostgreSQL which returns (or can be converted to) a bytea.
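One hedged way to produce that bytea server-side (pg_read_binary_file is a built-in, privileged function, and the path is made up); an application can equally send the bytea as a query parameter:

    SELECT pg_read_binary_file('/var/data/report.pdf')::pdf;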


So potentially there's a full-text search case (?) for doing this, but I'd always just default to adding documents like this to Elasticsearch instead. I just cannot see a good use case for this, and I can see lots of bad ones.


It's probably mainly aimed at data pipeline use cases. It's common to have stages of very "messy" data that you'll want to clean up. This lets you keep that part of it all in Postgres, if you like, if your messy data is PDFs, which is a fairly common case.

Not every object in a database needs to be ready for public consumption. Some of it's there for processing, for certain use cases.


The first iteration of this was actually FTS only https://tselai.com/full-text-search-pdf-postgres

In the second pass I made `pdf` a type.


The Postgres ecosystem keeps impressing me with its creativity.


It would be neat to see this as a TOAST type in Postgres, where the PDF was kept in a data structure alongside its parsed form. It would be relatively straightforward to perform searches and index/reindex deep into the documents.


I'm missing something I think. Currently the `pdf` type is indeed a blob of bytes. Read from disk, but not parsed. The type is created with storage=extended to avoid the typical 8KB tuple size limit.

In practice though, a PDF for most cases has text-like semantics, so with ::pdf::text you can have all the text-indexes you want.
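For example, an expression index on the text cast, assuming the pdf-to-text cast is marked immutable (Postgres requires that for expression indexes); table and column names are illustrative:

    CREATE INDEX docs_fts_idx
        ON docs
        USING gin (to_tsvector('english', doc::text));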


This way, you could keep pagination intact and only retrieve individual pages or sections on demand.


Whilst this is cool, why would we want to push this logic into the DB?

It seems cleaner to keep this in the service layer and use any PDF parsing library and subsequent schema to store the parsed files.


Interesting!

I wonder what the use case is compared to extracting this information in the programming language and then storing it alongside the PDF in separate table columns?


It can be useful for improving an ingestion pipeline: put your PDF collection in a temp table and then extract the information you want with pure SQL.


Slightly related - are PDFs natively compressed? They would probably compress well since they're often mostly text, saving space in the database.


Yes, but so is any PostgreSQL column value the size of a PDF blob (TOAST compresses large values by default).


As a general rule, yes they are compressed. As you point out, they compress really well.


I'm having difficulty understanding what the use case for this is.


Instead of having to have 27 columns you have one... I guess?

This also makes interop more difficult when working with indexing databases (i.e. elastic).

This does minimize the amount of client code required to parse PDFs, though.


But then it means I need to save the PDF in the database rather than some kind of storage


Postgres has been very good at that since 9.0.


RAG! AFAICT it doesn't do OCR off the shelf, but surely you could hook up Nougat/Tesseract/TOCR without too much difficulty and start parsing any PDFs that aren't parsed yet.


Now create a DBeaver extension to view the type.



