PgPDF: Pdf Type and Functions for Postgres (github.com/florents-tselai)
97 points by fforflo 10 months ago | 55 comments


Readers may also enjoy Steampipe [1], an open source tool to live query 140+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes, etc). It uses Postgres Foreign Data Wrappers under the hood and supports joins etc with other tables. (Disclaimer - I'm a lead on the project.)

1 - https://github.com/turbot/steampipe
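For a flavor of the SQL, here's a hedged sketch joining two AWS plugin tables; the table and column names follow the plugin's conventions as I understand them and may differ in your Steampipe version:

    -- List running EC2 instances with the CIDR block of their VPC.
    SELECT i.instance_id, i.instance_type, v.cidr_block
    FROM aws_ec2_instance AS i
    JOIN aws_vpc AS v ON v.vpc_id = i.vpc_id
    WHERE i.instance_state = 'running';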


That's a great piece of software indeed. I use it to join and analyze data from multiple APIs with materialized views, as most data analysis software only connects to databases, not APIs.
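A sketch of that materialized-view pattern, assuming a Steampipe GitHub plugin table; the table and column names here are illustrative and may not match the actual plugin schema:

    -- Snapshot API-backed data so analysis tools hit the cached copy, not the API.
    CREATE MATERIALIZED VIEW repo_snapshot AS
    SELECT name, fork_count, stargazer_count
    FROM github_my_repository;

    -- Refresh on whatever schedule the analysis needs.
    REFRESH MATERIALIZED VIEW repo_snapshot;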

It's also pretty easy to write custom plugins once you understand how it's done.


Thanks for leaning into Steampipe!


Interesting! Can you tell me the typical use-case for this? Is it more for dev-ops stuff? Like to query all your servers to do some dashboards of your infra? Or can that be used also for ML / datascience etc?


Steampipe is most commonly used for DevSecOps. It works with any BI tool for dashboards (it's just Postgres), but is really great when paired with Powerpipe [1] for security benchmarks, visualizing infrastructure and much more [2]. It can definitely be used for a wide range of data scenarios, e.g. there is an OpenAI plugin [3].

1 - https://github.com/turbot/powerpipe 2 - https://hub.powerpipe.io 3 - https://github.com/turbot/steampipe-plugin-openai


We've been using Steampipe every once in a while and it works as advertised, though perhaps SQL isn't always the abstraction of choice mentally. Congrats on hitting v1!


thank you! We believe that SQL makes the simple stuff easy and the hard stuff doable :). Hopefully our repos & hubs have good enough examples to get you started most of the time!


This is super cool.

Last time I searched only https://fdw.dev came up (from Supabase).


This is super cool! Do you have any integrations with duckdb? It seems like these two could work really well together


Beyond the Steampipe CLI (which has an embedded Postgres), you can use the steampipe plugins as native Postgres FDWs [1], SQLite extensions [2] or command line export tools [3]. We're definitely interested in bringing the plugins to duckdb as well, but haven't had a chance to focus there yet!

1 - https://steampipe.io/blog/2023-12-postgres-extensions 2 - https://steampipe.io/blog/2023-12-sqlite-extensions 3 - https://steampipe.io/blog/2023-12-steampipe-export


the plugins link from your github repo goes to 404: https://hub.powerpipe.io/plugins :)


doh! thanks for the heads up :-(. Fixed to be https://hub.steampipe.io/plugins


I think you mean disclosure.

Why would you disclaim something cool?


Ooo love this rabbit hole. I’ve always used “disclaimer” as “warning”, and would definitely use it for a conflict of interest, but technically the legal field uses it specifically for asserting a lack of liability (ish).

Certainly interesting, but I think I’m gonna stubbornly stand my descriptivist ground here. We conquered “literally”, and “sneak peak” is soon to fall — I’ll add this to the list!


Fair point. I guess that I'm disclosing that I lead the project and disclaiming the self-interest in posting? :)


Looks pretty cool. Will try it out.


Following the links, I find...

    pgPDF: The actual PDF parsing is done by poppler.
    Poppler is a PDF rendering library based on the xpdf-3.0 code base.
    Xpdf is based on XpdfWidget/Qt™, by Glyph & Cog.
    XpdfWidget is based on the same proven code used in Glyph & Cog's XpdfViewer library.
    The XpdfViewer® library / ActiveX control provides a PDF file viewer component for use in Windows applications.
Quite the rabbit hole!

Any licensing complications? Is it cross-platform? XpdfViewer seems to be proprietary and Windows-only.


Besides licensing issues, I'm not sure the solution should be to link Poppler (which has multiple CVEs every year on average) into the database server, especially if you process untrusted data. It seems like a great way to gain access to the database server.

Functionally it looks useful, but if these kinds of 'helpers' catch on, there really should be a way to sandbox the 'parser' processes.


Totally agree, this data should be supplied by a "page server" (analogous to a frame server in video production) over http using pdf.js so it can run in a browser based sandbox.

The risks of running this code are just way too high without an org level security policy about what access this compromised machine would have.


I keep going back and forth trying to figure out if this is sarcasm or not. Firstly it sounds sensible, then you're talking about PDF.js in a browser sandbox?!


Not sarcasm. When parsing untrusted, complex input with untrusted code, one should use multiple isolation domains.

    pgsql -> http -> [ firejail [ deno pdf.js (file:///untrusted.pdf) ] ]


OK, so you're using Deno with pdf.js instead of Poppler. While JavaScript is mostly 'memory safe', using all of Deno, pdf.js and firejail makes your attack surface huge and difficult to review or constrain, and will probably tank performance on a big dataset because you have to initialize the whole stack per request. All three of those tools have had significant CVEs too, so adding more layers increases the number of CVEs you have to deal with. I also don't see what firejail buys you when you constrain Deno (or another parser) to a properly secured container or VM.


Good catch, the whole PDF parsing ecosystem is kinda grey on those things. I tried to be extra-careful and I relied on poppler's official statement:

"Note that Poppler is licensed under the GPL, not the LGPL, so programs which call Poppler must be licensed under the GPL as well. See the section History and GPL licensing for more information."

See https://gitlab.freedesktop.org/poppler/poppler/-/blob/master...


Yeah, but the xpdf library part of XpdfViewer is open source (GPLv2).

https://www.xpdfreader.com/download.html

Note that the source code is not on GitHub; each version is just dumped as a tar.

https://www.xpdfreader.com/old-versions.html

I once needed it for some PDF experiments and put it on GitHub (this is the newest version; I did _not_ go through old versions one by one; I just dumped the 2 newest versions).

https://github.com/karelbilek/xpdf-mirror


This is fun. It would be interesting to add the ability to query references inside the page, like images. That could be modeled as a foreign key relationship to the page. I'm using some Python libraries to do that, and everything is wrapped in try/except blocks because PDFs are a mess. I wonder how poppler handles those kinds of files.


Fascinating, you wouldn't happen to have the code up would you? What kind of stuff are you doing, and what libraries did you use?


Sorry it's not my code to publish. I'm happy to answer general questions over email (on profile page).

PyPDF2 and Pillow to process images at a high level.


Interesting: I posted this a few days back, and certainly not "10 hours ago". Who was kind enough to re-surface this? Thanks :)

Some clarifications on a few comments I see downstream:

The motivating example was to easily support Full-Text Search (FTS) on PDFs with SQL only (see blog post https://tselai.com/full-text-search-pdf-postgres ). You can treat `pdf` as an alias for `text` and do everything possible.

On the next iteration, I made `pdf` a type (typical varlena object of bytes) to avoid hitting disk all the time. The file is loaded from the disk only once (if it's a valid pdf). One can store the `pdf` type (blob of bytes) as a standard Postgres type and use it for subsequent calls. Postgres will do its magic as usual. There is a potential next step of storing the parsed document just to save some time from re-parsing the bytes, but I deemed it a premature optimization.
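To make the FTS idea concrete, a minimal sketch using the casts described here; table, column, and path names are made up, and the exact cast behavior should be checked against the pgPDF docs:

    CREATE TABLE docs (id serial PRIMARY KEY, doc pdf);

    -- Cast a server-side path (text) or a bytea blob to the pdf type.
    INSERT INTO docs (doc) VALUES ('/var/data/report.pdf'::pdf);

    -- Treat the pdf as text and use regular Postgres full-text search.
    SELECT id
    FROM docs
    WHERE to_tsvector('english', doc::text) @@ plainto_tsquery('invoice total');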


> Interesting: I posted this a few days back, and certainly not "10 hours ago". Who was kind enough to re-surface this? Thanks :)

It's HN's second-chance pool: https://news.ycombinator.com/item?id=11662380


For one minute I thought - what a stupid idea, wrong level of abstraction. Now I think I might actually use this in an analysis setting for convenience. I guess I'll quickly find out what kinds of timeouts I'll run into once I ask for the titles of 10k documents.


I looked at it and initially thought it was a stupid idea to store PDF files in your database, and I thought it was even worse when I saw it uses files on the file system. If you want this, index your PDF files into a table when they are uploaded.

The only functions here all take a filesystem path which your database should definitely not have access to - why would you upload files/store PDFs on a database server!!?

These functions for getting the title or modification time of a PDF are also just not that useful.


> why would you upload files/store PDFs on a database server!!?

I don't really understand this question. You can put data where you like. There are no "database servers". These aren't whole toys, take them or leave them. They're made of Lego bricks, and so you can change them.


Yes, but clients shouldn't be encouraged to refer to server paths. It's unnecessary exposure of details and makes it harder to move things around. Also I wonder if this requires extra permissions. I know COPY from server files requires extra permissions:

> COPY naming a file or command is only allowed to database superusers or users who are granted one of the roles pg_read_server_files, pg_write_server_files, or pg_execute_server_program, since it allows reading or writing any file or running a program that the server has privileges to access.
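For the COPY case quoted above, granting one of those predefined roles looks like this (the user name is made up); whether pgPDF's path-taking functions perform an equivalent check isn't stated in this thread:

    GRANT pg_read_server_files TO etl_user;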


What do you mean by clients in this case?


It is perfectly legitimate (and in fact preferable in many cases) to store files in database BLOBs. Backups, consistency, security and (yes!) scalability come to mind.

Some early BLOB implementations had performance problems, which is where this notion that "you shouldn't store files in BLOBs" seems to come from. But modern DBMSes have fast BLOBs, especially if used properly through their streaming API (don't materialize the entire file in memory!).

We have a system in production that stores millions of files in BLOBs, some of them reaching multi-GB sizes, being accessed across the globe (a big enterprise company) by thousands of engineers, and never had performance problems with BLOBs.
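As a hedged sketch of the streaming route in Postgres specifically, large objects let you read slices instead of materializing the whole file; paths and names are illustrative:

    CREATE TABLE attachments (id serial PRIMARY KEY, file_oid oid);

    -- Server-side import creates a large object and returns its OID
    -- (requires appropriate privileges).
    INSERT INTO attachments (file_oid)
    VALUES (lo_import('/var/data/report.pdf'));

    -- Read an 8 KB slice rather than the whole file.
    SELECT lo_get(file_oid, 0, 8192) FROM attachments WHERE id = 1;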


> The only functions here all take a filesystem path

From the documentation:

> Creating a pdf type, by casting either text path or bytea blob.

With the example: SELECT ''::bytea::pdf;

So it's convenient to use the path to test quickly, but you can use anything in PostgreSQL which returns (or can be converted to) a bytea.
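One hedged way to produce that bytea server-side (pg_read_binary_file is a built-in, privileged function, and the path is made up); an application can equally send the bytea as a query parameter:

    SELECT pg_read_binary_file('/var/data/report.pdf')::pdf;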


So potentially there's a full-text search case (?) for doing this, but I'd always just default to adding documents like this to Elasticsearch instead. I just cannot see a good use case for this, and I can see lots of bad ones.


It's probably mainly aimed at data pipeline use cases. It's common to have stages of very "messy" data that you'll want to clean up. This lets you keep that part of it all in Postgres, if you like, if your messy data is PDFs, which is a fairly common case.

Not every object in a database needs to be ready for public consumption. Some of it's there for processing, for certain use cases.


The first iteration of this was actually FTS only https://tselai.com/full-text-search-pdf-postgres

In the second pass I made `pdf` a type.


The Postgres ecosystem keeps impressing me with its creativity.


It would be neat to see this as a TOAST type in Postgres, where the PDF was kept in a data structure alongside its parsed form. It would be relatively straightforward to perform searches and index/reindex deep into the documents.


I'm missing something I think. Currently the `pdf` type is indeed a blob of bytes. Read from disk, but not parsed. The type is created with storage=extended to avoid the typical 8KB tuple size limit.

In practice though, a PDF for most cases has text-like semantics, so with ::pdf::text you can have all the text-indexes you want.
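For example, an expression index on the text cast, assuming the pdf-to-text cast is marked immutable (Postgres requires that for expression indexes); table and column names are illustrative:

    CREATE INDEX docs_fts_idx
        ON docs
        USING gin (to_tsvector('english', doc::text));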


This way, you could keep pagination intact and only retrieve individual pages or sections on demand.


Whilst this is cool, why would we want to push this logic into the DB?

It seems cleaner to keep this in the service layer and use any PDF parsing library and subsequent schema to store the parsed files.


Interesting!

I wonder what the use case is compared to extracting this information in the programming language and then storing it alongside the PDF in separate table columns?


It can be useful for improving an ingestion pipeline: put your PDF collection in a temp table and then extract the information you want with pure SQL.


Slightly related - are PDFs natively compressed? They would probably compress well since they're often mostly text, saving space in the database.


Yes, but so is any PostgreSQL column value the size of a PDF blob (TOAST compresses large values by default).


As a general rule, yes they are compressed. As you point out, they compress really well.


I'm having difficulty understanding what the use case for this is.


Instead of having to have 27 columns you have one... I guess?

This also makes interop more difficult when working with indexing databases (i.e. elastic).

This does minimize the amount of client code required to parse PDFs, though.


But then it means I need to save the PDF in the database rather than some kind of storage


Postgres has been very good at that since 9.0.


RAG! AFAICT it doesn't do OCR off the shelf, but surely you could hook up Nougat/Tesseract/TOCR without too much difficulty and start parsing any PDFs that aren't parsed yet.


Now create a DBeaver extension to view the type.



