Readers may also enjoy Steampipe [1], an open source tool to live query 140+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes, etc). It uses Postgres Foreign Data Wrappers under the hood and supports joins etc with other tables. (Disclaimer - I'm a lead on the project.)
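For a flavour of what that looks like, here's a rough sketch against the AWS plugin (table and column names are from memory of the plugin docs, so treat them as illustrative):

```sql
-- Hedged sketch: list running EC2 instances via the Steampipe AWS plugin.
-- Table/column names may differ slightly between plugin versions.
select
  instance_id,
  instance_type,
  region
from
  aws_ec2_instance
where
  instance_state = 'running';
```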
That's a great piece of software indeed. I use it to join and analyze data from multiple APIs with materialized views, as most data analysis software only connects to databases, not APIs.
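The pattern is roughly the one below; the table and column names come from the GitHub plugin and are purely illustrative, so check the hub docs before copying:

```sql
-- Snapshot an API-backed Steampipe table into a materialized view so that
-- BI tools (which expect a plain database) can query it like ordinary rows.
create materialized view issue_snapshot as
select
  repository_full_name,
  number,
  title,
  state
from
  github_issue
where
  repository_full_name = 'turbot/steampipe';

-- Refresh on whatever cadence the analysis needs.
refresh materialized view issue_snapshot;
```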
It's also pretty easy to write custom plugins once you understand how it's done.
Interesting! Can you tell me the typical use case for this? Is it more for DevOps stuff, like querying all your servers to build dashboards of your infra? Or can it also be used for ML / data science, etc.?
Steampipe is most commonly used for DevSecOps. It works with any BI tool for dashboards (it's just Postgres), but is really great when paired with Powerpipe [1] for security benchmarks, visualizing infrastructure and much more [2]. It can definitely be used for a wide range of data scenarios, e.g. there is an OpenAI plugin [3].
We've been using Steampipe every once in a while and it works as advertised, though perhaps SQL isn't always the abstraction of choice mentally. Congrats on hitting v1!
thank you! We believe that SQL makes the simple stuff easy and the hard stuff doable :). Hopefully our repos & hubs have good enough examples to get you started most of the time!
Beyond the Steampipe CLI (which has an embedded Postgres), you can use the steampipe plugins as native Postgres FDWs [1], SQLite extensions [2] or command line export tools [3]. We're definitely interested in bringing the plugins to duckdb as well, but haven't had a chance to focus there yet!
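For the FDW route, the setup is the standard Postgres foreign-data-wrapper dance; the extension, server, and schema names below are placeholders from memory, so follow the hub instructions for the exact incantation:

```sql
-- Hypothetical sketch of wiring a Steampipe plugin FDW into a vanilla Postgres.
-- The extension/server/schema names are assumptions; consult the plugin's hub page.
create extension if not exists steampipe_postgres_github;

create schema if not exists github;

create server steampipe_github
  foreign data wrapper steampipe_postgres_github
  options (config 'token="ghp_..."');

import foreign schema github
  from server steampipe_github
  into github;
```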
Ooo love this rabbit hole. I’ve always used “disclaimer” as “warning”, and would definitely use it for a conflict of interest, but technically the legal field uses it specifically for asserting a lack of liability (ish).
Certainly interesting, but I think I’m gonna stubbornly stand my descriptivist ground here. We conquered “literally”, and “sneak peak” is soon to fall — I’ll add this to the list!
pgPDF: The actual PDF parsing is done by poppler.
Poppler is a PDF rendering library based on the xpdf-3.0 code base.
Xpdf is based on XpdfWidget/Qt™, by Glyph & Cog.
XpdfWidget is based on the same proven code used in Glyph & Cog's XpdfViewer library.
The XpdfViewer® library / ActiveX control provides a PDF file viewer component for use in Windows applications.
Quite the rabbit hole!
Any licensing complications? Is it cross-platform? XpdfViewer seems to be proprietary and Windows-only.
Besides licensing issues, I'm not sure the solution should be to link Poppler (which has multiple CVEs every year on average) into the database server, especially if you process untrusted data.
Seems to be a great way to gain access to the database server.
Functionally it looks useful, but if those kinds of 'helpers' catch on, there really should be a way to sandbox these 'parser' processes.
Totally agree, this data should be supplied by a "page server" (analogous to a frame server in video production) over http using pdf.js so it can run in a browser based sandbox.
The risks of running this code are just way too high without an org level security policy about what access this compromised machine would have.
I keep going back and forth trying to figure out if this is sarcasm or not. Firstly it sounds sensible, then you're talking about PDF.js in a browser sandbox?!
Ok, so you're using Deno with pdf.js instead of Poppler. While JavaScript is mostly 'memory safe', using all of Deno, pdf.js, and firejail makes your attack surface huge and difficult to review or constrain, and will probably tank performance on a big dataset because you have to initialize the whole stack per request. All three of those tools have had significant CVEs too, so adding more layers increases the number of CVEs you have to deal with.
I also don't see what firejail buys you when you constrain deno (or another parser) to a properly secured container or VM.
Good catch, the whole PDF parsing ecosystem is kinda grey on those things.
I tried to be extra-careful and I relied on poppler's official statement:
"Note that Poppler is licensed under the GPL, not the LGPL, so
programs which call Poppler must be licensed under the GPL as well.
See the section History and GPL
licensing for more information."
I once needed it for some PDF experiments and put it on GitHub (this is the newest version; I did _not_ go through old versions one by one; I just dumped the two newest versions).
This is fun. It would be interesting to add the ability to query references inside the page, like images. That could be modeled as a foreign key relationship to the page. I'm using some Python libraries to do that and everything is wrapped in try/except blocks because PDFs are a mess. I wonder how poppler handles those kinds of files.
Interesting: I posted this a few days back, and certainly not "10 hours ago". Who was kind enough to re-surface this? Thanks :)
Some clarifications on a few comments I see downstream:
The motivating example was to easily support Full-Text Search (FTS) on PDFs with SQL only (see blog post https://tselai.com/full-text-search-pdf-postgres ).
You can treat `pdf` as an alias for `text` and do everything you could do with a `text` column.
On the next iteration, I made `pdf` a type (a typical varlena object of bytes) to avoid hitting disk all the time. The file is loaded from disk only once (if it's a valid pdf). One can store the `pdf` type (a blob of bytes) as a standard Postgres type and use it for subsequent calls. Postgres will do its magic as usual.
There is a potential next step of storing the parsed document just to save some time from re-parsing the bytes, but I deemed it a premature optimization.
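To make that concrete, here's a minimal sketch of the flow, assuming the path-to-`pdf` cast described above; the table and column names are made up for illustration:

```sql
-- Load each file from disk once into a stored `pdf` value, then treat it as text.
create table docs (
  id   bigserial primary key,
  path text not null,
  body pdf                      -- varlena blob of bytes, storage = extended
);

insert into docs (path, body)
select p, p::pdf                -- the cast reads and validates the file once
from (values ('/srv/pdfs/report-2023.pdf'),
             ('/srv/pdfs/report-2024.pdf')) as t(p);

-- From here on it's plain Postgres: cast to text and query it.
select id, path
from docs
where body::text ilike '%revenue%';
```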
For one minute I thought - what a stupid idea, wrong level of abstraction.
Now I think I might actually use this in an analysis setting for convenience.
I guess I'll quickly find out what kinds of timeouts I'll run into once I ask for the titles of 10k documents.
I looked at it and initially thought storing PDF files in your database was a stupid idea, and I thought it was even worse when I saw it uses files on the file system. If you want this, index your PDF files into a table when they are uploaded.
The only functions here all take a filesystem path which your database should definitely not have access to - why would you upload files/store PDFs on a database server!!?
These functions to be able to get the title or modification time of a PDF are also just not that useful.
> why would you upload files/store PDFs on a database server!!?
I don't really understand this question. You can put data where you like. There are no "database servers". These aren't finished toys that you take or leave; they're made of Lego bricks, so you can change them.
Yes, but clients shouldn't be encouraged to refer to server paths. It's unnecessary exposure of details and makes it harder to move things around. Also I wonder if this requires extra permissions. I know COPY from server files requires extra permissions:
> COPY naming a file or command is only allowed to database superusers or users who are granted one of the roles pg_read_server_files, pg_write_server_files, or pg_execute_server_program, since it allows reading or writing any file or running a program that the server has privileges to access.
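For comparison, granting that kind of server-file access is a one-liner against one of the predefined roles quoted above (the role name `app_reader` is hypothetical):

```sql
-- Allow a non-superuser role to read files the server itself can access,
-- e.g. for COPY FROM a server path. app_reader is a hypothetical role.
grant pg_read_server_files to app_reader;
```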
It is perfectly legitimate (and in fact preferable in many cases) to store files in database BLOBs. Backups, consistency, security and (yes!) scalability come to mind.
Some early BLOB implementations had performance problems, which is where this notion that "you shouldn't store files in BLOBs" seems to come from. But modern DBMSes have fast BLOBs, especially if used properly through their streaming API (don't materialize the entire file in memory!).
We have a system in production that stores millions of files in BLOBs, some of them reaching multi-GB sizes, being accessed across the globe (a big enterprise company) by thousands of engineers, and never had performance problems with BLOBs.
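The comment doesn't name a DBMS, so purely as a Postgres-flavoured sketch: the large-object API is one way to stream a BLOB in chunks instead of materializing it (paths and OIDs below are illustrative):

```sql
-- Server-side import returns the large object's OID
-- (needs appropriate file-access privileges).
select lo_import('/srv/pdfs/report-2024.pdf');

-- Read a 1 MB slice by offset instead of pulling the whole file at once;
-- 16403 stands in for the OID returned above.
select lo_get(16403, 0, 1024 * 1024);
```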
So potentially there’s a full-text search case (?) for doing this, but I’d always just default to adding documents like this to Elasticsearch instead. I just cannot see a good use case for this and lots of bad ones.
It's probably mainly aimed at data pipeline use cases. It's common to have stages of very "messy" data that you'll want to clean up. This lets you keep that part of it all in Postgres, if you like, if your messy data is PDFs, which is a fairly common case.
Not every object in a database needs to be ready for public consumption. Some of it's there for processing, for certain use cases.
It would be neat to see this as a TOAST type in Postgres, where the PDF was kept in a data structure with the PDF parsed. It would be relatively straightforward to perform searches and index/reindex deep into the documents.
I'm missing something I think.
Currently the `pdf` type is indeed a blob of bytes. Read from disk, but not parsed.
The type is created with storage=extended to avoid the typical 8KB tuple size limit.
In practice though, a PDF in most cases has text-like semantics, so with `::pdf::text` you can have all the text indexes you want.
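A sketch of what that could look like (table/column names are illustrative, reusing a `docs` table with a `pdf`-typed `body` column; if the cast isn't marked immutable, materialize the text into its own column first instead of indexing the expression):

```sql
-- Full-text search over the text cast of a pdf column.
create index docs_body_fts
  on docs
  using gin (to_tsvector('english', body::text));

select id, path
from docs
where to_tsvector('english', body::text)
      @@ websearch_to_tsquery('english', 'quarterly revenue');
```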
I wonder what the use case is compared to extracting this information in the programming language and then storing it alongside the PDF in separate table columns?
RAG! AFAICT it doesn’t do OCR off the shelf, but surely you could hook up Nougat/Tesseract/TOCR without too much difficulty and start parsing any PDFs that aren’t parsed yet.
[1] https://github.com/turbot/steampipe