This topic comes up periodically as most people think PDFs are some impenetrable binary format, but they’re really not.
They are a graph of objects of different types. The types themselves are well described in the official spec (I’m a sadist, I read it for fun).
My advice is always to convert the PDF to a version without compressed data, like the author here has. My tool of choice is mutool (mutool clean -d in.pdf out.pdf). Then just have a rummage. You'll be surprised by how much you can follow.
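If you want a feel for what you'll be looking at after cleaning, the general shape is below. The object numbers and structure are invented for illustration, but the syntax is genuine PDF:

    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj

    2 0 obj
    << /Type /Pages /Kids [3 0 R] /Count 1 >>
    endobj

Each object is a numbered dictionary, and entries like "2 0 R" are indirect references to other objects. Chase the references and you're walking the graph.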
In the article the author missed a step: looking at the page object to see its resources. That's where the mapping from the font name used in the content stream to the underlying font object is made.
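Concretely, the page's /Resources dictionary is where that lookup lives. A minimal sketch (object numbers made up for the example):

    3 0 obj
    << /Type /Page
       /Parent 2 0 R
       /Resources << /Font << /F1 7 0 R >> >>
       /Contents 4 0 R
    >>
    endobj

When the content stream says "/F1 12 Tf", the viewer resolves /F1 through this dictionary to the actual font object (7 0 R here).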
There's also another important bit missing - most fonts are subset into the PDF. I.e., only the glyphs that are needed are kept in the font. I think that's often where the re-encoding happens. ToUnicode is maintained to allow you to copy text (or search in a PDF). It's a nice-to-have for users (in my experience it's normally there and correct, though).
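For the curious, the ToUnicode entry is just a small CMap stream. The interesting part looks something like this (codes invented for the example):

    2 beginbfchar
    <0001> <0054>
    <0002> <0068>
    endbfchar

which says glyph code 0001 means U+0054 ("T") and 0002 means U+0068 ("h"). With a subset font the codes often bear no relation to the original character codes, which is exactly why copy/paste breaks when this table is missing or wrong.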
> It is a shame Adobe designed a format so hard to work with
PDF was not designed to be editable, nor for anyone to "work with" it in that way.
It was designed (at least the original purpose, circa 1989) to represent printed pages electronically in a format that would view and print identically everywhere. In fact, the initial advertising for the "value" of the PDF format was exactly this: no matter where a recipient viewed your PDF output, it would look, and print, the same as anywhere else.
Wasn't the PDF format based on the Illustrator format?
The weird thing to me is people using a distribution format as an original source. It's right up there with video cameras recording their acquisition footage as MP4 and all of the negative baggage that comes with that.
1.4.4 Portable Document Format (PDF)

Adobe has specified another format, PDF, for portable representation of electronic documents. PDF is documented in the Portable Document Format Reference Manual.

PDF and the PostScript language share the same underlying Adobe imaging model. A document can be converted straightforwardly between PDF and the PostScript language; the two representations produce the same output when printed. However, PDF lacks the general-purpose programming language framework of the PostScript language. A PDF document is a static data structure that is designed for efficient random access and includes navigational information suitable for interactive viewing.
If you find pleasure in something that gives you pain, you're a masochist. A sadist likes inflicting pain on others. Since it seems you like helping people, I'd say it's more likely you're the former. I appreciate the mutool advice!
The PDF format does not give you enough semantic information to know there is a table. The content stream contains instructions such as moving to a coordinate, adding some text, and drawing some lines. No tool can extract tables with 100% precision.
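To make that concrete, here's roughly what one cell and one rule of a "table" look like at the content stream level; there's no table operator, only text placement and stroked lines (coordinates invented):

    BT
    /F1 12 Tf
    72 720 Td
    (Name) Tj
    ET
    72 712 m
    300 712 l
    S

That reads: set the font, move to a position, show a string, then stroke a line between two points. Whether those strokes form a grid is entirely left to whoever interprets the page.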
Yeah, but Textract uses OCR/computer vision even on PDFs with embedded text data, and it can extract tables incredibly well. I believe there isn't an open source equivalent. Maybe some advanced usage of tesseract?
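If you want to experiment in that direction, one plausible open source pipeline is to render pages to images and run tesseract in TSV mode, which gives you word-level bounding boxes you can cluster into rows and columns yourself. A sketch (flags from memory, so check them against your versions):

    # render each page at 300 DPI using poppler's pdftoppm
    pdftoppm -r 300 -png in.pdf page
    # OCR with word-level boxes; --psm 6 assumes a uniform block of text
    tesseract page-1.png out --psm 6 tsv

That won't match Textract's table model out of the box; the clustering step is where the real work is.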
Are the documents scans, or do they have real text on them? It's worth trying to convert them to SVG or HTML using "mutool convert" and then seeing what you can do with the results. If you're dealing with the same type of document each time you'll probably find the patterns in there are common enough that you can easily grab what you want.
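Something along these lines, assuming a reasonably recent mupdf; the output format is inferred from the file suffix (flags from memory, so check mutool's help):

    mutool convert -o out.html in.pdf
    mutool convert -o page%d.svg in.pdf

SVG output is one file per page, hence the %d in the name.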