
This topic comes up periodically as most people think PDFs are some impenetrable binary format, but they’re really not.

They are a graph of objects of different types. The types themselves are well described in the official spec (I’m a sadist, I read it for fun).

My advice is always to convert the PDF to a version without compressed data, like the author here has done. My tool of choice is mutool (mutool clean -d in.pdf out.pdf). Then just have a rummage; you’ll be surprised by how much you can follow.
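
For example, after cleaning, a page’s content stream sits in the file as plain text you can read directly. A sketch of what you’ll see (object number and /Length here are illustrative):

    4 0 obj
    << /Length 44 >>
    stream
    BT /F1 24 Tf 72 720 Td (Hello, world) Tj ET
    endstream
    endobj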

In the article the author missed a step where you look at the page object to see the resources. That’s where the mapping from the font name used in the content stream to the underlying object is made.
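
A minimal sketch of that step (object numbers invented): the page’s /Resources dictionary is what turns a name like /F1 in the content stream into a reference to the actual font object.

    3 0 obj
    << /Type /Page
       /Parent 2 0 R
       /MediaBox [0 0 612 792]
       /Resources << /Font << /F1 7 0 R >> >>
       /Contents 4 0 R
    >>
    endobj

So when the content stream says /F1 12 Tf, the viewer looks up /F1 here and follows the reference to object 7, the font dictionary.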

There’s also another important bit missing - most fonts are subset into the PDF, i.e. only the glyphs that are needed are kept in the font. I think that’s often where the re-encoding happens. ToUnicode is maintained to allow you to copy text (or search in a PDF). It’s a nice-to-have for users (in my experience it’s normally there and correct, though).
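
If you want to see the re-encoding with your own eyes, the font’s ToUnicode entry points at a small CMap stream; the interesting parts are the bfchar/bfrange sections that map the subset’s glyph codes back to Unicode code points. Something like (codes below are invented):

    beginbfchar
    <0003> <0041>          % glyph code 3 -> U+0041 "A"
    <0044> <0065>          % glyph code 0x44 -> U+0065 "e"
    endbfchar
    beginbfrange
    <0010> <0019> <0030>   % codes 0x10..0x19 -> U+0030..U+0039 ("0".."9")
    endbfrange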



> I’m a sadist, I read it for fun.

I think this is called a masochist. Now, if you participated in writing the spec or were making others read it...


Yup, slip of the tongue. Though, I do make other people read the spec at work, so I’m that too.


a sadist is a masochist who follows the golden rule


It is a shame Adobe designed a format so hard to work with that people are amazed when someone accomplishes what should be a basic task with it.

Their design philosophy of creating a read-only format was flawed to begin with. What's the first feature people are going to ask for??


> It is a shame Adobe designed a format so hard to work with

PDF was not designed to be editable, nor for anyone to "work with" it in that way.

It was designed (at least the original purpose, circa 1989) to represent printed pages electronically in a format that would view and print identically everywhere. In fact, the initial advertising for the "value" of the PDF format was exactly this: no matter where a recipient viewed your PDF output, it would look, and print, the same as everywhere else.

It was originally meant to be "electronic paper".


Wasn't the PDF format based on the Illustrator format?

The weird thing to me is people using a distribution format as an original source. It’s right up there with video cameras shooting acquisition footage as MP4, with all of the negative baggage that comes with that.


1.4.4 Portable Document Format (PDF)

Adobe has specified another format, PDF, for portable representation of electronic documents. PDF is documented in the Portable Document Format Reference Manual. PDF and the PostScript language share the same underlying Adobe imaging model. A document can be converted straightforwardly between PDF and the PostScript language; the two representations produce the same output when printed. However, PDF lacks the general-purpose programming language framework of the PostScript language. A PDF document is a static data structure that is designed for efficient random access and includes navigational information suitable for interactive viewing.

-- https://www.adobe.com/jp/print/postscript/pdfs/PLRM.pdf


This is a very valuable link - with the reference manual in hand you can just generate PDFs yourself, by hand or by script.
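
It really is hand-writable. A minimal sketch of a one-page (blank) PDF; the xref offsets must be the byte positions of each "N 0 obj" line, counted with Unix newlines as written here, though many viewers will rebuild a broken xref anyway:

    %PDF-1.4
    1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
    2 0 obj << /Type /Pages /Kids [3 0 R] /Count 1 >> endobj
    3 0 obj << /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >> endobj
    xref
    0 4
    0000000000 65535 f 
    0000000009 00000 n 
    0000000058 00000 n 
    0000000115 00000 n 
    trailer << /Size 4 /Root 1 0 R >>
    startxref
    186
    %%EOF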


> The weird thing to me is people using a distribution format as an original source.

Every distribution format will inevitably end up being used as a source, as the originals get lost in the mists of time.


I believe the Illustrator format is very similar to PostScript.


.. waves to Leonard Rosenthol


If you find pleasure in something that gives you pain, you’re a masochist. A sadist likes inflicting pain on others. Since you seem to like helping people, I’d say it’s more likely you’re the former. I appreciate the mutool advice!


That's awesome. I'm relying a lot on Amazon Textract for my PDF parsing needs.

Do you have any other insights on how to do a good job at that natively, i.e. without a cloud provider? Especially when dealing with tables.


The PDF format does not give you enough semantic information to tell that there is a table. The content stream contains instructions such as moving to a coordinate, adding some text, adding some lines. No tool can extract tables from PDFs with 100% precision.
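
To make that concrete, a "table" in a content stream is just positioned text plus line-drawing operators; nothing in the stream marks it as a table (coordinates invented):

    BT /F1 10 Tf 72 700 Td (Name) Tj ET    % cell 1
    BT /F1 10 Tf 200 700 Td (Qty) Tj ET    % cell 2
    72 695 m 300 695 l S                   % rule under the row

Any table detection is reverse-engineering that geometry.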


Yeah, but Textract uses OCR/computer vision even on PDFs with embedded text data, and it can extract tables incredibly well. I believe there isn’t an open source equivalent. Maybe some advanced usage of tesseract?


This seems to have stalled, but it popped up a few times on HN in the past. Might still be worth a look.

https://github.com/tabulapdf/tabula

Are the documents scans, or do they have real text on them? It’s worth trying to convert them to svg or html using “mutool convert” and then seeing what you can do with the results. If you’re dealing with the same type of document each time you’ll probably find the patterns in there are common enough that you can easily grab what you want.
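
Something like this (the output format is inferred from the extension; if I remember right, %d in the name expands to the page number for per-page formats like SVG):

    mutool convert -o out.html in.pdf
    mutool convert -o page%d.svg in.pdf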



