This topic comes up periodically as most people think PDFs are some impenetrable binary format, but they’re really not.
They are a graph of objects of different types. The types themselves are well described in the official spec (I’m a sadist, I read it for fun).
My advice is always to convert the PDF to a version without compressed data, like the author here has. My tool of choice is mutool (mutool clean -d in.pdf out.pdf). Then just have a rummage. You'll be surprised by how much you can follow.
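If you want a feel for what you'll be looking at after cleaning, the general shape is below. The object numbers and structure are invented for illustration, but the syntax is genuine PDF:

    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj

    2 0 obj
    << /Type /Pages /Kids [3 0 R] /Count 1 >>
    endobj

Each object is a numbered dictionary, and entries like "2 0 R" are indirect references to other objects. Chase the references and you're walking the graph.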
In the article the author missed a step: looking at the page object to see its resources. That's where the mapping from the font name used in the content stream to the underlying font object is made.
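Concretely, the page's /Resources dictionary is where that lookup lives. A minimal sketch (object numbers made up for the example):

    3 0 obj
    << /Type /Page
       /Parent 2 0 R
       /Resources << /Font << /F1 7 0 R >> >>
       /Contents 4 0 R
    >>
    endobj

When the content stream says "/F1 12 Tf", the viewer resolves /F1 through this dictionary to the actual font object (7 0 R here).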
There's also another important bit missing - most fonts are subset into the PDF. I.e., only the glyphs that are needed are kept in the font. I think that's often where the re-encoding happens. ToUnicode is maintained to allow you to copy text (or search in a PDF). It's a nice-to-have for users (in my experience it's normally there and correct, though).
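For the curious, the ToUnicode entry is just a small CMap stream. The interesting part looks something like this (codes invented for the example):

    2 beginbfchar
    <0001> <0054>
    <0002> <0068>
    endbfchar

which says glyph code 0001 means U+0054 ("T") and 0002 means U+0068 ("h"). With a subset font the codes often bear no relation to the original character codes, which is exactly why copy/paste breaks when this table is missing or wrong.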
> It is a shame Adobe designed a format so hard to work with
PDF was not designed to be editable, nor for anyone to "work with" it in that way.
It was designed (at least the original purpose, circa 1989) to represent printed pages electronically in a format that would view and print identically everywhere. In fact, the initial advertising for the "value" of the PDF format was exactly this: no matter where a recipient viewed your PDF output, it would look, and print, the same as anywhere else.
Wasn't the PDF format based on the Illustrator format?
The weird thing to me is people using a distribution format as an original source. It's right up there with video cameras recording their acquisition footage as MP4 and all of the negative baggage that comes with that.
1.4.4 Portable Document Format (PDF)

Adobe has specified another format, PDF, for portable representation of electronic documents. PDF is documented in the Portable Document Format Reference Manual.

PDF and the PostScript language share the same underlying Adobe imaging model. A document can be converted straightforwardly between PDF and the PostScript language; the two representations produce the same output when printed. However, PDF lacks the general-purpose programming language framework of the PostScript language. A PDF document is a static data structure that is designed for efficient random access and includes navigational information suitable for interactive viewing.
If you find pleasure in something that gives you pain, you're a masochist. A sadist likes inflicting pain on others. Since it seems you like helping people, I'd say it's more likely you're the former. I appreciate the mutool advice!
The PDF format does not give you enough semantic information to know there is a table. The content stream contains instructions such as moving to a coordinate, adding some text, and drawing some lines. No tool can extract tables with 100% precision.
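To make that concrete, here's roughly what one cell and one rule of a "table" look like at the content stream level; there's no table operator, only text placement and stroked lines (coordinates invented):

    BT
    /F1 12 Tf
    72 720 Td
    (Name) Tj
    ET
    72 712 m
    300 712 l
    S

That reads: set the font, move to a position, show a string, then stroke a line between two points. Whether those strokes form a grid is entirely left to whoever interprets the page.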
Yeah, but Textract uses OCR/computer vision even on PDFs with embedded text data, and it can extract tables incredibly well. I believe there isn't an open source equivalent. Maybe some advanced usage of tesseract?
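If you want to experiment in that direction, one plausible open source pipeline is to render pages to images and run tesseract in TSV mode, which gives you word-level bounding boxes you can cluster into rows and columns yourself. A sketch (flags from memory, so check them against your versions):

    # render each page at 300 DPI using poppler's pdftoppm
    pdftoppm -r 300 -png in.pdf page
    # OCR with word-level boxes; --psm 6 assumes a uniform block of text
    tesseract page-1.png out --psm 6 tsv

That won't match Textract's table model out of the box; the clustering step is where the real work is.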
Are the documents scans, or do they have real text on them? It's worth trying to convert them to SVG or HTML using "mutool convert" and then seeing what you can do with the results. If you're dealing with the same type of document each time you'll probably find the patterns in there are common enough that you can easily grab what you want.
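Something along these lines, assuming a reasonably recent mupdf; the output format is inferred from the file suffix (flags from memory, so check mutool's help):

    mutool convert -o out.html in.pdf
    mutool convert -o page%d.svg in.pdf

SVG output is one file per page, hence the %d in the name.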