Amazing, I've been yearning for something like this for years but have always be...

eurasiantiger · on Sept 18, 2021

It is literally impossible for all PDFs, since some of them may not have any kind of semantic structure and consist of a set of glyph bitmaps laid out at specific coordinates to make up blocks of text.

anonymouse008 · on Sept 18, 2021

The best PDF bug is when the link between the Adobe character value and the font/language is broken and you get random Unicode-like values with no way to connect the two.

Infuriating.

dunham · on Sept 18, 2021

Some PDF files intentionally include a bad character mapping table (and reorder the font) as a form of DRM.

mkl · on Sept 18, 2021

It's not very effective against anyone determined, though. You can OCR easily. You can also rebuild the character mapping from the shapes of the glyphs, and in most languages there are few enough glyphs that you can even do it by hand.

mcswell · on Sept 18, 2021

You can OCR if you're using a Latin script with few if any accent marks. Depending on your OCR engine, I suppose you could do Cyrillic too. Other scripts, not so much. (And yes, I'm a computational linguist, so we deal with non-Roman scripts all the time, particularly Arabic script. But I suppose that's not a problem for most people here :-).)

There might be some Latin script fonts that cause problems, but I haven't looked into that very much--I do recall we had problems with an italic font.

dunham · on Sept 18, 2021

When I came across this I already had a pristine copy of the font, so I just compared the program for each character to determine the mapping. (I was automating the decoding.) I agree that there is little to no security there.

But the point that I was not so clearly trying to make was that sometimes the messed-up encoding is intentional and not a bug.
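
For anyone curious what "compare the program for each character" could look like, here is a rough Python sketch. It is not the code used above, just an illustration under assumptions: each font is already loaded into a plain dict, with the pristine font keyed by real character and the scrambled font keyed by the PDF's character code, and the values are the raw glyph program bytes. The names `rebuild_mapping`, `decode`, `reference_glyphs`, and `scrambled_glyphs` are made up for the example, and the glyph data is toy data.

```python
def rebuild_mapping(reference_glyphs, scrambled_glyphs):
    """Match scrambled character codes to real characters by glyph program.

    reference_glyphs: dict of real character -> glyph program bytes (pristine font)
    scrambled_glyphs: dict of character code -> glyph program bytes (font in the PDF)
    Both layouts are assumptions made for this sketch.
    """
    # Index the pristine font by glyph program so each lookup is an exact match.
    by_program = {}
    for char, program in reference_glyphs.items():
        by_program.setdefault(program, []).append(char)

    mapping = {}
    unresolved = []
    for code, program in scrambled_glyphs.items():
        candidates = by_program.get(program, [])
        if len(candidates) == 1:
            mapping[code] = candidates[0]
        else:
            # No match, or several identical glyphs: leave it for a human to sort out.
            unresolved.append(code)
    return mapping, unresolved


def decode(codes, mapping, placeholder="\ufffd"):
    """Translate a sequence of scrambled codes using the rebuilt mapping."""
    return "".join(mapping.get(code, placeholder) for code in codes)


# Toy data: three glyphs whose codes were shuffled in the "protected" font.
reference_glyphs = {"a": b"\x01\x02", "b": b"\x03\x04", "c": b"\x05\x06"}
scrambled_glyphs = {0x41: b"\x05\x06", 0x42: b"\x01\x02", 0x43: b"\x03\x04"}

mapping, unresolved = rebuild_mapping(reference_glyphs, scrambled_glyphs)
decoded = decode([0x42, 0x43, 0x41], mapping)
print(decoded)
```

Exact byte comparison only works when the embedded glyph programs are unmodified copies of the pristine font's. If the font was subsetted or re-encoded, you would need a fuzzier comparison, for example on rendered shapes.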