
On my system I can run 'pdfimages -list' on a PDF, and it lists all the images in the PDF with their encoding format. The utility comes with 'poppler-utils', I believe.
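For a PDF with many pages, the listing can be tallied by encoding. Here is a rough Python sketch, assuming the usual poppler column layout where 'enc' is the ninth whitespace-separated field; the function names and the sample rows are made up for illustration, and the layout may differ between poppler versions.

```python
import subprocess
from collections import Counter


def list_images(pdf_path: str) -> str:
    """Return the raw text printed by `pdfimages -list` (needs poppler-utils)."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def count_encodings(listing: str) -> Counter:
    """Tally the 'enc' column of a `pdfimages -list` listing."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; the header and dashed rule do not.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        # Assumed layout: page num type width height color comp bpc enc ...
        counts[fields[8]] += 1
    return counts


# Illustrative listing only, not taken from a real file.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2480  3508  gray    1   1  ccitt  no         8  0   300   300 42.1K 3.9%
   2     1 image    2480  3508  gray    1   1  jbig2  no        12  0   300   300 30.5K 2.9%
   3     2 image    1240  1754  rgb     3   8  jpeg   no        16  0   150   150  210K 3.3%
"""

if __name__ == "__main__":
    for enc, n in sorted(count_encodings(SAMPLE).items()):
        print(enc, n)
```

On a real file, count_encodings(list_images("scan.pdf")) would give the same kind of tally, where "scan.pdf" is a placeholder path.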

And I'm just now discovering, by checking my own PDFs, that 'ocrmypdf' will automatically convert Group4 to lossless JBIG2 (if optimizations are enabled), which is supposedly even more efficient for monochrome, but encoders aren't always available [1].
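To check whether that conversion can happen on a given machine, something like the following Python sketch might help. It assumes the JBIG2 encoder is the jbig2enc tool installed as an executable named 'jbig2', and that optimization level 1 is the lossless setting; both helper names are hypothetical, so verify the details against the page in [1].

```python
import shutil


def jbig2_encoder_available() -> bool:
    """True if an executable named 'jbig2' is on PATH (assumed jbig2enc name)."""
    return shutil.which("jbig2") is not None


def ocrmypdf_command(src: str, dst: str, level: int = 1) -> list[str]:
    """Build an ocrmypdf command line with the given --optimize level."""
    # Level 1 is assumed here to be the lossless optimization setting.
    return ["ocrmypdf", "--optimize", str(level), src, dst]


if __name__ == "__main__":
    if not jbig2_encoder_available():
        print("No 'jbig2' encoder found; JBIG2 conversion will likely be skipped.")
    # Print the command instead of running it; pass it to subprocess.run to execute.
    print(" ".join(ocrmypdf_command("in.pdf", "out.pdf")))
```

Running 'pdfimages -list' on the output afterwards shows whether the images actually ended up as jbig2.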

I don't think ImageMagick has been updated yet to support outputting JBIG2 for PDFs.

[1] https://ocrmypdf.readthedocs.io/en/latest/jbig2.html




Ah. It says the encoding is ccitt, which I hope is indeed the same as Group4.

How is lossless JBIG2 in terms of reading speed? I've seen some very well-compressed PDF files around that unfortunately load so slowly that they are almost unreadable on mobile; I think this was JBIG2. In that case, I'm wondering whether this can be avoided by proper use or is a necessary downside of the encoding.


As fast as anything else for all practical purposes.

I too have encountered molasses-slow PDFs, and I can't even begin to guess what causes that. Book PDFs from OpenLibrary are often like that for me. It genuinely makes me wonder if it's producing each page's image with embedded JavaScript writing to a canvas or something... except that might actually still be faster.


Beware with JBIG2: "Undetectable Data Corruption in JB2/JBIG2" https://news.ycombinator.com/item?id=32537073


I suspect this is referring to the lossy version of JBIG2 (which is essentially the same as DjVu encoding).



