> At that point, your PDF isn't super-compressed (don't know how to get those)
As far as I know, it's making sure your text-only pages are monochrome (not grayscale) and using Group4 compression for them, which is actually what fax machines use (!) and is optimized specifically for monochrome text. Both TIFF and PDFs support Group4 -- I use ImageMagick to take a scanned input page and run grayscale conversion, contrast adjustment, Group4 monochrome encoding, and PDF conversion in one fell swoop, which generates one PDF per page, and then "pdfunite" to join the pages. Works like a charm.
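A minimal sketch of that per-page pipeline, as a Python wrapper around the CLI tools. It assumes ImageMagick ("convert") and poppler-utils ("pdfunite") are installed; the 60% threshold, the contrast stretch, and the filenames are illustrative, not the exact settings described above.

```python
import shutil
import subprocess
from pathlib import Path

def g4_convert_cmd(page: Path) -> list[str]:
    """ImageMagick command for one page: grayscale -> contrast stretch ->
    hard threshold to monochrome -> single-page PDF with Group4 (CCITT G4)
    compression."""
    return [
        "convert", str(page),
        "-colorspace", "Gray",
        "-contrast-stretch", "1%x1%",
        "-threshold", "60%",          # black/white boundary; tune per scan
        "-compress", "Group4",
        str(page.with_suffix(".pdf")),
    ]

def join_cmd(pages: list[Path], out: Path) -> list[str]:
    """pdfunite command joining the one-page PDFs in order."""
    return ["pdfunite", *[str(p) for p in pages], str(out)]

# Only run the external tools if they are actually installed.
if shutil.which("convert") and shutil.which("pdfunite"):
    scans = sorted(Path(".").glob("page_*.png"))
    for scan in scans:
        subprocess.run(g4_convert_cmd(scan), check=True)
    if scans:
        subprocess.run(
            join_cmd([s.with_suffix(".pdf") for s in scans], Path("book.pdf")),
            check=True)
```

The threshold value is the same kind of black/white boundary the ScanTailor slider controls later in the thread, so it usually needs tuning per book.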
I'm not aware of anything superior to Group4 for regular black and white text pages, but would love to know if there is.
Oh, I should have said that I scan in grayscale, but ScanTailor (at stage 6) makes the output monochrome; that's what the slider is about (it determines the boundary between what will become black and what will become white). So this isn't what I'm missing.
I am not sure if the result is G4-compressed, though. Is there a quick way to tell?
On my system I can run 'pdfimages -list' on a PDF and it lists all the images in the PDF along with their encoding format. The utility comes with 'poppler-utils', I believe.
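A quick way to script that check: run 'pdfimages -list' and pull out each image's "enc" column. The sample output embedded in the helper's docstring (and the filename) is illustrative; Group4 images show up as "ccitt" and JBIG2 ones as "jbig2".

```python
import shutil
import subprocess

def image_encodings(listing: str) -> list[str]:
    """Extract the 'enc' column from `pdfimages -list` output.

    The listing has a header line and a separator line, then one row per
    image; 'enc' is the 9th whitespace-separated column, e.g.:

    page   num  type   width height color comp bpc  enc interp ...
    ------------------------------------------------------------
       1     0 image    2560  3508  gray    1   1  ccitt  no   ...
    """
    rows = listing.splitlines()[2:]
    return [row.split()[8] for row in rows if row.split()]

if shutil.which("pdfimages"):
    result = subprocess.run(["pdfimages", "-list", "book.pdf"],
                            capture_output=True, text=True)
    if result.returncode == 0:
        print(image_encodings(result.stdout))
```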
And I'm just now discovering, by checking my own PDFs, that 'ocrmypdf' will automatically convert Group4 to lossless JBIG2 (if optimizations are enabled), which is supposedly even more efficient for monochrome -- but encoders aren't always available [1].
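A hedged sketch of invoking that: ocrmypdf's '--optimize' flag enables lossless optimizations (including JBIG2 recoding, when a jbig2 encoder is installed), and '--skip-text' leaves pages that already have a text layer alone. The filenames are illustrative.

```python
import shutil
import subprocess

def ocrmypdf_cmd(src: str, dst: str, optimize: int = 1) -> list[str]:
    """Build an ocrmypdf invocation. --optimize 1 and above enables
    lossless optimizations such as Group4 -> JBIG2 recoding (only when
    a JBIG2 encoder is available on the system); --skip-text skips
    pages that already contain text."""
    return ["ocrmypdf", "--optimize", str(optimize), "--skip-text", src, dst]

if shutil.which("ocrmypdf"):
    subprocess.run(ocrmypdf_cmd("book.pdf", "book_opt.pdf"), check=True)
```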
I don't think ImageMagick has been updated yet to support outputting JBIG2 for PDFs.
Ah. It says the encoding is ccitt, which I hope is indeed the same as Group4.
How is the lossless JBIG2 in terms of reading speed? I've seen some very well-compressed PDF files around that unfortunately load so slowly that they are almost unreadable on mobile; I think this was JBIG2. In that case, I'm wondering whether this can be avoided by proper use of the encoder or whether it's a necessary downside of the encoding.
As fast as anything else for all practical purposes.
I too have encountered molasses-slow PDFs, and I can't even begin to guess what causes that. Book PDFs from OpenLibrary are often like that for me. It genuinely makes me wonder if it's producing each page's image with embedded JavaScript writing to a canvas or something... except that might actually still be faster.
I really appreciate having grayscale or color scans of books rather than bilevel black and white. It's often much easier to read, and often illustrations come out mangled into illegibility by thresholding. Occasionally even text does.
I do too, but I find that they're just too big in file size.
Bilevel at around 300 DPI means scanned books that run 2-5 MB. Grayscale/color tends to mean 10-50 MB.
For me it's less about the storage and more about performance -- for everything from downloading to copying to e-mailing to previews to autosaving while highlighting, apps and cloud services seem to cope well and quickly with 3 MB PDFs, but seem to slow down dramatically with 30 MB ones.