> At that point, your PDF isn't super-compressed (don't know how to get those)
As far as I know, it's making sure your text-only pages are monochrome (not grayscale) and using Group4 compression for them, which is actually what fax machines use (!) and is optimized specifically for monochrome text. Both TIFF and PDFs support Group4 -- I use ImageMagick to take a scanned input page and run grayscale conversion, contrast adjustment, Group4 monochrome encoding, and PDF conversion in one fell swoop, which generates one PDF per page, and then "pdfunite" to join the pages. Works like a charm.
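A minimal sketch of that per-page pipeline, as a Python wrapper around the CLI tools. It assumes ImageMagick ("convert") and poppler-utils ("pdfunite") are installed; the 60% threshold, the contrast stretch, and the filenames are illustrative, not the exact settings described above.

```python
import shutil
import subprocess
from pathlib import Path

def g4_convert_cmd(page: Path) -> list[str]:
    """ImageMagick command for one page: grayscale -> contrast stretch ->
    hard threshold to monochrome -> single-page PDF with Group4 (CCITT G4)
    compression."""
    return [
        "convert", str(page),
        "-colorspace", "Gray",
        "-contrast-stretch", "1%x1%",
        "-threshold", "60%",          # black/white boundary; tune per scan
        "-compress", "Group4",
        str(page.with_suffix(".pdf")),
    ]

def join_cmd(pages: list[Path], out: Path) -> list[str]:
    """pdfunite command joining the one-page PDFs in order."""
    return ["pdfunite", *[str(p) for p in pages], str(out)]

# Only run the external tools if they are actually installed.
if shutil.which("convert") and shutil.which("pdfunite"):
    scans = sorted(Path(".").glob("page_*.png"))
    for scan in scans:
        subprocess.run(g4_convert_cmd(scan), check=True)
    if scans:
        subprocess.run(
            join_cmd([s.with_suffix(".pdf") for s in scans], Path("book.pdf")),
            check=True)
```

The threshold value is the same kind of black/white boundary the ScanTailor slider controls later in the thread, so it usually needs tuning per book.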
I'm not aware of anything superior to Group4 for regular black and white text pages, but would love to know if there is.
Oh, I should have said that I scan in grayscale, but ScanTailor (at stage 6) makes the output monochrome; that's what the slider is about (it determines the boundary between what will become black and what will become white). So this isn't what I'm missing.
I am not sure if the result is G4-compressed, though. Is there a quick way to tell?
On my system I can run 'pdfimages -list' on a PDF and it lists all the images in the PDF along with their encoding format. The utility comes with 'poppler-utils', I believe.
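A quick way to script that check: run 'pdfimages -list' and pull out each image's "enc" column. The sample output embedded in the helper's docstring (and the filename) is illustrative; Group4 images show up as "ccitt" and JBIG2 ones as "jbig2".

```python
import shutil
import subprocess

def image_encodings(listing: str) -> list[str]:
    """Extract the 'enc' column from `pdfimages -list` output.

    The listing has a header line and a separator line, then one row per
    image; 'enc' is the 9th whitespace-separated column, e.g.:

    page   num  type   width height color comp bpc  enc interp ...
    ------------------------------------------------------------
       1     0 image    2560  3508  gray    1   1  ccitt  no   ...
    """
    rows = listing.splitlines()[2:]
    return [row.split()[8] for row in rows if row.split()]

if shutil.which("pdfimages"):
    result = subprocess.run(["pdfimages", "-list", "book.pdf"],
                            capture_output=True, text=True)
    if result.returncode == 0:
        print(image_encodings(result.stdout))
```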
And I'm just now discovering, by checking my own PDFs, that 'ocrmypdf' will automatically convert Group4 to lossless JBIG2 (if optimizations are enabled), which is supposedly even more efficient for monochrome -- but encoders aren't always available [1].
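A hedged sketch of invoking that: ocrmypdf's '--optimize' flag enables lossless optimizations (including JBIG2 recoding, when a jbig2 encoder is installed), and '--skip-text' leaves pages that already have a text layer alone. The filenames are illustrative.

```python
import shutil
import subprocess

def ocrmypdf_cmd(src: str, dst: str, optimize: int = 1) -> list[str]:
    """Build an ocrmypdf invocation. --optimize 1 and above enables
    lossless optimizations such as Group4 -> JBIG2 recoding (only when
    a JBIG2 encoder is available on the system); --skip-text skips
    pages that already contain text."""
    return ["ocrmypdf", "--optimize", str(optimize), "--skip-text", src, dst]

if shutil.which("ocrmypdf"):
    subprocess.run(ocrmypdf_cmd("book.pdf", "book_opt.pdf"), check=True)
```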
I don't think ImageMagick has been updated yet to support outputting JBIG2 for PDFs.
Ah. It says the encoding is ccitt, which I hope is indeed the same as Group4.
How is the lossless JBIG2 in terms of reading speed? I've seen some very well-compressed PDF files around that unfortunately load so slowly that they are almost unreadable on mobile; I think this was JBIG2. In that case, I'm wondering whether this can be avoided by proper use of the encoder or whether it's a necessary downside of the encoding.
As fast as anything else for all practical purposes.
I too have encountered molasses-slow PDFs, and I can't even begin to guess what causes that. Book PDFs from OpenLibrary are often like that for me. It genuinely makes me wonder if it's producing each page's image with embedded JavaScript writing to a canvas or something... except that might actually still be faster.
I really appreciate having grayscale or color scans of books rather than bilevel black and white. It's often much easier to read, and often illustrations come out mangled into illegibility by thresholding. Occasionally even text does.
I do too, but I find that they're just too big in file size.
Bilevel at around 300 DPI means scanned books that run 2-5 MB. Grayscale/color tends to mean 10-50 MB.
For me it's less about the storage and more about performance -- for everything from downloading to copying to e-mailing to previews to autosaving while highlighting, apps and cloud services seem to cope well and quickly with 3 MB PDFs, but seem to slow down dramatically with 30 MB ones.