If you are referring to my duplication comments, sure (but even then I believe there are duplicates of the exact same edition of the same book). Filtering by filesize is orthogonal to editions and the like, though, so it has nothing to do with that.
I have found the same book as multiple PDFs of different sizes, with the same content. Maybe someone uploaded a poorly scanned PDF when the book was first released, and later someone else uploaded an OCRed version, but the first one just stayed there, hogging a large amount of storage.
How do you automate the process of figuring out which version is better? It's not safe to assume the smaller version is always better, nor the inverse. Particularly for books with images, one version may have passable image quality while the other compressed the images to JPEG mush. And there are considerations that are difficult to judge quantitatively, like the quality of the formatting. Even something seemingly simple like testing whether a book's TOC is linked correctly entails a huge rat's nest of heuristics and guesswork.
My usual heuristic is to take the version with the largest number of pages, or if there are several with the same number of pages, the one with the largest filesize. Obviously if someone is gaming this it won't work; it's trivial to insert mountains of noise into a PDF.
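For what it's worth, that heuristic is easy to script. Here's a minimal sketch in Python, assuming the candidate copies are local files and the pypdf package is installed; the file names and the pick_preferred_pdf function are placeholders of mine, not anything the site actually runs:

    import os
    from pypdf import PdfReader

    def pick_preferred_pdf(paths):
        """Pick the PDF with the most pages; break ties by larger filesize."""
        def score(path):
            try:
                pages = len(PdfReader(path).pages)
            except Exception:
                pages = 0  # unreadable files sort last
            return (pages, os.path.getsize(path))
        return max(paths, key=score)

    # e.g. pick_preferred_pdf(["scan_v1.pdf", "ocr_v2.pdf", "reupload_v3.pdf"])

Of course this inherits the same weakness: page count and filesize are easy to inflate, so it only holds up as long as nobody is gaming it.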
I usually prefer the scanned PDF in these cases, because the OCRed version often contains errors, and in cases where the book matters, those errors can be very hard to detect (incorrect superscripts in equations and the like). Sometimes the scan is so poor that I don't prefer it after all (this is especially a problem with Google Books scans).
As the previous reply said, I've also seen duplicates while browsing. Would it be possible to let users flag duplicates somehow? It involves human unreliability, which is like automated unreliability, only different.