
I don't think OP takes into account that there are often multiple editions of the same book which people need to refer to. Not everyone wants the latest edition when the class you're taking uses some older edition.



In practice, it's more often the same file with minor edits such as a PDF table of contents added or page numbers corrected. Say, how many distinct editions of this standard text on elementary algebraic geometry are in the following list?

http://libgen.rs/search.php?req=cox+little+o%27shea+ideals&o...

Fun fact: the newest one (the 2018 corrected version of the 2015 fourth edition) is not among them.


I like to think that LibGen also serves as a historical database wherein there is a record that a book of a specific edition had its errors corrected. (Although it would be better if errata could be appended to the same file.)

Yes, for very minor edits those duplicate files obviously shouldn't exist, but that would require someone to verify each case, which is such an enormous task that no one is likely to take it up.


I notice they have a place to store the OpenLibrary ID, though I've not seen one filled in as yet.

OpenLibrary provides both Work and Edition IDs, which helps connect the different versions of a book.

Their database is not perfect either, but it might make more sense to keep the bibliographic data separate from the copyrighted content anyway.

https://openlibrary.org/works/OL1849157W/Ideals_varieties_an...
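For what it's worth, here is a rough sketch of pulling the Edition records for that Work through OpenLibrary's JSON API (OL1849157W is the work ID from the link above; the field names are just ones I've seen in their edition records, so treat them as assumptions):

    import requests

    # All Edition records attached to the Work linked above,
    # via OpenLibrary's public JSON endpoint for a work's editions.
    work_id = "OL1849157W"
    resp = requests.get(f"https://openlibrary.org/works/{work_id}/editions.json")
    resp.raise_for_status()

    for ed in resp.json().get("entries", []):
        # publish_date / isbn_13 are present on many but not all records.
        print(ed.get("key"), ed.get("publish_date"), ed.get("isbn_13"))

Each entry corresponds to one Edition of the Work, which is exactly the distinction that would help collapse near-duplicates on the LibGen side.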


If you are referring to my duplication comments, sure (but even then I believe there are duplicates of the exact same edition of the same book). Filtering by file size, though, is orthogonal to editions etc., so it has nothing to do with that.


I agree. There are duplicates. I have seen them.

I have found the same book as multiple PDFs of different sizes with the same content. Maybe someone uploaded a poorly scanned PDF when the book was first released, and later someone else uploaded an OCRed version, but the first one just stayed there, hogging a large amount of storage.


How do you automate the process of figuring out which version is better? It's not safe to assume the smaller versions are always better, nor the inverse. Particularly for books with images, one version of the book may have passable image quality while the other compressed the images to JPEG mush. And there are considerations that are difficult to judge quantitatively, like the quality of formatting. Even something seemingly simple like testing whether a book's TOC is linked correctly entails a huge rat's nest of heuristics and guesswork.
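Even the "easy" TOC check really only works as a first-pass filter. Here's a minimal sketch with pypdf that just tests for the presence of outline (bookmark) entries, nothing about whether they point to the right pages ("book.pdf" is a placeholder path):

    from pypdf import PdfReader

    def has_linked_toc(path: str) -> bool:
        # Does the PDF carry any outline (bookmark) entries at all?
        # Says nothing about whether they point to the right pages.
        reader = PdfReader(path)
        try:
            return len(reader.outline) > 0
        except Exception:
            # Malformed outlines are common in scanned uploads.
            return False

    print(has_linked_toc("book.pdf"))

Everything past that (are the entries correct, complete, at the right depth?) is where the guesswork starts.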


My usual heuristic is to take the version with the largest number of pages, or if there are several with the same number of pages, the one with the largest filesize. Obviously if someone is gaming this it won't work; it's trivial to insert mountains of noise into a PDF.
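In code that's roughly the following (a sketch using pypdf; the file names are placeholders, and it inherits the gaming problem just mentioned):

    import os
    from pypdf import PdfReader

    def pick_candidate(paths):
        # Most pages wins; ties broken by largest file size.
        def key(path):
            return (len(PdfReader(path).pages), os.path.getsize(path))
        return max(paths, key=key)

    print(pick_candidate(["upload_a.pdf", "upload_b.pdf", "upload_c.pdf"]))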


I don’t think anyone is arguing it can be fully automated, but automating the selection of books to manually review is certainly viable.
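For instance, grouping records by a normalized author/title/edition key and flagging any group with more than one file already gives you a review queue. A sketch (the record fields here are hypothetical, not the actual LibGen schema):

    from collections import defaultdict

    def review_queue(records):
        # records: list of dicts with hypothetical "author"/"title"/"edition" fields.
        groups = defaultdict(list)
        for r in records:
            key = (r["author"].strip().lower(),
                   r["title"].strip().lower(),
                   str(r.get("edition", "")).strip().lower())
            groups[key].append(r)
        # Only groups with more than one file need human eyes.
        return {k: v for k, v in groups.items() if len(v) > 1}

A human then only has to look at the flagged groups rather than the whole catalogue.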


I usually prefer the scanned PDF in these cases, because the OCRed version often contains errors, and in cases where the book matters, those errors can be very difficult to detect (incorrect superscripts in equations and things like that). Sometimes, though, the scan is so poor that I don't prefer it after all (this is especially a problem with Google Books scans).


As the previous reply said, I've also seen duplicates while browsing. Would it be possible to let users flag duplicates somehow? It involves human unreliability, which is like automated unreliability, only different.





