
I don't think OP takes into account that there are often multiple editions of the same book which people need to refer to. Not everyone wants the latest edition when the class you're taking uses some older edition.



In practice, it's more often the same file with minor edits such as a PDF table of contents added or page numbers corrected. Say, how many distinct editions of this standard text on elementary algebraic geometry are in the following list?

http://libgen.rs/search.php?req=cox+little+o%27shea+ideals&o...

Fun fact: the newest one (the 2018 corrected version of the 2015 fourth edition) is not among them.


I like to think that LibGen also serves as a historical database wherein there is a record that a book of a specific edition had its errors corrected. (Although it would be better if errata could be appended to the same file.)

Yes, for very minor edits those duplicate files obviously shouldn't exist, but that would require someone to verify each case, which is such an enormous task that no one is likely to take it up.


I notice they have a place to store the OpenLibrary ID, though I've not seen one filled in as yet.

OpenLibrary provides both Work and Edition IDs, which helps connect the different versions of a book.

Their database is not perfect either, but it might make more sense to keep the bibliographic data separate from the copyrighted content anyway.

https://openlibrary.org/works/OL1849157W/Ideals_varieties_an...
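For what it's worth, here is a rough sketch of pulling the Edition records for that Work through OpenLibrary's JSON API (OL1849157W is the work ID from the link above; the field names are just ones I've seen in their edition records, so treat them as assumptions):

    import requests

    # All Edition records attached to the Work linked above,
    # via OpenLibrary's public JSON endpoint for a work's editions.
    work_id = "OL1849157W"
    resp = requests.get(f"https://openlibrary.org/works/{work_id}/editions.json")
    resp.raise_for_status()

    for ed in resp.json().get("entries", []):
        # publish_date / isbn_13 are present on many but not all records.
        print(ed.get("key"), ed.get("publish_date"), ed.get("isbn_13"))

Each entry corresponds to one Edition of the Work, which is exactly the distinction that would help collapse near-duplicates on the LibGen side.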


If you are referring to my duplication comments, sure (but even then I believe there are duplicates of the exact same edition of the same book). Filtering by file size, though, is orthogonal to editions etc., so it has nothing to do with that.


I agree. There are duplicates. I have seen them.

I have found the same book as multiple PDFs of different sizes with the same content. Maybe someone uploaded a poorly scanned PDF when the book was first released, and later someone else uploaded an OCRed version, but the first one just stayed there, hogging a large amount of storage.


How do you automate the process of figuring out which version is better? It's not safe to assume the smaller versions are always better, nor the inverse. Particularly for books with images, one version of the book may have passable image quality while the other compressed the images to JPEG mush. And there are considerations that are difficult to judge quantitatively, like the quality of formatting. Even something seemingly simple like testing whether a book's TOC is linked correctly entails a huge rat's nest of heuristics and guesswork.
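Even the "easy" TOC check really only works as a first-pass filter. Here's a minimal sketch with pypdf that just tests for the presence of outline (bookmark) entries, nothing about whether they point to the right pages ("book.pdf" is a placeholder path):

    from pypdf import PdfReader

    def has_linked_toc(path: str) -> bool:
        # Does the PDF carry any outline (bookmark) entries at all?
        # Says nothing about whether they point to the right pages.
        reader = PdfReader(path)
        try:
            return len(reader.outline) > 0
        except Exception:
            # Malformed outlines are common in scanned uploads.
            return False

    print(has_linked_toc("book.pdf"))

Everything past that (are the entries correct, complete, at the right depth?) is where the guesswork starts.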


My usual heuristic is to take the version with the largest number of pages, or if there are several with the same number of pages, the one with the largest filesize. Obviously if someone is gaming this it won't work; it's trivial to insert mountains of noise into a PDF.
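In code that's roughly the following (a sketch using pypdf; the file names are placeholders, and it inherits the gaming problem just mentioned):

    import os
    from pypdf import PdfReader

    def pick_candidate(paths):
        # Most pages wins; ties broken by largest file size.
        def key(path):
            return (len(PdfReader(path).pages), os.path.getsize(path))
        return max(paths, key=key)

    print(pick_candidate(["upload_a.pdf", "upload_b.pdf", "upload_c.pdf"]))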


I don’t think anyone is arguing it can be fully automated, but automating the selection of books to manually review is certainly viable.
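For instance, grouping records by a normalized author/title/edition key and flagging any group with more than one file already gives you a review queue. A sketch (the record fields here are hypothetical, not the actual LibGen schema):

    from collections import defaultdict

    def review_queue(records):
        # records: list of dicts with hypothetical "author"/"title"/"edition" fields.
        groups = defaultdict(list)
        for r in records:
            key = (r["author"].strip().lower(),
                   r["title"].strip().lower(),
                   str(r.get("edition", "")).strip().lower())
            groups[key].append(r)
        # Only groups with more than one file need human eyes.
        return {k: v for k, v in groups.items() if len(v) > 1}

A human then only has to look at the flagged groups rather than the whole catalogue.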


I usually prefer the scanned PDF in these cases, because the OCRed version often contains errors, and in cases where the book matters, those errors can be very difficult to detect (incorrect superscripts in equations and things like that). Sometimes, though, the scan is so poor that I don't prefer it after all (this is especially a problem with Google Books scans).


As the previous reply said, I've also seen duplicates while browsing. Would it be possible to let users flag duplicates somehow? It involves human unreliability, which is like automated unreliability, only different.





