This is a bit anecdotal, but I did upload a book to libgen. I am an avid user of the site, and during my thesis research I was looking for a specific book and could not find it on there. I did, however, find it on archive.org. I spent the better part of one afternoon extracting the book from archive.org with some Adobe software, since I had to circumvent some DRM and other things, and all of this was novel to me. In the end I got a scanned PDF that was several hundred MB. I managed to reduce it to 47 MB, but further reduction was not easily possible, at least not with the means I knew or had at my disposal. I uploaded this version to libgen.
I do agree that there may be some large files on there, but I don't agree with removing them. I spent some hours putting this book on there so others who need it can access it within seconds. Removing it because it is too large would void all this effort and require future users to go through a similar process to the one I did just to browse through the book.
Also, any book published today is most likely available in some ebook format, which is much smaller, so I don't think the size of libgen will continue to grow at the pace it is growing now.
To be clear, I am not advocating for the removal of any files larger than 30 MiB (or any other arbitrary hard limits). It'd be great of course to flag large files for further review, but the current software doesn't do a great job at crowdsourcing these kinds of tasks (another one being deduplication) sadly.
Given how little volunteer power there is, I'm suggesting that a "lean edition" of LibGen can still be immensely useful to many people.
Files are a very bad unit to elevate in importance, and the number of files or file size are really bad proxy metrics, especially without considering the statistical distribution of downloads (let alone the question of what is more "important"!). E.g. junk that's under the size limit is implicitly being valued over good content that happens to be larger. Textbooks and reference books will likewise get filtered out with higher likelihood, and that would screw students in countries where they cannot afford them (arguably a more important audience to some, compared to those downloading comics). Etc.
After all this, the most likely human response from people who really depend on this platform would be to slice a big file into volumes under the size limit. That seems like a horrible UX downgrade in the medium to long term, for no other reason than satisfying some arbitrary metric of legibility[1].
Here's a different idea -- might it be worthwhile to convert the larger files to better-compressed versions, e.g. PDF -> DJVU? This would lead to some duplication in the medium term, but if one sees a convincing pattern of users switching to the compressed versions without needing to come back to the larger ones, that would imply that the compressed version works and the larger version could eventually be garbage-collected.
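For reference, a conversion like that can be done with the open-source pdf2djvu; a minimal sketch (the DPI value and file names are just placeholders to tune per book):

    pdf2djvu --dpi=300 -o output.djvu input.pdf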
Thinking in an even more open-ended manner, if this corpus is not growing at a substantial rate, can we just wait out a decade or so of storage improvements before this becomes a non-issue? How long might it take for storage to become 3x, 10x, 30x cheaper?
> can we just wait out a decade or so of storage improvements before this becomes a non-issue?
I'm not sure that there is anything on the horizon which would make duplicate data a 'non-issue'. Capacities are certainly growing, so within a decade we might see 100TB HDDs available and affordable 20TB SSDs. But that does not solve the bandwidth issues. It still takes a long, long time to transfer all the data.
The fastest HDD is still under 300MB/s which means it takes a minimum of 20 hours to read all the data off a 20TB HDD. That is if you could somehow get it to read the whole thing at the maximum sustained read speed.
SSDs are much faster, but it will always be easier to double the capacity than it is to double the speed.
The problem isn't the technology, it's the cost. Given a far larger budget, you wouldn't run the hard drives at anywhere near capacity, in order to gain a read-speed advantage by running a ton of them in parallel. That'll let you read 20 TB in an hour if you can afford it. Put it this way: Netflix is able to do 4K video, and that's far more intensive.
There are people who contribute to the LibGen ecosystem, but unfortunately in areas that don't really benefit the community. Users don't need another CLI tool for LibGen, nor does the community need another bot. Unfortunately that's what folks do: make extensions, CLI tools, and bots that benefit next to no one, and release them willy-nilly with no support.
There's a bunch. Here's what I do (for black-and-white text; I'm not sure how to deal with more complex scenarios):
Scan at 600 dpi resolution. Never mind that this gives huge output files; you'll compress them to something much smaller at the end, and the better your resolution, the stronger the compression you can use without losing readability.
While scanning, periodically clean the camera or the scanner screen, to avoid speckles of dirt on the scan.
The ideal output formats are TIF and PNG; use them if your scanner allows. PDF is also fine (you'll then have to extract the pages into TIF using pdfimages or using ScanKromsator). Use JPG only as a last resort, if nothing else works.
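A minimal sketch of that extraction step with poppler's pdfimages (the -tiff option needs a reasonably recent poppler; 'scan.pdf' and the 'page' prefix are placeholders):

    pdfimages -tiff scan.pdf page    # writes page-000.tif, page-001.tif, ...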
Once you have TIF, PNG or JPG files, put them into a folder. Make sure that the files are sorted correctly: IIRC, the numbers in their names should match their order (i.e., blob030 must be an earlier page than blah045; it doesn't matter whether the numbers are contiguous or what the non-numerical characters are). (I use the shell command mmv for convenient renaming.)
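For example (a sketch with made-up file names; #1 stands for whatever the first wildcard matched):

    mmv 'IMG_*.png' 'page_#1.png'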
Now open these files as a new project in ScanTailor and work through its stages. Stage 1 (Fix Orientation): Use the arrow buttons to make sure all text is upright. Use Q and W to move between pages.
Stage 2 (Split Pages): You can auto-run this using the |> button, but you should check that the result is correct. It doesn't always detect the page borders correctly. (Again, use Q and W to move between pages.)
Stage 3 (Deskew): Auto-run using |>. This is supposed to ensure that all text is correctly rotated. If some text is still skewed, you can detect and fix it later.
Stage 4 (Select Content): This is about cutting out the margins. This is the most grueling and boring stage of the process. You can auto-run it using |>, but it will often cut off too much and you'll have to painstakingly fix it by hand. Alternatively (and much more quickly), set "Content Box" to "Disable" and manually cut off the most obvious parts without trying to save every single pixel. Don't worry: White space will not inflate the size of the ultimate file; it compresses well. The important thing is to cut off the black/grey parts beyond the pages. In this process, you'll often discover problems with your scan or with previous stages. You can always go back to previous stages to fix them.
Stage 5 (Margins): I auto-run this.
Stage 6 (Output): This is important to get right. The despeckling algorithm often breaks formulas (e.g., "..."s get misinterpreted as speckles and removed), so I typically uncheck "Despeckle" when scanning anything technical (it's probably fine for fiction). I also tend to uncheck "Savitzky-Golay smoothing" and "Morphological smoothing"; I don't remember why (probably they broke something for me in some case). The "Threshold" slider is important: experiment with it! (Check which value makes a typical page of your book look crisp. Be mindful of pages that are paler or fatter than others. You can set it for each page separately, but most of the time it suffices to find one value for the whole book, except perhaps the cover.) Note the "Apply To..." buttons; they allow you to promote a setting from a single page to the whole book. (Keep in mind that there are two -- the second one is for the despeckling setting.)
Now look at the tab on the right of the page. You should see "Output" as the active one, but you can switch to "Fill Zones". This lets you white-out (or black-out) certain regions of the page. This is very useful if you see some speckles (or stupid write-ins, or other imperfections) that need removal. I try not to be perfectionistic: The best way to avoid large speckles is by keeping the scanner clean at the scanning stage; small ones aren't too big a deal; I often avoid this stage unless I know I got something dirty. Some kinds of speckles (particularly those that look like mathematical symbols) can be confusing in a scan.
There is also a "Picture Zones" rider for graphics and color; that's beyond my paygrade.
Auto-run stage 6 again at the end (even if you think you've done everything -- it needs to recompile the output TIFFs).
Now, go to the folder where you have saved your project, and more precisely to its "out/" subfolder. You should see a bunch of .tif files, each one corresponding to a page. Your goal is to collect them into one PDF. I usually do this as follows:
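A sketch of one way to do the collecting, assuming ImageMagick is installed and the pages are bilevel TIFFs (tiff2pdf or img2pdf would work as well):

    cd your-project-folder
    convert out/*.tif -compress Group4 book.pdf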
Thus you end up with a PDF in the folder in which your project is.
Optional: add OCR to it; add bookmarks for chapters and sections; add metadata; correct the page numbering (so that page 1 is actual page 1). I use PDF-XChangeLite for this all; but use whatever tool you know best.
At that point, your PDF isn't super-compressed (don't know how to get those), but it's reasonable (about 10MB per 200 pages), and usually the quality is almost professional.
Uploading to LibGen... well, I think they've made the UI pretty intuitive these days :)
PS. If some of this is out of date or unnecessarily complicated, I'd love to hear!
> At that point, your PDF isn't super-compressed (don't know how to get those)
As far as I know, it's making sure your text-only pages are monochrome (not grayscale) and using Group4 compression for them, which is actually what fax machines use (!) and is optimized specifically for monochrome text. Both TIFF and PDF support Group4 -- I use ImageMagick to take a scanned input page and run grayscale, contrast, Group4 monochrome encoding, and PDF conversion in one fell swoop, which generates one PDF per page, and then "pdfunite" to join the pages. Works like a charm.
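Roughly along these lines -- a sketch only, with made-up file names and a threshold you'd tune per scan:

    for f in page_*.png; do
      convert "$f" -colorspace Gray -normalize -threshold 60% \
              -compress Group4 "${f%.png}.pdf"    # one monochrome PDF per page
    done
    pdfunite page_*.pdf book.pdf                  # join the per-page PDFs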
I'm not aware of anything superior to Group4 for regular black and white text pages, but would love to know if there is.
Oh, I should have said that I scan in grayscale, but ScanTailor (at stage 6) makes the output monochrome; that's what the slider is about (it determines the boundary between what will become black and what will become white). So this isn't what I'm missing.
I am not sure if the result is G4-compressed, though. Is there a quick way to tell?
On my system I can run 'pdfimages -list' on a PDF and it gives me all the images in the PDF with their encoding format. The utility comes with 'poppler-utils', I believe.
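E.g.:

    pdfimages -list book.pdf
    # 'ccitt' in the enc column means CCITT Group 3/4 fax compression;
    # 'jbig2' means JBIG2.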
And I'm just now discovering, by checking on my own PDFs, that 'ocrmypdf' will automatically convert Group4 to lossless JBIG2 (if optimizations are enabled), which is supposedly even more efficient for monochrome -- but encoders aren't always available [1].
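Something like this, assuming the jbig2enc encoder is installed (a sketch; see ocrmypdf's docs for what each optimization level does):

    ocrmypdf --optimize 1 input.pdf output.pdf    # add --skip-text if the PDF already has a text layer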
I don't think ImageMagick has been updated yet to support outputting JBIG2 for PDFs.
Ah. It says the encoding is ccitt, which I hope is indeed the same as Group4.
How is the lossless JBIG2 in terms of reading speed? I've seen some very well-compressed PDF files around that unfortunately load so slow that they are almost unreadable on mobile; I think this was JBIG2. In that case, I'm wondering if this can be avoided by proper use or is a necessary downside of the encoding.
As fast as anything else for all practical purposes.
I too have encountered molasses-slow PDFs, and I can't even begin to guess what causes that. Book PDFs from OpenLibrary are often like that for me. It genuinely makes me wonder if it's producing each page's image with embedded JavaScript writing to a canvas or something... except that might actually still be faster.
I really appreciate having grayscale or color scans of books rather than bilevel black and white. It's often much easier to read, and often illustrations come out mangled into illegibility by thresholding. Occasionally even text does.
I do too, but I find that they're just too big in file size.
Bilevel at around 300 DPI means scanned books that run 2-5 MB. Grayscale/color tends to mean 10-50 MB.
For me it's less about the storage and more about performance -- for everything from downloading to copying to e-mailing to previews to autosaving while highlighting, apps and cloud services seem to cope well and quickly with 3 MB PDFs, but seem to slow down dramatically with 30 MB ones.
Nice write-up, thank you! I tried some book scanning a couple of years ago (well, it was mostly about getting better and cleaner PDFs rather than the scanning itself), and the best guide back then was the one by Nate Craun (former maintainer of ScanTailor).
Would you mind writing this guide in a less forum-y site? Or do you know any place to look for a good tutorial and best practices for this hobby?
IMHO a process which is lossy should never be described as deduplication.
What would work out fairly well for this use case is to group files by similarity, and compress them with an algorithm which can look at all 'editions' of a text.
This should mean that storing a PDF with a (perhaps badly, perhaps brilliantly) type-edited version next to it would 'weigh' about as much as the original PDF plus a patch.
> IMHO a process which is lossy should never be described as deduplication.
Depends. There are going to be some cases where files aren't literally duplicates, but the duplicates don't add any value -- for example, MOBI conversions of EPUB files, or multiple versions of an EPUB with different publisher-inserted content (like adding a preview of a sequel, or updating an author's bibliography).
Splitting those into two cases: I think getting rid of format conversions (which can, after all, be performed again) is worthwhile, but it isn't deduplication; that's more like pruning.
Multiple versions of an EPUB with slightly different content is exactly the case where a compression algorithm with an attention span, and some metadata to work with, can get the multiple copies down enough in size that there's no point in disposing of the unique parts.
Plus there are a lot of books where one version is a high-quality scan with no OCR, and the other is an OCRed scan (with a bunch of errors, but searching works 80% of the time) of horrible scan quality.
Also, some books include appendices that are scanned in some versions but not in others, plus large posters that are shrunk to A4 size in one version, split onto multiple A4 pages in another, and kept as one huge page in a third version.
Then there are zips of books containing one PDF plus, e.g., example code, libraries, etc. (e.g. programming books).
Have to be careful there. A jihad against duplication means that poor-quality scans will drive out good ones, or prevent them from ever being created. Especially if you're misguided enough to optimize for minimum file size.
I agree with samatman's position below: as long as the format is the slightest bit lossy -- and it always will be -- aggressive deduplication has more downsides than upsides.
Deduplication doesn't have to mean removal. It might be just tagging. It would be very nice to be able to fetch the "best filesize" version of the entire collection, then pull down the "best quality" editions of only a few things I'm particularly interested in.
One of my favourite places on the internet too. The thing is, you just search for what you want and spend 10 seconds finding the right book and link. While I'd love to mirror whole archive locally, it would really be superfluous because I can only read a couple of quality books at a time anyway, so building my own small archive of annotated PDFs (philosophy is my drug of choice) is better than having the whole. I think it's actually remarkably free of bloat and cruft considering, but maybe I'm not trawling the same corners as you are. Do kind of wish they'd clear out the mobi and djvu versions and make it unified however.
> While I'd love to mirror whole archive locally, it would really be superfluous because I can only read a couple of quality books at a time anyway, [...]
I'd love to agree but as a matter of fact LibGen and Sci-Hub are (forced to be) "pirates" and they are more vulnerable to takedowns than other websites. So while I feel no need to maintain a local copy of Wikipedia, since I'm relatively certain that it'll be alive in the next decade, I cannot say the same about those two with the same certainty (not that I think there are any imminent threats to either, just reasoning a priori).
First, SciHub != LibGen. They are allied projects that clearly share a support base, but they are not identical.
Second, please provide a citation for the assertion that sharing copies of printed fiction erodes sales volume. At this point, one may assume that anything that helps to sell computer games and offline swag is cash-in-bank for content producers. Whether original authors get the same royalties is an interesting question.
Third, the former Soviet milieu probably isn't currently in the mood to cooperate with western law enforcement.
Even what you call LibGen isn't LG. These are LG forks, actually running against LG while pretending to be LG. LG was set up so that other libraries could be created on its basis. Each of the forks aggressively fights for its own dominance in every way, and they resist the development of other forks by naming themselves LG and sucking all the funds into personal possession without public reporting. Being forks themselves, they have closed off the open project for their own ambitions and for a money grab.
Their values are incompatible with LG's, and all that remains similar is the outward-facing part of letting people download books, without which there would be nothing useful to look at.
Yeah, and the herculean work is actually done outside such aggregators by myriads of smaller collections digitizing, binding, processing, collecting, and channeling millions of handmade books into rivers of literature, for free and ready to grab. The growth is global and has little to do with what the forks do.
This would be disastrous for preservation. Often the djvu scan is the only digital version, the book is no longer in print, and the publisher isn't around anymore. The djvu archives often exist specifically because some old book really has, and had, value to people.
Yeah, I always convert DJVU to PDF (pretty easy), but it never compresses quite as nicely.
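E.g. with djvulibre's ddjvu -- a sketch, with the quality setting being just an example:

    ddjvu -format=pdf -quality=85 input.djvu output.pdf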
DJVU is pretty clever in how it uses a mask layer for more efficient compression, and as far as I know, converting to PDF is always done "dumb" -- flattening the DJVU file into a single image and then encoding that image traditionally in PDF.
I wonder if it's possible to create a "lossless" DJVU to PDF converter, or something close to it, if the PDF primitives allow it? I'm not sure if they do, if the "mask layer" can be efficiently reproduced in PDF.
It can be done with relative ease. There is a commercial tool somewhere that does it, because I've run across many PDF files that use a DjVu like structure for scanned books.
It won't be perfectly lossless, because the IW44 compression of the color layer will need to be recompressed as JPEG or JPEG2000. The JB2 mask layer can be losslessly recompressed as JBIG2 or CCITT G4 Fax.
I'm not for clearing out djvu, but it sure is frustrating when a PDF isn't available.
It's not just about laziness preventing one from installing the more obscure ebook readers which support djvu. It's about security: I only trust PDFs when I create them myself with TeX or similar, otherwise I need to use the Chromium PDF reader to be (relatively) safe. I don't trust the readers that support Djvu to be robust enough against maliciously malformed djvu files, as I'm guessing the readers are implemented in some ancient dialect of C or C++ and I doubt they're getting much if any scrutiny in the way of security.
It's super easy to convert a DJVU file to PDF though. There's an increase in filesize but it's not the end of the world.
And since you're creating the PDF yourself seems like you can trust it? Since nothing malicious could survive the DJVU to PDF conversion since it's just "dumb" bitmap-based.
If your DjVu file contains an exploit for your DjVu decoder, even if you run it in a bombproof container, it could still conceivably inject malicious code into the resulting PDF file. That sounds far-fetched because the exploit payload would need to recognize that a PDF conversion was going on and respond by generating the PDF, but I remember when people thought exploiting buffer overflows was implausible, and this is not the same level of rocket science.
djvu is really quite a marvellous format, but I'm only able to read them in Evince (the default PDF reader that comes with Debian, Fedora, and probably a bunch of other distros). For my MacBook I need to download a DJVU reader, and for my iPad I didn't even bother trying, because the experience would likely be much worse than Preview/iBooks.
DJVU is supported by numerous book-reading applications, including (in my experience) FB Reader (FS/OSS), Pocketbook, and Onyx's Neoreader.
As a format for preserving full native scan views (large, but often strongly preferable for visually-significant works or preserving original typesetting / typography), DJVU is highly useful.
I do wish that it were more widely supported by both toolchains and readers. That will come in time, I suspect.
Calibre supports djvu on any platform. Deleting djvu books just because Microsoft and Apple don't see fit to support it by default would be a travesty.
Is there a torrent available that would allow straightforward setup of locally storable and accessible Libgen library? For the storage rich but internet connection reliability poor, something like this would be a godsend.
My comment about djvu was mostly just about my own laziness, because (kill me if you need to) I like using Preview on the Mac for reading and annotating, and it doesn't read them, and once they have to live in a djvu viewer, I tend not to read them or mark them up. Same goes for Adobe Acrobat Reader when I'm on Windows on my University's networked PCs.
I wish they'd clear out the PDF versions and replace them with DjVu versions. DjView is better than any PDF reader I've used, and DjVu files are smaller than scanned PDFs.
That’s not going to work, because outside of e-book enthusiasts, few know what DJVU is and even fewer have the technological skills and will to figure out how to open it. A large part of LibGen’s demographic are university students downloading exorbitantly priced textbooks, and given that, having both a pdf and a djvu available would be ideal.
GNOME-based Linux distributions ship with DjVu support by default, and so do MATE and KDE and most document viewers for Android. But even if you're not using Linux, if you're going to spend 50 hours studying a textbook and you're part of a learning community like a university class, with dozens of people facing the same problem, one of you can spend 0.5 hours figuring out how to install DjView so you can read the textbook. That's a much easier problem to solve than finding out about Library Genesis in the first place, not to mention fixing your legal system so it's legal.
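For reference, on Debian/Ubuntu the install itself is a one-liner (package names may differ on other distros):

    sudo apt install djvulibre-bin djview4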
That's funny, I did the same analysis with sci-hub. Back when there was an organized drive to back it up.
I downloaded parts of it and wanted to figure out why it was so heavy, seeing as you'd expect articles to be mostly text and very light.
There was a similar distribution of file sizes. My immediate instinct was also to cut off the tail-end, but looking at the larger files I realized it was a whole range of good articles that included high quality graphics that were crucial to the research being presented, not poor compression or useless bloat.
I think Sci-Hub is the opposite since 1 DOI = 1 PDF in its canonical form (straight from the publisher) so neither duplication nor low-quality is the case.
It does depend on when the work was published. Pre-digital works scanned in without OCR can be larger in size. That's typically works from the 1980s and before.
Given the explosion of scientific publishing, that's likely a small fraction of the archive by work though it may be significant in terms of storage.
It can be illuminating to look at the size of ePub documents. An ePub is in general a compressed HTML container, so file sizes tend to be quite small. A book-length text (~250 pp or more) might be from 0.3 -- 5 MB, and often at the lower end of the scale.
Books with a large number of images or graphics, however, can still bloat to 40-50 MB or even more.
Otherwise, generally, text-based PDFs (as opposed to scans) are often in the 2--5 MB range, whilst scans can run 40--400 MB. The largest I'm aware of in my own collection is a copy of Lyell's Geography, sourced from Archive.org. It is of course scans of the original 19th century typography. Beautiful to read, but a bit on the weighty side.
I don't think OP takes into account that there seem to be multiple editions of the same book which are often required by people to refer to. Not everyone wants the latest edition when the class you're in is using some old edition.
In practice, it's more often the same file with minor edits such as a PDF table of contents added or page numbers corrected. Say, how many distinct editions of this standard text on elementary algebraic geometry are in the following list?
I like to think that LibGen also serves as a historical database wherein there is a record that a book of a specific edition had its errors corrected. (Although it would be better if errata could be appended to the same file if possible)
Yes, for very minor edits, those files should obviously not exist, but for that there would need to be someone who verifies this, which is such an enormous task that likely no one would take it up.
If you are referring to my duplication comments, sure (but even then I believe there are duplicates of the exact same edition of the same book). Though the filtering by filesize is orthogonal to editions etc. so has nothing to do with that.
I have found the same book as multiple PDFs of different sizes with the same content. Maybe someone uploaded a poorly scanned PDF when the book was first released, but later someone else uploaded an OCRed version, and the first one just stayed there hogging a large amount of storage.
How do you automate the process of figuring out which version is better? It's not safe to assume the smaller versions are always better, nor the inverse. Particularly for books with images, one version of the book may have passable image quality while the other compressed the images to jpeg mush. And there are considerations that are difficult to judge quantitatively, like the quality of formatting. Even something seemingly simple like testing whether a book's TOC is linked correctly entails a huge rats nest of heuristics and guesswork.
My usual heuristic is to take the version with the largest number of pages, or if there are several with the same number of pages, the one with the largest filesize. Obviously if someone is gaming this it won't work; it's trivial to insert mountains of noise into a PDF.
I usually prefer the scanned PDF in these cases, because the OCRed version often contains errors, and in cases where the book matters, those errors can be very difficult to detect (incorrect superscripts in equations and things like that). Sometimes it's so poorly scanned that I don't prefer the scan (especially a problem with scans by Google Books).
As the previous reply said, I've also seen duplicates while browsing. Would it be possible to let users flag duplicates somehow? It involves human unreliability, which is like automated unreliability, only different.
I think one of the problems is the lack of a good open source PDF compressor. We have good open source OCR software like ocrmypdf, which I've seen used before, but some of the best-compressed books I've seen on libgen used some commercial compressor, while the open source ones I've used were generally quite lackluster. This applies doubly when people rip images from another source, combine them into a PDF, and then upload it as a high-resolution PDF, which inevitably ends up being between 70 and 370 MB.
How to deal with duplication is also a very difficult problem, because there are loads of reasons why things could be duplicated. Take a textbook: I've seen duplicates differing in one or several of the following -- different editions, different printings (of any particular edition), added bookmarks/table of contents for the file, removed blank white pages, removed front/end cover pages, removed introduction/index/copyright/book information pages, LaTeX'd copies of pre-TeX textbooks, OCR'd copies, different resolutions, other kinds of optimization by software that result in wildly different file sizes, different file types (e.g. .chm, PDFs that are straight conversions from epub/mobi), etc. Some of this can be detected by machines, e.g. the use of OCR, but some of the other things aren't easy at all to detect.
What commercial compressor/performance are you talking about?
AFAIK the best compression you see is monochrome pages encoded in Group4, which ImageMagick (open source) will do, and which ocrmypdf happily works on top of.
Otherwise it's just your choice of using underlying JPG, PNG, or JPEG 2000, and up to you to set your desired lossy compression ratio.
While it’s a bit of an extreme case, the file for a single 15-page article on Monte Carlo noise in rendering[1] is over 50M (as noise should specifically not be compressed out of the pictures).
I was just checking my PDFs over 30M because of this post and was surprised to see the DALL-E 2 paper is 41.9M for 27 pages. Lots of images, of course, it was just surprising to see it clock in around a group of full textbooks.
If I remember correctly, images in PDFs can be stored at full resolution but are then rendered at their final size, which, more often than not in double-column research papers, ends up being tiny.
That graph of file size vs. number of files would be much easier to read if it were logarithmic. I guess OP is using matplotlib. In this case, use plt.loglog instead of plt.plot. Also, consider plt.savefig("chart.svg") instead of png.
There are classes of books that are significantly larger than the rest, like medical/biology books. I don't know if they embed vector-based images of the whole body or maybe hundreds of images, but it's surprising how big they are.
Who's in to do some large-scale data gathering about unoptimized books and potentially redundant ones? Or maybe trim PDFs (qpdf can optimize the structure to an extent).
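E.g. a sketch of such structural trimming (these flags repack the PDF's internal structure; they don't recompress the page images):

    qpdf --object-streams=generate --compress-streams=y input.pdf output.pdf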
I have experience scanning personal books and also trying to reduce their size, since I'm also concerned about bloat on my (older) mobile reading devices. Unfortunately, there are reasons I cannot upload those, but the procedures might still be helpful for existing scans.
Use ScanTailor to clean them up. If there is no need for color/grayscale, have the output strictly black and white.
OCR them with Adobe Acrobat ClearScan (or something else, that is what I have).
Convert to black and white DJVU (Djvu-Spec); a command-line alternative is sketched after this list.
Dealing with color is another thing, and takes some time. I find that using G'MIC's anisotropic smoothing can help with the ink-jet/half-tone patterns. But it's too time consuming to be used for books.
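A sketch of the DJVU step using djvulibre's command-line tools instead, assuming the cleaned-up pages are bitonal PBM files (names and DPI are placeholders):

    for f in page_*.pbm; do
      cjb2 -dpi 600 "$f" "${f%.pbm}.djvu"    # encode each bitonal page
    done
    djvm -c book.djvu page_*.djvu            # bundle the pages into one DJVU file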
I like ScanTailor! I've used ocrmypdf for the OCR and compression steps. It uses lossless JBIG2 by default, at 2 or 3k per page; I'm curious how that compares to DJVU. (And my mistake, pdf and DJVU are competing container formats.)
If the PDF is from a scanned source, converting it to DJVU at an equivalent DPI typically results in about half the file size (figures can vary depending on the specifics of the PDF source).
First of all, bloat has nothing to do with file size -- EPUBs are often around 2 MB, typeset PDFs are often 2-10 MB (depending on the quantity of illustrations), and scanned PDFs are anywhere from 10 MB (if reduced to black and white) to 100 MB (for color scans, like where necessary for full-color illustrations).
The idea of a 30 MB cutoff does nothing to reduce bloat, it just removes many of the most essential textbooks. :( Also, it's very rare to see duplicates of 100 MB PDFs.
Second, file duplication is there, but it's not really an unwieldy problem right now. Probably the majority of titles have only a single file, many have 2-5 versions, and a tiny minority have 10+. But they're often useful variants -- different editions (2nd, 3rd, 4th) plus alternate formats like reflowable EPUB vs. PDF scan. These are all genuinely useful and need to be kept.
Most of the unhelpful duplication I see tends to fall into three categories:
1) There are often 2-3 versions of the identical typeset PDF except with a different resolution for the cover page image. That one baffles me -- zero idea who uploads the extras or why. My best guess is a bot that re-uploads lower-res cover page versions? But it's usually like original 2.7 MB becoming 2.3 MB, not a big difference. Feels very unnecessary to me.
2) People (or a bot?) who seem to take EPUBs and produce PDF versions. I can understand how that could be done in a helpful spirit, but honestly the resulting PDFs are so abysmally ugly that I really think people are better off producing their own PDFs using e.g. Calibre, with their own desired paper size, font, etc. Unless there's no original EPUB/MOBI on the site, PDF conversions of them should be discouraged, IMHO.
3) A very small number of titles do genuinely have like 5+ seemingly identical EPUB versions. These are usually very popular bestselling books. I'm totally baffled here as to why this happens.
It does seem like it would be a nice feature to be able to leave some kind of crowdsourced comments/flags/annotations to help future downloaders figure out which version is best for them (e.g. is this PDF an original typeset, a scan, or a conversion? -- metadata from the uploader is often missing or inaccurate here). But for a site that operates on anonymity, it seems like this would be too open to abuse/spamming. Being able to delete duplicates opens the door to accidental or malicious deleting of anything. I'd rather live with the "bloat"; it's really not an impediment to anything at the moment.
Has anyone ever stumbled across an executable on LibGen? The article mentioned finding them but I've never seen one.
I agree with the other comments that LibGen shouldn't purge the larger books. But, in terms of mirrors, it would be nice to have a slimmed down archive I could torrent. 19 TB would be manageable. And would be nice to have a local copy of most of the books.
Thanks for the lists. I was genuinely curious about the exes. Nice to know where they originate. Interesting that over half of them have titles in Cyrillic. I guess not so many English language textbooks (with included CDs) have been uploaded with the data portion intact.
Curation is hard, particularly for a "community" project.
Every file is there for a reason, and much of the time, even if it is a stupid reason, removing it means there is one more person opposed to the concept of "curation".
Um, if the goal is to fit what you can onto a 20TB hard drive at home, then nobody is stopping you from choosing your own subset, as opposed to deleting stuff out of the main archive based on ham-handed criteria...
Z-Library has been innovating a great deal in that regard. Sadly they are not as open/sharing as LibGen mirrors in giving back to the community (in terms of database dumps, torrents, and source code).
In that case your expectations are understandable. People in your generation are accustomed to finding anything in mere seconds. Not very long ago, if it took you a few minutes to find a book in the catalogue you would count yourself lucky. And if your local library didn't have the book you're looking for, you could spend weeks waiting for the book to arrive from another library in the system.
Libgen's search certainly isn't as good as it could be, but it's more than good enough. If you can't bear spending a few minutes searching for a book, can you even claim to want that book in the first place? It's hard for me to even imagine being in such a rush that a few minutes searching in a library is too much to tolerate. But then again, I wasn't raised with the expectations of your generation.
While it's nice to see people reading, learning, and loving libraries, keep in mind that the Library Genesis remnants you are typically using are money hogs covering their profiteering with the original altruistic LG disguise. They don't produce forks; they just link everyone up to work for their own growth. That's not what LG used to be.
Storage space is not a problem, especially not on the order of terabytes. If you want to download all of libgen on a cheap drive, perhaps limit yourself to epub files only. No one needs all of libgen anyway except archivists and data hoarders.
Yes, that makes you a data hoarder. Normal people would just use one of the many other methods of getting free books, like legal libraries, googling it on Yandex, torrents, asking a friend, etc. Or just actually pay for a book.
My target audience is not normal people though, and I don't mean this in the "edgy" sense. The fact that we are having this discussion is very abnormal to begin with, and I think it's great that there are some deviants from the norm who care about the longevity of such projects.
I can imagine many students and researchers hosting a mirror of LibGen for their fellows for example.
I'd love to see that distribution at the end with a log-axis for the file size! Or maybe even log-log, depending. Gives a much better sense of "shape" when working with these sorts of exponential distributions
I've been dreaming of a book decompiler that would use some newfangled AI/ML to produce a perfectly typeset copy of an older book, in the same font or a similar one, recognizing multiple languages and scripts within the work.
In the same vein, I would like an e-reader that has TeX or InDesign quality typesetting. I'd settle for Knuth-Plass line breaking with decent justification (and hyphenation).
At the very least, make it so that headings do not appear at the bottom of a page. Who thought that was OK?
In an ideal world, every book could be given an "importance" score, for some arbitrary value of importance. For example, how often it is cited. This could be customised on a per-user basis, depending on which subjects and time periods you're interested in.
Then you can specify your disk size, and solve the knapsack problem to figure the optimal subset of files that you should store.
Edit: Curious to see this being downvoted. Is it really that bad of an idea? Or just off-topic?
I'm not suggesting reducing the size of the LibGen collection, I'm thinking along the lines of "I have 2TB of disk space spare, and I want to fill it with as much culturally-relevant information as possible".
If the entire collection were available as a torrent (maybe it already is?), I could select which files I wish to download, and then seed.
Those who have 52TB to spare would of course aim to store everything, but most people don't.
Just as the proposal in the OP would result in the remaining 32.59 TB of data being less well replicated, my approach has the problem that less "popular" files would be poorly replicated, but you could solve that by also selecting some files at random. (e.g. 1.5TB chosen algorithmically, 0.5TB chosen at random).
I don't think you deserved the downvotes, and I don't think it's a bad idea either; indeed, some coordination as to how to seed the collection is really needed.
For instance phillm.net maintains a dynamically updated list of LibGen and Sci-Hub torrents with less than 3 seeders so that people can pick some at random and start seeding: https://phillm.net/libgen-seeds-needed.php
Seems like a perfectly good idea to me! Basically you're proposing that we decide caching by some score, with the details of the score function tweaked to handle the different aspects we care about.
I wonder whether this idea is already used for locating data in distributed systems — from clusters all the way to something like IPFS.
Maybe if the objective is preservation, instead of each person saving an entire copy of libgen locally, people [in a country where this is legal] should save N-of-M shares of it. 51.50 TB in a 5-of-M shares setup would be under 11 TB per share; if M were 16-32 or so, and the community remained sufficiently active to replace the shares held by lapsed participants, it would have a good chance of surviving the next big historical period of book-burnings.
A 2TB disk apparently costs about US$40 right now (https://www.mercadolibre.com.ar/disco-duro-interno-western-d... for example) so this is a contribution of about US$200 per participant. Plus the risk of being arrested in the future for possessing forbidden information, of course, but maybe the fact that you can't decrypt it without four other participants would reduce that risk.
Of course it's also worthwhile to keep plaintext copies of books you actually read, or might want to read, or want to pretend you actually read. My copy of Kenneth Snelson's Art and Ideas is 41 MB and 174 pages (US$0.0008, 240 kB/page); my copy of Boole's Treatise on the Calculus of Finite Differences is 19 MB and 356 pages (US$0.0004, 53 kB/page); my copy of Kevin Carson's Homebrew Industrial Revolution is 3.8 MB and 399 pages (US$0.00008, 9.5 kB/page). If you were to devote a single terabyte to books for yourself at 10 megabytes per book, you'd still have room for 100'000 books, quite a nice library by any historical standard, even if it's small compared to all of libgen. Perhaps a small group of people [again, in a country where this is legal] participating in a distributed storage system could preserve significant fractions of libgen in such a way even in the face of disaster.
I think it's reasonable to weight such preservation efforts toward lighter-weight books, but I also think it's easy to screw that up, for example by keeping only books under 30MB, which throws away all the decent scans of many books, leaving only worthless epub versions.
At this point, though, it might be more important to create durable physical artifacts encoding this incomparable treasure; many historical events have obliterated communities of learning, leaving only artifacts. The Cambodian Killing Fields are one recent example, but we can also point to the Spanish Catholic zealots burning the khipu and the Maya codices; the Boxers burning most of the last copy of the Yongle Encyclopedia; Qin Shi Huang's Burning of the Books and Burying of the Scholars; the Christians' prohibition of the Egyptian religion, which brought to a close the millennia-long knowledge of hieroglyphs; and the Roman conquest of Syracuse under Marcellus, in which Archimedes was killed and his knowledge of the integral calculus was lost until Newton.
> by filtering any "books" (rather, files) that are larger than 30 MiB we can reduce the total size of the collection from 51.50 TB to 18.91 TB, shaving a whopping 32.59 TB
Books greater than 30 MiB are all the textbooks.
You are killing the knowledge.
Also killing a lot of rare things.
If you want to do something amazing and small, OCR them.
As an example of something greater than 30 MB: I grabbed a short story by Greg Bear the other day that isn't available digitally; it was in a 90 MB copy of a 1983 issue of Analog Science Fiction and Fact.
Side note: de-duping is an incredibly hard project. How will you diff a MOBI and an EPUB and then make a decision? Or decide between one MOBI and another?
Books also change with time. Even in the '90s, kids' books from the '60s had been 'edited'. These can be hidden gems to collectors. Cover art, too.