This is a bit anecdotal, but I did upload a book to libgen. I am an avid user of the site, and during my thesis research I was looking for a specific book and could not find it on there. I did, however, find it on archive.org. I spent the better part of one afternoon extracting the book from archive.org with some Adobe software, since I had to circumvent some DRM and other things, and all of this was novel to me. In the end I got a scanned PDF that was several hundred MB. I managed to reduce it to 47 MB, but further reduction was not easily possible, at least not with the means I knew or had at my disposal. I uploaded this version to libgen.
I do agree that there may be some large files on there, but I don't agree with removing them. I spent some hours putting this book on there so others who need it can access it within seconds. Removing it because it is too large would void all this effort and require future users to go through a similar process to the one I did just to browse through the book.
Also, any book published today is most likely available in some ebook format, which is much smaller, so I don't think the size of libgen will continue to grow at the pace it is growing now.
To be clear, I am not advocating for the removal of any files larger than 30 MiB (or any other arbitrary hard limits). It'd be great of course to flag large files for further review, but the current software doesn't do a great job at crowdsourcing these kinds of tasks (another one being deduplication) sadly.
Given how little volunteer power there is, I'm suggesting that a "lean edition" of LibGen can still be immensely useful to many people.
Files are a very bad unit to elevate in importance, and the number of files or file size are really bad proxy metrics, especially without considering the statistical distribution of downloads (let alone the question of what is more "important"!). E.g. junk that's under the size limit is implicitly being valued over good content that happens to be larger. Textbooks and reference books will likewise get filtered out with higher likelihood, and that would screw students in countries where they cannot afford them (arguably a more important audience to some, compared to those downloading comics). Etc.
After all this, the most likely human response from people who really depend on this platform would be to slice a big file into volumes under the size limit. That seems like a horrible UX downgrade in the medium to long term, for no other reason than satisfying some arbitrary metric of legibility[1].
Here's a different idea -- might it be worthwhile to convert the larger files to better-compressed versions, e.g. PDF -> DJVU? This would lead to some duplication in the medium term, but if one sees a convincing pattern of users switching to the compressed versions without needing to come back to the larger ones, that would imply that the compressed version works and the larger version could eventually be garbage-collected.
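For reference, a conversion like that can be done with the open-source pdf2djvu; a minimal sketch (the DPI value and file names are just placeholders to tune per book):

    pdf2djvu --dpi=300 -o output.djvu input.pdf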
Thinking in an even more open-ended manner, if this corpus is not growing at a substantial rate, can we just wait out a decade or so of storage improvements before this becomes a non-issue? How long might it take for storage to become 3x, 10x, 30x cheaper?
> can we just wait out a decade or so of storage improvements before this becomes a non-issue?
I'm not sure that there is anything on the horizon which would make duplicate data a 'non-issue'. Capacities are certainly growing, so within a decade we might see 100TB HDDs available and affordable 20TB SSDs. But that does not solve the bandwidth issues. It still takes a long, long time to transfer all the data.
The fastest HDD is still under 300MB/s which means it takes a minimum of 20 hours to read all the data off a 20TB HDD. That is if you could somehow get it to read the whole thing at the maximum sustained read speed.
SSDs are much faster, but it will always be easier to double the capacity than it is to double the speed.
The problem isn't the technology, it's the cost. Given a far larger budget, you wouldn't run the hard drives at anywhere near capacity, in order to gain a read-speed advantage by running a ton of them in parallel. That'll let you read 20 TB in an hour if you can afford it. Put it this way: Netflix is able to do 4K video, and that's far more intensive.
There are people who contribute to the LibGen ecosystem, but unfortunately in areas that don't really benefit the community. Users don't need another CLI tool for LibGen, nor does the community need another bot. Unfortunately that's what folks do: make extensions, CLI tools, and bots that benefit next to no one, and release them willy-nilly with no support.
There's a bunch. Here's what I do (for black-and-white text; I'm not sure how to deal with more complex scenarios):
Scan at 600 dpi resolution. Never mind that this gives huge output files; you'll compress them to something much smaller at the end, and the better your resolution, the stronger the compression you can use without losing readability.
While scanning, periodically clean the camera or the scanner screen, to avoid speckles of dirt on the scan.
The ideal output formats are TIF and PNG; use them if your scanner allows. PDF is also fine (you'll then have to extract the pages into TIF using pdfimages or using ScanKromsator). Use JPG only as a last resort, if nothing else works.
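A minimal sketch of that extraction step with poppler's pdfimages (the -tiff option needs a reasonably recent poppler; 'scan.pdf' and the 'page' prefix are placeholders):

    pdfimages -tiff scan.pdf page    # writes page-000.tif, page-001.tif, ...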
Once you have TIF, PNG or JPG files, put them into a folder. Make sure that the files are sorted correctly: IIRC, the numbers in their names should match their order (i.e., blob030 must be an earlier page than blah045; it doesn't matter whether the numbers are contiguous or what the non-numerical characters are). (I use the shell command mmv for convenient renaming.)
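For example (a sketch with made-up file names; #1 stands for whatever the first wildcard matched):

    mmv 'IMG_*.png' 'page_#1.png'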
Now open these files as a new project in ScanTailor and work through its stages. Stage 1 (Fix Orientation): Use the arrow buttons to make sure all text is upright. Use Q and W to move between pages.
Stage 2 (Split Pages): You can auto-run this using the |> button, but you should check that the result is correct. It doesn't always detect the page borders correctly. (Again, use Q and W to move between pages.)
Stage 3 (Deskew): Auto-run using |>. This is supposed to ensure that all text is correctly rotated. If some text is still skewed, you can detect and fix it later.
Stage 4 (Select Content): This is about cutting out the margins. This is the most grueling and boring stage of the process. You can auto-run it using |>, but it will often cut off too much and you'll have to painstakingly fix it by hand. Alternatively (and much more quickly), set "Content Box" to "Disable" and manually cut off the most obvious parts without trying to save every single pixel. Don't worry: White space will not inflate the size of the ultimate file; it compresses well. The important thing is to cut off the black/grey parts beyond the pages. In this process, you'll often discover problems with your scan or with previous stages. You can always go back to previous stages to fix them.
Stage 5 (Margins): I auto-run this.
Stage 6 (Output): This is important to get right. The despeckling algorithm often breaks formulas (e.g., "..."s get misinterpreted as speckles and removed), so I typically uncheck "Despeckle" when scanning anything technical (it's probably fine for fiction). I also tend to uncheck "Savitzky-Golay smoothing" and "Morphological smoothing"; I don't remember why (probably they broke something for me in some case). The "Threshold" slider is important: experiment with it! (Check which value makes a typical page of your book look crisp. Be mindful of pages that are paler or fatter than others. You can set it for each page separately, but most of the time it suffices to find one value for the whole book, except perhaps the cover.) Note the "Apply To..." buttons; they allow you to promote a setting from a single page to the whole book. (Keep in mind that there are two -- the second one is for the despeckling setting.)
Now look at the tab on the right of the page. You should see "Output" as the active one, but you can switch to "Fill Zones". This lets you white-out (or black-out) certain regions of the page. This is very useful if you see some speckles (or stupid write-ins, or other imperfections) that need removal. I try not to be perfectionistic: The best way to avoid large speckles is by keeping the scanner clean at the scanning stage; small ones aren't too big a deal; I often avoid this stage unless I know I got something dirty. Some kinds of speckles (particularly those that look like mathematical symbols) can be confusing in a scan.
There is also a "Picture Zones" rider for graphics and color; that's beyond my paygrade.
Auto-run stage 6 again at the end (even if you think you've done everything -- it needs to recompile the output TIFFs).
Now, go to the folder where you have saved your project, and more precisely to its "out/" subfolder. You should see a bunch of .tif files, each one corresponding to a page. Your goal is to collect them into one PDF. I usually do this as follows:
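A sketch of one way to do the collecting, assuming ImageMagick is installed and the pages are bilevel TIFFs (tiff2pdf or img2pdf would work as well):

    cd your-project-folder
    convert out/*.tif -compress Group4 book.pdf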
Thus you end up with a PDF in the folder in which your project is.
Optional: add OCR to it; add bookmarks for chapters and sections; add metadata; correct the page numbering (so that page 1 is actual page 1). I use PDF-XChangeLite for this all; but use whatever tool you know best.
At that point, your PDF isn't super-compressed (don't know how to get those), but it's reasonable (about 10MB per 200 pages), and usually the quality is almost professional.
Uploading to LibGen... well, I think they've made the UI pretty intuitive these days :)
PS. If some of this is out of date or unnecessarily complicated, I'd love to hear!
> At that point, your PDF isn't super-compressed (don't know how to get those)
As far as I know, it's making sure your text-only pages are monochrome (not grayscale) and using Group4 compression for them, which is actually what fax machines use (!) and is optimized specifically for monochrome text. Both TIFF and PDF support Group4 -- I use ImageMagick to take a scanned input page and run grayscale, contrast, Group4 monochrome encoding, and PDF conversion in one fell swoop, which generates one PDF per page, and then "pdfunite" to join the pages. Works like a charm.
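Roughly along these lines -- a sketch only, with made-up file names and a threshold you'd tune per scan:

    for f in page_*.png; do
      convert "$f" -colorspace Gray -normalize -threshold 60% \
              -compress Group4 "${f%.png}.pdf"    # one monochrome PDF per page
    done
    pdfunite page_*.pdf book.pdf                  # join the per-page PDFs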
I'm not aware of anything superior to Group4 for regular black and white text pages, but would love to know if there is.
Oh, I should have said that I scan in grayscale, but ScanTailor (at stage 6) makes the output monochrome; that's what the slider is about (it determines the boundary between what will become black and what will become white). So this isn't what I'm missing.
I am not sure if the result is G4-compressed, though. Is there a quick way to tell?
On my system I can run 'pdfimages -list' on a PDF and it gives me all the images in the PDF with their encoding format. The utility comes with 'poppler-utils', I believe.
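E.g.:

    pdfimages -list book.pdf
    # 'ccitt' in the enc column means CCITT Group 3/4 fax compression;
    # 'jbig2' means JBIG2.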
And I'm just now discovering, by checking on my own PDFs, that 'ocrmypdf' will automatically convert Group4 to lossless JBIG2 (if optimizations are enabled), which is supposedly even more efficient for monochrome -- but encoders aren't always available [1].
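Something like this, assuming the jbig2enc encoder is installed (a sketch; see ocrmypdf's docs for what each optimization level does):

    ocrmypdf --optimize 1 input.pdf output.pdf    # add --skip-text if the PDF already has a text layer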
I don't think ImageMagick has been updated yet to support outputting JBIG2 for PDFs.
Ah. It says the encoding is ccitt, which I hope is indeed the same as Group4.
How is the lossless JBIG2 in terms of reading speed? I've seen some very well-compressed PDF files around that unfortunately load so slow that they are almost unreadable on mobile; I think this was JBIG2. In that case, I'm wondering if this can be avoided by proper use or is a necessary downside of the encoding.
As fast as anything else for all practical purposes.
I too have encountered molasses-slow PDFs, and I can't even begin to guess what causes that. Book PDFs from OpenLibrary are often like that for me. It genuinely makes me wonder if it's producing each page's image with embedded JavaScript writing to a canvas or something... except that might actually still be faster.
I really appreciate having grayscale or color scans of books rather than bilevel black and white. It's often much easier to read, and often illustrations come out mangled into illegibility by thresholding. Occasionally even text does.
I do too, but I find that they're just too big in file size.
Bilevel at around 300 DPI means scanned books that run 2-5 MB. Grayscale/color tends to mean 10-50 MB.
For me it's less about the storage and more about performance -- for everything from downloading to copying to e-mailing to previews to autosaving while highlighting, apps and cloud services seem to cope well and quickly with 3 MB PDFs, but seem to slow down dramatically with 30 MB ones.
Nice write-up, thank you! I tried some book scanning a couple of years ago (well, it was mostly about getting better and cleaner PDFs rather than the scanning itself), and the best guide back then was the one by Nate Craun (former maintainer of ScanTailor).
Would you mind writing this guide in a less forum-y site? Or do you know any place to look for a good tutorial and best practices for this hobby?
IMHO a process which is lossy should never be described as deduplication.
What would work out fairly well for this use case is to group files by similarity, and compress them with an algorithm which can look at all 'editions' of a text.
This should mean that storing a PDF with a (perhaps badly, perhaps brilliantly) type-edited version next to it would 'weigh' about as much as the original PDF plus a patch.
> IMHO a process which is lossy should never be described as deduplication.
Depends. There are going to be some cases where files aren't literally duplicates, but the duplicates don't add any value -- for example, MOBI conversions of EPUB files, or multiple versions of an EPUB with different publisher-inserted content (like adding a preview of a sequel, or updating an author's bibliography).
Splitting those into two cases: I think getting rid of format conversions (which can, after all, be performed again) is worthwhile, but it isn't deduplication; that's more like pruning.
Multiple versions of an EPUB with slightly different content is exactly the case where a compression algorithm with an attention span, and some metadata to work with, can get the multiple copies down enough in size that there's no point in disposing of the unique parts.
Plus there are a lot of books where one version is a high-quality scan with no OCR, and the other is an OCRed scan (with a bunch of errors, but searching works 80% of the time) of horrible scan quality.
Also, some books include appendices that are scanned in some versions but not in others, plus large posters that are shrunk to A4 size in one version, split onto multiple A4 pages in another, and kept as one huge page in a third version.
Then there are zips of books containing one PDF plus, e.g., example code, libraries, etc. (e.g. programming books).
Have to be careful there. A jihad against duplication means that poor-quality scans will drive out good ones, or prevent them from ever being created. Especially if you're misguided enough to optimize for minimum file size.
I agree with samatman's position below: as long as the format is the slightest bit lossy -- and it always will be -- aggressive deduplication has more downsides than upsides.
Deduplication doesn't have to mean removal. It might be just tagging. It would be very nice to be able to fetch the "best filesize" version of the entire collection, then pull down the "best quality" editions of only a few things I'm particularly interested in.
One of my favourite places on the internet too. The thing is, you just search for what you want and spend 10 seconds finding the right book and link. While I'd love to mirror whole archive locally, it would really be superfluous because I can only read a couple of quality books at a time anyway, so building my own small archive of annotated PDFs (philosophy is my drug of choice) is better than having the whole. I think it's actually remarkably free of bloat and cruft considering, but maybe I'm not trawling the same corners as you are. Do kind of wish they'd clear out the mobi and djvu versions and make it unified however.
> While I'd love to mirror whole archive locally, it would really be superfluous because I can only read a couple of quality books at a time anyway, [...]
I'd love to agree but as a matter of fact LibGen and Sci-Hub are (forced to be) "pirates" and they are more vulnerable to takedowns than other websites. So while I feel no need to maintain a local copy of Wikipedia, since I'm relatively certain that it'll be alive in the next decade, I cannot say the same about those two with the same certainty (not that I think there are any imminent threats to either, just reasoning a priori).
First, SciHub != LibGen. They are allied projects that clearly share a support base, but they are not identical.
Second, please provide a citation for the assertion that sharing copies of printed fiction erodes sales volume. At this point, one may assume that anything that helps to sell computer games and offline swag is cash-in-bank for content producers. Whether original authors get the same royalties is an interesting question.
Third, the former Soviet milieu probably isn't currently in the mood to cooperate with western law enforcement.
Even what you call LibGen isn't LG. These are LG forks, actually running against LG while pretending to be LG. LG was set up so that other libraries could be created on its basis. Each of the forks aggressively fights for its own dominance in every way, and they resist the development of other forks by naming themselves LG and sucking all the funds into personal possession without public reporting. Being forks themselves, they have closed off the open project for their own ambitions and for a money grab.
Their values are incompatible with LG's, and all that remains similar is the outward-facing part of letting people download books, without which there would be nothing useful to look at.
Yeah, and the herculean work is actually done outside such aggregators by myriads of smaller collections digitizing, binding, processing, collecting, and channeling millions of handmade books into rivers of literature, for free and ready to grab. The growth is global and has little to do with what the forks do.
This would be disastrous for preservation. Often the djvu scan is the only digital version, the book is no longer in print, and the publisher isn't around anymore. The djvu archives often exist specifically because some old book really has, and had, value to people.
Yeah, I always convert DJVU to PDF (pretty easy), but it never compresses quite as nicely.
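E.g. with djvulibre's ddjvu -- a sketch, with the quality setting being just an example:

    ddjvu -format=pdf -quality=85 input.djvu output.pdf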
DJVU is pretty clever in how it uses a mask layer for more efficient compression, and as far as I know, converting to PDF is always done "dumb" -- flattening the DJVU file into a single image and then encoding that image traditionally in PDF.
I wonder if it's possible to create a "lossless" DJVU to PDF converter, or something close to it, if the PDF primitives allow it? I'm not sure if they do, if the "mask layer" can be efficiently reproduced in PDF.
It can be done with relative ease. There is a commercial tool somewhere that does it, because I've run across many PDF files that use a DjVu like structure for scanned books.
It won't be perfectly lossless, because the IW44 compression of the color layer will need to be recompressed as JPEG or JPEG2000. The JB2 mask layer can be losslessly recompressed as JBIG2 or CCITT G4 Fax.
I'm not for clearing out djvu, but it sure is frustrating when a PDF isn't available.
It's not just about laziness preventing one from installing the more obscure ebook readers which support djvu. It's about security: I only trust PDFs when I create them myself with TeX or similar, otherwise I need to use the Chromium PDF reader to be (relatively) safe. I don't trust the readers that support Djvu to be robust enough against maliciously malformed djvu files, as I'm guessing the readers are implemented in some ancient dialect of C or C++ and I doubt they're getting much if any scrutiny in the way of security.
It's super easy to convert a DJVU file to PDF though. There's an increase in filesize but it's not the end of the world.
And since you're creating the PDF yourself seems like you can trust it? Since nothing malicious could survive the DJVU to PDF conversion since it's just "dumb" bitmap-based.
If your DjVu file contains an exploit for your DjVu decoder, even if you run it in a bombproof container, it could still conceivably inject malicious code into the resulting PDF file. That sounds far-fetched because the exploit payload would need to recognize that a PDF conversion was going on and respond by generating the PDF, but I remember when people thought exploiting buffer overflows was implausible, and this is not the same level of rocket science.
djvu is really quite a marvellous format, but I'm only able to read them in Evince (the default PDF reader that comes with Debian, Fedora, and probably a bunch of other distros). For my MacBook I need to download a DJVU reader, and for my iPad I didn't even bother trying, because the experience would likely be much worse than Preview/iBooks.
DJVU is supported by numerous book-reading applications, including (in my experience) FB Reader (FS/OSS), Pocketbook, and Onyx's Neoreader.
As a format for preserving full native scan views (large, but often strongly preferable for visually-significant works or preserving original typesetting / typography), DJVU is highly useful.
I do wish that it were more widely supported by both toolchains and readers. That will come in time, I suspect.
Calibre supports djvu on any platform. Deleting djvu books just because Microsoft and Apple don't see fit to support it by default would be a travesty.
Is there a torrent available that would allow straightforward setup of locally storable and accessible Libgen library? For the storage rich but internet connection reliability poor, something like this would be a godsend.
My comment about djvu was mostly just about my own laziness, because (kill me if you need to) I like using Preview on the Mac for reading and annotating, and it doesn't read them, and once they have to live in a djvu viewer, I tend not to read them or mark them up. Same goes for Adobe Acrobat Reader when I'm on Windows on my University's networked PCs.
I wish they'd clear out the PDF versions and replace them with DjVu versions. DjView is better than any PDF reader I've used, and DjVu files are smaller than scanned PDFs.
That’s not going to work, because outside of e-book enthusiasts, few know what DJVU is and even fewer have the technological skills and will to figure out how to open it. A large part of LibGen’s demographic are university students downloading exorbitantly priced textbooks, and given that, having both a pdf and a djvu available would be ideal.
GNOME-based Linux distributions ship with DjVu support by default, and so do MATE and KDE and most document viewers for Android. But even if you're not using Linux, if you're going to spend 50 hours studying a textbook and you're part of a learning community like a university class, with dozens of people facing the same problem, one of you can spend 0.5 hours figuring out how to install DjView so you can read the textbook. That's a much easier problem to solve than finding out about Library Genesis in the first place, not to mention fixing your legal system so it's legal.
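For reference, on Debian/Ubuntu the install itself is a one-liner (package names may differ on other distros):

    sudo apt install djvulibre-bin djview4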
That's funny, I did the same analysis with sci-hub. Back when there was an organized drive to back it up.
I downloaded parts of it and wanted to figure out why it was so heavy, seeing as you'd expect articles to be mostly text and very light.
There was a similar distribution of file sizes. My immediate instinct was also to cut off the tail-end, but looking at the larger files I realized it was a whole range of good articles that included high quality graphics that were crucial to the research being presented, not poor compression or useless bloat.
I think Sci-Hub is the opposite since 1 DOI = 1 PDF in its canonical form (straight from the publisher) so neither duplication nor low-quality is the case.
It does depend on when the work was published. Pre-digital works scanned in without OCR can be larger in size. That's typically works from the 1980s and before.
Given the explosion of scientific publishing, that's likely a small fraction of the archive by work though it may be significant in terms of storage.
It can be illuminating to look at the size of ePub documents. An ePub is in general a compressed HTML container, so file sizes tend to be quite small. A book-length text (~250 pp or more) might be from 0.3 -- 5 MB, and often at the lower end of the scale.
Books with a large number of images or graphics, however, can still bloat to 40-50 MB or even more.
Otherwise, generally, text-based PDFs (as opposed to scans) are often in the 2--5 MB range, whilst scans can run 40--400 MB. The largest I'm aware of in my own collection is a copy of Lyell's Geography, sourced from Archive.org. It is of course scans of the original 19th century typography. Beautiful to read, but a bit on the weighty side.
I don't think OP takes into account that there seem to be multiple editions of the same book which are often required by people to refer to. Not everyone wants the latest edition when the class you're in is using some old edition.
In practice, it's more often the same file with minor edits such as a PDF table of contents added or page numbers corrected. Say, how many distinct editions of this standard text on elementary algebraic geometry are in the following list?
I like to think that LibGen also serves as a historical database wherein there is a record that a book of a specific edition had its errors corrected. (Although it would be better if errata could be appended to the same file if possible)
Yes, for very minor edits, those files should obviously not exist, but for that there would need to be someone who verifies this, which is such an enormous task that likely no one would take it up.
If you are referring to my duplication comments, sure (but even then I believe there are duplicates of the exact same edition of the same book). Though the filtering by filesize is orthogonal to editions etc. so has nothing to do with that.
I have found the same book as multiple PDFs of different sizes with the same content. Maybe someone uploaded a poorly scanned PDF when the book was first released, but later someone else uploaded an OCRed version, and the first one just stayed there hogging a large amount of storage.
How do you automate the process of figuring out which version is better? It's not safe to assume the smaller versions are always better, nor the inverse. Particularly for books with images, one version of the book may have passable image quality while the other compressed the images to jpeg mush. And there are considerations that are difficult to judge quantitatively, like the quality of formatting. Even something seemingly simple like testing whether a book's TOC is linked correctly entails a huge rats nest of heuristics and guesswork.
My usual heuristic is to take the version with the largest number of pages, or if there are several with the same number of pages, the one with the largest filesize. Obviously if someone is gaming this it won't work; it's trivial to insert mountains of noise into a PDF.
I usually prefer the scanned PDF in these cases, because the OCRed version often contains errors, and in cases where the book matters, those errors can be very difficult to detect (incorrect superscripts in equations and things like that). Sometimes it's so poorly scanned that I don't prefer the scan (especially a problem with scans by Google Books).
As the previous reply said, I've also seen duplicates while browsing. Would it be possible to let users flag duplicates somehow? It involves human unreliability, which is like automated unreliability, only different.
I think one of the problems is the lack of a good open source PDF compressor. We have good open source OCR software like ocrmypdf, which I've seen used before, but some of the best-compressed books I've seen on libgen used some commercial compressor, while the open source ones I've used were generally quite lackluster. This applies doubly when people rip images from another source, combine them into a PDF, and then upload it as a high-resolution PDF, which inevitably ends up being between 70 and 370 MB.
How to deal with duplication is also a very difficult problem, because there are loads of reasons why things could be duplicated. Take a textbook: I've seen duplicates differing in one or several of the following -- different editions, different printings (of any particular edition), added bookmarks/table of contents for the file, removed blank white pages, removed front/end cover pages, removed introduction/index/copyright/book information pages, LaTeX'd copies of pre-TeX textbooks, OCR'd copies, different resolutions, other kinds of optimization by software that result in wildly different file sizes, different file types (e.g. .chm, PDFs that are straight conversions from epub/mobi), etc. Some of this can be detected by machines, e.g. the use of OCR, but some of the other things aren't easy at all to detect.
What commercial compressor/performance are you talking about?
AFAIK the best compression you see is monochrome pages encoded in Group4, which ImageMagick (open source) will do, and which ocrmypdf happily works on top of.
Otherwise it's just your choice of using underlying JPG, PNG, or JPEG 2000, and up to you to set your desired lossy compression ratio.
While it’s a bit of an extreme case, the file for a single 15-page article on Monte Carlo noise in rendering[1] is over 50M (as noise should specifically not be compressed out of the pictures).
I was just checking my PDFs over 30M because of this post and was surprised to see the DALL-E 2 paper is 41.9M for 27 pages. Lots of images, of course, it was just surprising to see it clock in around a group of full textbooks.
If I remember correctly, images in PDFs can be stored at full resolution but are then rendered at their final size, which, more often than not in double-column research papers, ends up being tiny.
That graph of file size vs. number of files would be much easier to read if it were logarithmic. I guess OP is using matplotlib. In this case, use plt.loglog instead of plt.plot. Also, consider plt.savefig("chart.svg") instead of png.
There are classes of books that are significantly larger than the rest, like medical/biology books. I don't know if they embed vector-based images of the whole body or maybe hundreds of images, but it's surprising how big they are.
Who's in to do some large-scale data gathering about unoptimized books and potentially redundant ones? Or maybe trim PDFs (qpdf can optimize the structure to an extent).
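E.g. a sketch of such structural trimming (these flags repack the PDF's internal structure; they don't recompress the page images):

    qpdf --object-streams=generate --compress-streams=y input.pdf output.pdf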
I have experience scanning personal books and also trying to reduce their size, since I'm also concerned about bloat on my (older) mobile reading devices. Unfortunately, there are reasons I cannot upload those, but the procedures might still be helpful for existing scans.
Use ScanTailor to clean them up. If there is no need for color/grayscale, have the output strictly black and white.
OCR them with Adobe Acrobat ClearScan (or something else, that is what I have).
Convert to black and white DJVU (Djvu-Spec); a command-line alternative is sketched after this list.
Dealing with color is another thing, and takes some time. I find that using G'MIC's anisotropic smoothing can help with the ink-jet/half-tone patterns. But it's too time consuming to be used for books.
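A sketch of the DJVU step using djvulibre's command-line tools instead, assuming the cleaned-up pages are bitonal PBM files (names and DPI are placeholders):

    for f in page_*.pbm; do
      cjb2 -dpi 600 "$f" "${f%.pbm}.djvu"    # encode each bitonal page
    done
    djvm -c book.djvu page_*.djvu            # bundle the pages into one DJVU file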
I like ScanTailor! I've used ocrmypdf for the OCR and compression steps. It uses lossless JBIG2 by default, at 2 or 3k per page; I'm curious how that compares to DJVU. (And my mistake, pdf and DJVU are competing container formats.)
If the PDF is from a scanned source, converting it to DJVU at an equivalent DPI typically results in about half the file size (figures can vary depending on the specifics of the PDF source).
First of all, bloat has nothing to do with file size -- EPUBs are often around 2 MB, typeset PDFs are often 2-10 MB (depending on the quantity of illustrations), and scanned PDFs are anywhere from 10 MB (if reduced to black and white) to 100 MB (for color scans, like where necessary for full-color illustrations).
The idea of a 30 MB cutoff does nothing to reduce bloat, it just removes many of the most essential textbooks. :( Also, it's very rare to see duplicates of 100 MB PDFs.
Second, file duplication is there, but it's not really an unwieldy problem right now. Probably the majority of titles have only a single file, many have 2-5 versions, and a tiny minority have 10+. But they're often useful variants -- different editions (2nd, 3rd, 4th) plus alternate formats like reflowable EPUB vs. PDF scan. These are all genuinely useful and need to be kept.
Most of the unhelpful duplication I see tends to fall into three categories:
1) There are often 2-3 versions of the identical typeset PDF except with a different resolution for the cover page image. That one baffles me -- zero idea who uploads the extras or why. My best guess is a bot that re-uploads lower-res cover page versions? But it's usually like original 2.7 MB becoming 2.3 MB, not a big difference. Feels very unnecessary to me.
2) People (or a bot?) who seem to take EPUBs and produce PDF versions. I can understand how that could be done in a helpful spirit, but honestly the resulting PDFs are so abysmally ugly that I really think people are better off producing their own PDFs using e.g. Calibre, with their own desired paper size, font, etc. Unless there's no original EPUB/MOBI on the site, PDF conversions of them should be discouraged, IMHO.
3) A very small number of titles do genuinely have like 5+ seemingly identical EPUB versions. These are usually very popular bestselling books. I'm totally baffled here as to why this happens.
It does seem like it would be a nice feature to be able to leave some kind of crowdsourced comments/flags/annotations to help future downloaders figure out which version is best for them (e.g. is this PDF an original typeset, a scan, or a conversion? -- metadata from the uploader is often missing or inaccurate here). But for a site that operates on anonymity, it seems like this would be too open to abuse/spamming. Being able to delete duplicates opens the door to accidental or malicious deleting of anything. I'd rather live with the "bloat"; it's really not an impediment to anything at the moment.
Has anyone ever stumbled across an executable on LibGen? The article mentioned finding them but I've never seen one.
I agree with the other comments that LibGen shouldn't purge the larger books. But, in terms of mirrors, it would be nice to have a slimmed down archive I could torrent. 19 TB would be manageable. And would be nice to have a local copy of most of the books.
Thanks for the lists. I was genuinely curious about the exes. Nice to know where they originate. Interesting that over half of them have titles in Cyrillic. I guess not so many English language textbooks (with included CDs) have been uploaded with the data portion intact.
Curation is hard, particularly for a "community" project.
Every file is there for a reason, and much of the time, even if it is a stupid reason, removing it means there is one more person opposed to the concept of "curation".
Um, if the goal is to fit what you can onto a 20TB hard drive at home, then nobody is stopping you from choosing your own subset, as opposed to deleting stuff out of the main archive based on ham-handed criteria...
Z-Library has been innovating a great deal in that regard. Sadly they are not as open/sharing as LibGen mirrors in giving back to the community (in terms of database dumps, torrents, and source code).
In that case your expectations are understandable. People in your generation are accustomed to finding anything in mere seconds. Not very long ago, if it took you a few minutes to find a book in the catalogue you would count yourself lucky. And if your local library didn't have the book you're looking for, you could spend weeks waiting for the book to arrive from another library in the system.
Libgen's search certainly isn't as good as it could be, but it's more than good enough. If you can't bear spending a few minutes searching for a book, can you even claim to want that book in the first place? It's hard for me to even imagine being in such a rush that a few minutes searching in a library is too much to tolerate. But then again, I wasn't raised with the expectations of your generation.
While it's nice to see people reading, learning, and loving libraries, keep in mind that the Library Genesis remnants you are typically using are money hogs covering their profiteering with the original altruistic LG disguise. They don't produce forks; they just link everyone up to work for their own growth. That's not what LG used to be.
Storage space is not a problem, especially not on the order of terabytes. If you want to download all of libgen on a cheap drive, perhaps limit yourself to epub files only. No one needs all of libgen anyway except archivists and data hoarders.
Yes, that makes you a data hoarder. Normal people would just use one of the many other methods of getting free books, like legal libraries, googling it on Yandex, torrents, asking a friend, etc. Or just actually pay for a book.
My target audience is not normal people though, and I don't mean this in the "edgy" sense. The fact that we are having this discussion is very abnormal to begin with, and I think it's great that there are some deviants from the norm who care about the longevity of such projects.
I can imagine many students and researchers hosting a mirror of LibGen for their fellows for example.
I'd love to see that distribution at the end with a log-axis for the file size! Or maybe even log-log, depending. Gives a much better sense of "shape" when working with these sorts of exponential distributions
I've been dreaming of a book decompiler that would use some newfangled AI/ML to produce a perfectly typeset copy of an older book, in the same font or a similar one, recognizing multiple languages and scripts within the work.
In the same vein, I would like an e-reader that has TeX or InDesign quality typesetting. I'd settle for Knuth-Plass line breaking with decent justification (and hyphenation).
At the very least, make it so that headings do not appear at the bottom of a page. Who thought that was OK?
In an ideal world, every book could be given an "importance" score, for some arbitrary value of importance. For example, how often it is cited. This could be customised on a per-user basis, depending on which subjects and time periods you're interested in.
Then you can specify your disk size, and solve the knapsack problem to figure the optimal subset of files that you should store.
Edit: Curious to see this being downvoted. Is it really that bad of an idea? Or just off-topic?
I'm not suggesting reducing the size of the LibGen collection, I'm thinking along the lines of "I have 2TB of disk space spare, and I want to fill it with as much culturally-relevant information as possible".
If the entire collection were available as a torrent (maybe it already is?), I could select which files I wish to download, and then seed.
Those who have 52TB to spare would of course aim to store everything, but most people don't.
Just as the proposal in the OP would result in the remaining 32.59 TB of data being less well replicated, my approach has the problem that less "popular" files would be poorly replicated, but you could solve that by also selecting some files at random. (e.g. 1.5TB chosen algorithmically, 0.5TB chosen at random).
I don't think you deserved the downvotes, and I don't think it's a bad idea either; indeed, some coordination as to how to seed the collection is really needed.
For instance phillm.net maintains a dynamically updated list of LibGen and Sci-Hub torrents with less than 3 seeders so that people can pick some at random and start seeding: https://phillm.net/libgen-seeds-needed.php
Seems like a perfectly good idea to me! Basically you're proposing that we decide caching by some score, with the details of the score function tweaked to handle the different aspects we care about.
I wonder whether this idea is already used for locating data in distributed systems — from clusters all the way to something like IPFS.
Maybe if the objective is preservation, instead of each person saving an entire copy of libgen locally, people [in a country where this is legal] should save N-of-M shares of it. 51.50 TB in a 5-of-M shares setup would be under 11 TB per share; if M were 16-32 or so, and the community remained sufficiently active to replace the shares held by lapsed participants, it would have a good chance of surviving the next big historical period of book-burnings.
A 2TB disk apparently costs about US$40 right now (https://www.mercadolibre.com.ar/disco-duro-interno-western-d... for example) so this is a contribution of about US$200 per participant. Plus the risk of being arrested in the future for possessing forbidden information, of course, but maybe the fact that you can't decrypt it without four other participants would reduce that risk.
Of course it's also worthwhile to keep plaintext copies of books you actually read, or might want to read, or want to pretend you actually read. My copy of Kenneth Snelson's Art and Ideas is 41 MB and 174 pages (US$0.0008, 240 kB/page); my copy of Boole's Treatise on the Calculus of Finite Differences is 19 MB and 356 pages (US$0.0004, 53 kB/page); my copy of Kevin Carson's Homebrew Industrial Revolution is 3.8 MB and 399 pages (US$0.00008, 9.5 kB/page). If you were to devote a single terabyte to books for yourself at 10 megabytes per book, you'd still have room for 100'000 books, quite a nice library by any historical standard, even if it's small compared to all of libgen. Perhaps a small group of people [again, in a country where this is legal] participating in a distributed storage system could preserve significant fractions of libgen in such a way even in the face of disaster.
I think it's reasonable to weight such preservation efforts toward lighter-weight books, but I also think it's easy to screw that up, for example by keeping only books under 30MB, which throws away all the decent scans of many books, leaving only worthless epub versions.
At this point, though, it might be more important to create durable physical artifacts encoding this incomparable treasure; many historical events have obliterated communities of learning, leaving only artifacts. The Cambodian Killing Fields are one recent example, but we can also point to the Spanish Catholic zealots burning the khipu and the Maya codices; the Boxers burning most of the last copy of the Yongle Encyclopedia; Qin Shi Huang's Burning of the Books and Burying of the Scholars; the Christians' prohibition of the Egyptian religion, which brought to a close the millennia-long knowledge of hieroglyphs; and the Roman conquest of Syracuse under Marcellus, in which Archimedes was killed and his knowledge of the integral calculus was lost until Newton.
> by filtering any "books" (rather, files) that are larger than 30 MiB we can reduce the total size of the collection from 51.50 TB to 18.91 TB, shaving a whopping 32.59 TB
Books greater than 30 MiB are all the textbooks.
You are killing the knowledge.
Also killing a lot of rare things.
If you want to do something amazing and small, OCR them.
As an example of something greater than 30 MB: I grabbed a short story by Greg Bear the other day that isn't available digitally; it was in a 90 MB copy of a 1983 issue of Analog Science Fiction and Fact.
Side note: de-duping is an incredibly hard project. How will you diff a MOBI and an EPUB and then make a decision? Or decide between one MOBI and another?
Books also change with time. Even in the '90s, kids' books from the '60s had been 'edited'. These can be hidden gems to collectors. Cover art, too.