There are classes of books that are significantly larger than the rest, like medical / biology books. I don't know if they embed vector-based images of the whole body or hundreds of raster images, but it's surprising how big they are.
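A quick way to check where the bytes go is to count the embedded images. A rough sketch, assuming pikepdf is installed; the file name is just a placeholder:

  import pikepdf

  # Sum the compressed size of every embedded image XObject, counting each
  # shared image only once (deduplicated by object id).
  def image_report(path):
      seen, count, total = set(), 0, 0
      with pikepdf.open(path) as pdf:
          for page in pdf.pages:
              for name in page.images:
                  raw = page.images[name]
                  if raw.objgen in seen:
                      continue
                  seen.add(raw.objgen)
                  count += 1
                  total += len(raw.read_raw_bytes())
      print(f"{count} images, {total / 1e6:.1f} MB of image data")

  image_report("anatomy_atlas.pdf")  # placeholder file name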

Who's in to do some large-scale data gathering on unoptimized books and potentially redundant ones? Or maybe trim PDFs (qpdf can optimize the structure to an extent).
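For the qpdf part, a minimal sketch of a lossless structural pass, called via subprocess; the flags are standard qpdf options and the file names are placeholders:

  import subprocess

  def optimize_pdf(src, dst):
      # Lossless: repack objects into object streams and re-deflate streams.
      subprocess.run(
          ["qpdf",
           "--object-streams=generate",
           "--compress-streams=y",
           "--recompress-flate",
           "--compression-level=9",
           src, dst],
          check=True,
      )

  optimize_pdf("input.pdf", "input.opt.pdf")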




Database dumps are available here if you are interested: http://libgen.rs/dbdumps/

libgen_compact_* is what you are probably looking for, but they are all SQL dumps so you'll need to import them into MySQL first. :/
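Once imported, ranking books by size is a short query. A sketch assuming a local MySQL database named libgen; the table and column names (updated, Title, Extension, Filesize) are my guess at the compact dump's schema and may differ:

  import pymysql

  conn = pymysql.connect(host="localhost", user="libgen",
                         password="...", database="libgen")
  with conn.cursor() as cur:
      cur.execute("SELECT Title, Extension, Filesize "
                  "FROM updated ORDER BY Filesize DESC LIMIT 20")
      for title, ext, size in cur.fetchall():
          print(f"{size / 1e6:8.1f} MB  {ext:5}  {title}")
  conn.close()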


The dumps are not enough; one has to scan the actual file content to assess the quality.
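One cheap heuristic for that scan: flag copies whose size per page is far above the norm, which usually means uncompressed or oversized scans. A sketch assuming pikepdf; the paths and the 1 MB/page threshold are placeholders, not an established cutoff:

  import os
  import pikepdf

  def bytes_per_page(path):
      with pikepdf.open(path) as pdf:
          return os.path.getsize(path) / max(len(pdf.pages), 1)

  for path in ["copy_a.pdf", "copy_b.pdf"]:  # hypothetical duplicate copies
      bpp = bytes_per_page(path)
      flag = "  <- likely unoptimized" if bpp > 1_000_000 else ""
      print(f"{path}: {bpp / 1000:.0f} kB/page{flag}")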

Are you alone in your analysis, or are there groups who try to improve lg?


Such efforts have been made in the past, but each time they stalled at some point because of the complexity. A workgroup could be formed to tackle it, though.



