First of all, bloat has nothing to do with file size -- EPUBs are often around 2 MB, typeset PDFs are often 2-10 MB (depending on the quantity of illustrations), and scanned PDFs are anywhere from 10 MB (if reduced to black and white) to 100 MB (for color scans, which are necessary for full-color illustrations).
The idea of a 30 MB cutoff does nothing to reduce bloat; it just removes many of the most essential textbooks. :( Also, it's very rare to see duplicates of 100 MB PDFs.
Second, file duplication is there, but it's not really an unwieldy problem right now. Probably the majority of titles have only a single file, many have 2-5 versions, and a tiny minority have 10+. But they're often useful variants -- different editions (2nd, 3rd, 4th) plus alternate formats like reflowable EPUB vs PDF scan. These are all genuinely useful and need to be kept.
Most of the unhelpful duplication I see tends to fall into three categories:
1) There are often 2-3 versions of the identical typeset PDF, differing only in the resolution of the cover page image. That one baffles me -- zero idea who uploads the extras or why. My best guess is a bot that re-uploads versions with a lower-res cover page? But it's usually something like a 2.7 MB original becoming 2.3 MB, not a big difference. Feels very unnecessary to me.
2) People (or a bot?) who seem to take EPUBs and produce PDF versions. I can understand how that could be done in a helpful spirit, but honestly the resulting PDFs are so abysmally ugly that I really think people are better off producing their own PDFs using e.g. Calibre, with their own desired paper size, font, etc. Unless there's no original EPUB/MOBI on the site, PDF conversions of them should be discouraged IMHO.
3) A very small number of titles do genuinely have like 5+ seemingly identical EPUB versions. These are usually very popular bestselling books. I'm totally baffled here as to why this happens.
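For category 3, one way to check whether "seemingly identical" copies really are byte-for-byte the same is to hash them. Here's a rough sketch in Python, assuming you've already downloaded the copies locally. The function names are my own, not anything the site provides. Note it would not catch category 1, since a re-compressed cover image changes the file's bytes.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Group file paths whose contents are byte-for-byte identical.

    Returns only the groups that contain more than one file.
    """
    groups = defaultdict(list)
    for p in paths:
        groups[sha256_of(Path(p))].append(str(p))
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

If the 5+ EPUBs of a bestseller all land in one group, they're true duplicates; if they don't, something (metadata, cover, formatting) differs between them even if they look the same in a reader.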
It does seem like it would be a nice feature to be able to leave some kind of crowdsourced comments/flags/annotations to help future downloaders figure out which version is best for them (e.g. is this PDF an original typeset, a scan, or a conversion? -- metadata from the uploader is often missing or inaccurate here). But for a site that operates on anonymity, it seems like this would be too open to abuse/spamming. Being able to delete duplicates opens the door to accidental or malicious deletion of anything. I'd rather live with the "bloat"; it's really not an impediment to anything at the moment.