IMHO a process which is lossy should never be described as deduplication.
What would work out fairly well for this use case is to group files by similarity, and compress them with an algorithm which can look at all 'editions' of a text.
This should mean that storing a PDF with a (perhaps badly, perhaps brilliantly) type-edited version next to it would 'weigh' about as much as the original PDF plus a patch.
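To make that concrete, here's a rough sketch of the idea in Python. Everything specific here is my own illustrative choice, not something from the thread: the "library" directory, the byte-shingle size, the 0.3 similarity threshold, and stdlib lzma standing in for "an algorithm which can look at all editions".

```python
# Sketch: group files by content similarity, then compress each group as one
# solid stream so a long-window compressor can reuse text shared between
# 'editions'. All thresholds and names are illustrative assumptions.
import lzma
from pathlib import Path

def shingles(data: bytes, k: int = 8) -> set:
    # Non-overlapping byte k-grams: crude but format-agnostic similarity signal.
    return {data[i:i + k] for i in range(0, max(len(data) - k, 0), k)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def group_by_similarity(paths, threshold: float = 0.3):
    # Greedy grouping: a file joins the first group whose representative it
    # resembles closely enough, otherwise it starts a new group.
    sigs = {p: shingles(p.read_bytes()) for p in paths}
    groups: list[list[Path]] = []
    for p in paths:
        for g in groups:
            if jaccard(sigs[p], sigs[g[0]]) >= threshold:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

def solid_size(paths) -> int:
    # One compressed stream per group, so text shared between 'editions'
    # is effectively stored only once.
    return len(lzma.compress(b"".join(p.read_bytes() for p in paths), preset=9))

if __name__ == "__main__":
    library = Path("library")  # hypothetical corpus directory
    files = sorted(p for p in library.glob("*") if p.is_file())
    for group in group_by_similarity(files):
        separate = sum(len(lzma.compress(p.read_bytes(), preset=9)) for p in group)
        print([p.name for p in group], "solid:", solid_size(group), "separate:", separate)
```

In real use you'd want MinHash/LSH rather than pairwise Jaccard, and a compressor with a genuinely long match window (zstd's --long mode, or a trained dictionary) rather than plain xz, but the shape is the same.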
> IMHO a process which is lossy should never be described as deduplication.
Depends. There are going to be cases where files aren't literally duplicates, but the extra copies don't add any value -- for example, MOBI conversions of EPUB files, or multiple versions of an EPUB with different publisher-inserted content (like adding a preview of a sequel, or updating an author's bibliography).
Splitting those into two cases: I think getting rid of format conversions (which can, after all, be performed again) is worthwhile, but it isn't deduplication; that's more like pruning.
Multiple versions of an EPUB with slightly different content is exactly the case where a compression algorithm with an attention span, and some metadata to work with, can get the multiple copies down enough in size that there's no point in disposing of the unique parts.
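To put a rough number on "no point in disposing of the unique parts", here's a toy measurement of the marginal cost of a near-duplicate edition when it shares a compression stream with the original. The sample text is invented, and stdlib lzma again stands in for a long-window compressor:

```python
# Sketch: marginal cost of keeping a near-duplicate 'edition' when it is
# compressed in the same stream as the original. Sample data is invented.
import lzma

original = b"Chapter 1. It was a dark and stormy night. " * 2000
revised  = original.replace(b"stormy", b"blustery") + b"\n[Preview of the sequel...]" * 20

alone      = len(lzma.compress(original, preset=9))
both       = len(lzma.compress(original + revised, preset=9))
separately = alone + len(lzma.compress(revised, preset=9))

print(f"original alone:           {alone} bytes")
print(f"both in one stream:       {both} bytes (marginal cost ~{both - alone})")
print(f"compressed independently: {separately} bytes")
```

The marginal cost of the second copy in a shared stream is roughly the size of its unique content, i.e. about a patch, which is the argument for keeping it rather than pruning it.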