IMHO a process which is lossy should never be described as deduplication.
What would work out fairly well for this use case is to group files by similarity, and compress them with an algorithm which can look at all 'editions' of a text.
This should mean that storing a PDF with a (perhaps badly, perhaps brilliantly) type-edited version next to it would 'weigh' about as much as the original PDF plus a patch.
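To make that concrete, here's a rough sketch of the idea in Python. Everything specific here is my own illustrative choice, not something from the thread: the "library" directory, the byte-shingle size, the 0.3 similarity threshold, and stdlib lzma standing in for "an algorithm which can look at all editions".

```python
# Sketch: group files by content similarity, then compress each group as one
# solid stream so a long-window compressor can reuse text shared between
# 'editions'. All thresholds and names are illustrative assumptions.
import lzma
from pathlib import Path

def shingles(data: bytes, k: int = 8) -> set:
    # Non-overlapping byte k-grams: crude but format-agnostic similarity signal.
    return {data[i:i + k] for i in range(0, max(len(data) - k, 0), k)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def group_by_similarity(paths, threshold: float = 0.3):
    # Greedy grouping: a file joins the first group whose representative it
    # resembles closely enough, otherwise it starts a new group.
    sigs = {p: shingles(p.read_bytes()) for p in paths}
    groups: list[list[Path]] = []
    for p in paths:
        for g in groups:
            if jaccard(sigs[p], sigs[g[0]]) >= threshold:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups

def solid_size(paths) -> int:
    # One compressed stream per group, so text shared between 'editions'
    # is effectively stored only once.
    return len(lzma.compress(b"".join(p.read_bytes() for p in paths), preset=9))

if __name__ == "__main__":
    library = Path("library")  # hypothetical corpus directory
    files = sorted(p for p in library.glob("*") if p.is_file())
    for group in group_by_similarity(files):
        separate = sum(len(lzma.compress(p.read_bytes(), preset=9)) for p in group)
        print([p.name for p in group], "solid:", solid_size(group), "separate:", separate)
```

In real use you'd want MinHash/LSH rather than pairwise Jaccard, and a compressor with a genuinely long match window (zstd's --long mode, or a trained dictionary) rather than plain xz, but the shape is the same.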
> IMHO a process which is lossy should never be described as deduplication.
Depends. There are going to be cases where files aren't literally duplicates, but the extra copies don't add any value -- for example, MOBI conversions of EPUB files, or multiple versions of an EPUB with different publisher-inserted content (like adding a preview of a sequel, or updating an author's bibliography).
Splitting those into two cases: I think getting rid of format conversions (which can, after all, be performed again) is worthwhile, but it isn't deduplication; that's more like pruning.
Multiple versions of an EPUB with slightly different content is exactly the case where a compression algorithm with an attention span, and some metadata to work with, can get the multiple copies down enough in size that there's no point in disposing of the unique parts.
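To put a rough number on "no point in disposing of the unique parts", here's a toy measurement of the marginal cost of a near-duplicate edition when it shares a compression stream with the original. The sample text is invented, and stdlib lzma again stands in for a long-window compressor:

```python
# Sketch: marginal cost of keeping a near-duplicate 'edition' when it is
# compressed in the same stream as the original. Sample data is invented.
import lzma

original = b"Chapter 1. It was a dark and stormy night. " * 2000
revised  = original.replace(b"stormy", b"blustery") + b"\n[Preview of the sequel...]" * 20

alone      = len(lzma.compress(original, preset=9))
both       = len(lzma.compress(original + revised, preset=9))
separately = alone + len(lzma.compress(revised, preset=9))

print(f"original alone:           {alone} bytes")
print(f"both in one stream:       {both} bytes (marginal cost ~{both - alone})")
print(f"compressed independently: {separately} bytes")
```

The marginal cost of the second copy in a shared stream is roughly the size of its unique content, i.e. about a patch, which is the argument for keeping it rather than pruning it.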