I know they dedupe at the file level, but I wonder if they are doing block-level deduping... Without a big shared storage infrastructure, block-level deduping becomes pretty hard to serve at high speed, since the reads for a single file can end up distributed across hundreds of nodes... (a rough sketch of why is below).
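To make that scatter concrete, here's a minimal sketch of block-level dedupe, assuming fixed-size 4 MiB chunks, SHA-256 content addressing, and hash-based placement (the chunk size, store layout, and placement scheme are all hypothetical, not anything they've described): each file becomes a recipe of chunk hashes, and reassembling it can touch a different node for every chunk.

    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024   # hypothetical 4 MiB fixed-size chunks

    chunk_store = {}       # content hash -> chunk bytes (stand-in for "some node's disk")
    chunk_location = {}    # content hash -> node id, to show how reads scatter


    def write_file(data: bytes, num_nodes: int = 100) -> list[str]:
        """Split a file into chunks, dedupe by content hash, return its recipe."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in chunk_store:               # only store chunks never seen before
                chunk_store[digest] = chunk
                chunk_location[digest] = hash(digest) % num_nodes   # toy placement
            recipe.append(digest)
        return recipe


    def read_file(recipe: list[str]) -> bytes:
        """Reassemble a file; each chunk may live on a different node."""
        nodes_touched = {chunk_location[d] for d in recipe}
        print(f"read touches {len(nodes_touched)} node(s)")   # the scatter problem
        return b"".join(chunk_store[d] for d in recipe)

The point is just that a large, well-deduped file's chunks are spread by content hash rather than locality, so a sequential read turns into many small reads fanned out across the cluster.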
To build out web-scale systems you generally use commodity gear and accept the overhead of duplication. Heavy deduping requires massive IO, and I can't see how you could handle that much data, sustain that level of IOPS, and still be profitable charging what they charge.
FYI, the data is already comprehensively de-duped.