> Now you could De-dupe... but the hardware to process and dedupe that much data.... not really an option.

FYI, the data is already comprehensively de-duped.

I know they dedupe at the file level, but I wonder if they are doing block-level deduping... without a big shared storage infrastructure, block-level dedupe becomes pretty hard to serve at high speed, since the reads for a single file can end up distributed across hundreds of nodes.
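
For anyone unfamiliar with the distinction: file-level dedup keys on a hash of the whole file, while block-level dedup splits files into chunks and keys on each chunk's hash, so reconstructing a file means gathering blocks that may live on many different nodes. A minimal sketch of the block-level idea (fixed-size blocks, SHA-256 as the block ID, and an in-memory dict standing in for a distributed block store -- all assumptions for illustration, not the provider's actual design):

    import hashlib

    BLOCK_SIZE = 4096  # hypothetical block size, purely for illustration

    def dedupe_file(path, block_store):
        """Split a file into fixed-size blocks, store each unique block once,
        and return the ordered list of block hashes (the file's 'recipe')."""
        recipe = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha256(block).hexdigest()
                # Only store the block if its hash hasn't been seen before.
                if digest not in block_store:
                    block_store[digest] = block
                recipe.append(digest)
        return recipe

    def restore_file(recipe, block_store):
        """Reassemble file contents from the recipe; in a real system these
        lookups would fan out across whichever nodes hold each block."""
        return b"".join(block_store[d] for d in recipe)
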

To build out web-scale systems you generally use commodity gear and accept the overhead of duplication. Heavy deduping requires massive I/O, and I don't see how you can be dealing with that much data, sustain that level of IOPS, and still be profitable charging what they charge.
