I'm curious about the setting in which you saw these failures, could you elaborate?
Unlike a plain checksum, CRC-32C is hardened against bias, which means its distribution is not far from that of an ideal checksum. This means if your bitrot is random and you're using 16KB blocks, you will need to see on the order of (2^32 * 16KB) = 64TB of corrupted data to get a random failure. Modern hard drives corrupt data at a rate of once every few terabytes. TCP without SSL (because of its highly imperfect checksum) corrupts data at a rate of once every few gigabytes. Assuming an extremely bad scenario of a corrupt packet once every 1 GB, in theory you'd need to read more than a zettabyte of data to get a random CRC-32C failure. I'm not doubting that real world performance could be much worse, but I'd like to understand how.
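To make the back-of-the-envelope concrete, here is a rough sketch in Python; the 16KB block size and the 2^-32 per-block miss probability are just the assumptions behind the figures above, not properties of any particular implementation:

    # Rough back-of-the-envelope: corrupted data volume before CRC-32C is
    # expected to miss a corruption, assuming each corrupt 16 KiB block
    # slips past the check independently with probability 2**-32.
    BLOCK = 16 * 1024        # bytes per block (assumption from above)
    P_MISS = 2.0 ** -32      # assumed per-block probability of an undetected error

    expected_bad_blocks = 1 / P_MISS                # ~4.3 billion corrupt blocks
    expected_bad_bytes = expected_bad_blocks * BLOCK
    print(expected_bad_bytes / 2**40, "TiB")        # 64.0 TiB of *corrupted* data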
> This means if your bitrot is random and you're using 16KB blocks, you will need to see on the order of (2^32 * 16KB) = 64TB of corrupted data to get a random failure.
No, that's what you need to generate one guaranteed failure when enumerating all the different corruption possibilities, simply because a 32-bit number can represent at most 2^32 different states.
In practice, you'd have a 50% probability of a collision for every 32 TB... assuming a perfect distribution.
By the way, 32 TB takes just 4-15 hours to read from a modern SSD. A terabyte just isn't that much data nowadays.
Just a nit: you don't get a guaranteed failure at 64TB; you get a failure with approx 1 - 1/e ~= 63% probability. At 32TB you get a failure with approx 1 - 1/sqrt(e) ~= 39% probability.
I do agree that tens of TBs is not much data, but keep in mind what this probability means: you need to feed your checksum 64TB worth of 16KB blocks, every one of them corrupt, for at least one of them to go through unnoticed with 63% probability. So you can't just calculate with the raw throughput of your SSD; you have to multiply that throughput by the corruption rate.
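As a sanity check on those figures, here's a minimal Python sketch; the 2^-32 per-block miss probability is the same assumption as above, and the one-corrupt-block-per-terabyte rate at the end is an arbitrary number chosen purely to illustrate the throughput-times-corruption-rate point:

    import math

    BLOCK = 16 * 1024        # bytes per block (assumption from above)
    P_MISS = 2.0 ** -32      # assumed per-block probability of an undetected error

    def p_undetected(corrupt_bytes):
        """Probability that at least one corrupt block goes unnoticed."""
        n = corrupt_bytes / BLOCK
        # 1 - (1 - p)**n, computed in a numerically stable way
        return -math.expm1(n * math.log1p(-P_MISS))

    print(p_undetected(32 * 2**40))   # ~0.39 for 32 TiB of corrupt blocks
    print(p_undetected(64 * 2**40))   # ~0.63 for 64 TiB of corrupt blocks

    # Throughput x corruption rate: if only one 16 KiB block per TiB read is
    # corrupt (arbitrary illustrative rate), accumulating 64 TiB of corrupt
    # blocks means reading 2**32 TiB in total.
    corrupt_fraction = BLOCK / 2**40                       # assumed corruption rate
    print((64 * 2**40) / corrupt_fraction / 2**40, "TiB")  # ~4.3e9 TiB of total reads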
For disk storage, CRC-32C is still not broken. You can't say the same about on-board communication protocols or even some LANs.
When people started using CRC-32 it was because, with the technology of the time, it was virtually impossible to see collisions. Nowadays we are discussing whether it's reasonable to expect a data volume that gives you a 40% or 60% chance of a collision.
The end of CRC32 is way overdue. We should standardize on a CRC64 algorithm soon, or we'll have our hands forced and probably end up stuck with a bad choice.