And my point: assuming a random error, your SHA1 is no better than a simple checksum.
So what's your error model? You make a big post about the nature of errors but you fail to specify the distribution of errors. If your distribution of errors is uniformly random, then a 128-bit checksum is optimal.
The reason a plain checksum isn't used is that, in practice, real data contains long strings of "0"s, and so do the error patterns on communication channels. A simple checksum fails to distinguish between 20 0s in a row and 200 0s in a row. Is that a problem you expect to find in files?
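To make that concrete, here's a quick Python sketch (the checksum function and file contents are my own illustrative choices) of a plain additive checksum colliding on different zero-run lengths:

    def sum_checksum(data: bytes) -> int:
        # plain additive checksum: add every byte, mod 2**32
        return sum(data) % (1 << 32)

    short_run = b"header" + b"\x00" * 20 + b"trailer"
    long_run = b"header" + b"\x00" * 200 + b"trailer"

    # zero bytes contribute nothing to the sum, so both "verify" as identical:
    assert sum_checksum(short_run) == sum_checksum(long_run)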
If your distribution of errors is bursty, then 128-bit CRC is optimal.
What distribution of errors are you assuming for SHA1? What kind of file errors are you expecting that makes SHA1 better than other methodologies?
---------
If all errors are random and you also want to handle the "multiple 0s" problem, then an Adler32-like check is sufficient to fix both, and the Adler32 methodology obviously extends out to 128 bits (just keep two 64-bit sums).
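Roughly like this (the recurrence below is the real Adler-32 one; the 128-bit widening would just run the same two sums at 64 bits each, with some prime modulus below 2**64 of your choosing):

    MOD = 65521  # largest prime below 2**16, as in real Adler-32

    def adler32_like(data: bytes) -> int:
        a, b = 1, 0
        for byte in data:
            a = (a + byte) % MOD  # plain running sum of the bytes
            b = (b + a) % MOD     # second, position-sensitive sum
        return (b << 16) | a

    # The second sum grows with every byte, even zeros, so zero runs of
    # different lengths produce different checks:
    assert adler32_like(b"\x00" * 20) != adler32_like(b"\x00" * 200)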
CRC, specifically, is designed to optimally check for burst errors up to its full length (a 32-bit CRC finds all 32-bit burst errors; a 128-bit CRC would find all 128-bit burst errors). That's all CRC is, and that's all CRC is designed / math'd out to do.
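You can spot-check that property with Python's built-in CRC-32 (the real guarantee comes from the polynomial math; this is just an empirical sanity check, with sizes and iteration counts picked arbitrarily):

    import random
    import zlib

    data = bytes(random.randrange(256) for _ in range(4096))
    good = zlib.crc32(data)
    nbits = len(data) * 8

    for _ in range(10_000):
        width = random.randrange(1, 33)  # burst of 1..32 bits
        # force the end bits on, so the burst really spans 'width' bits
        mask = random.getrandbits(width) | 1 | (1 << (width - 1))
        start = random.randrange(nbits - width + 1)
        corrupted = int.from_bytes(data, "big") ^ (mask << start)
        # CRC-32 must flag every burst no wider than its 32 bits:
        assert zlib.crc32(corrupted.to_bytes(len(data), "big")) != good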
The key thing about cryptographic hashes (not mere checksums) like SHA1 is that the distribution of the errors doesn't matter. Effectively, they're all the same. That's the point. If mere "runs" of bits were sufficient to trigger a collision, then the hash wouldn't be strong enough for cryptography!
This means that you can simply throw out any such modelling, as it is no longer relevant. You "care" only about bit-error rates and hash-collision rates, but even mere SHA1 is so thoroughly past the BER that it is essentially perfect. That is, it is indistinguishable from perfect on all practical computers, for all intents and purposes.
CRC codes have their uses, but if you just need to detect any corruption of a large-ish file (over 10KB), then cryptographic hashes are both fast and "perfect" in this physical sense. You will never get a collision with SHA256 or SHA512, even including adversarial, crafted inputs. The same "strength" attribute does not hold for CRC codes; they're vulnerable to deliberate corruption by an attacker.
So in that sense, SHA hashes are stronger than CRC checksums.
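For the plain file-validation case, the usual pattern is just streaming the file through the hash, something like this sketch (the function name and 1 MB chunk size are my own choices):

    import hashlib

    def file_digest(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # stream in chunks so large files never need to fit in memory
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # Validation is then a single comparison against the known-good digest.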
The birthday bound says that a perfect 160-bit cryptographic hash is expected to see a collision after roughly 2^80 samples, on average. This means that an 80-bit burst error would probabilistically contain a potential SHA1 collision. (An 80-bit burst error doesn't mean that all 80 bits are flipped, btw: it means that 80 bits have been randomized.)
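The arithmetic, if you want to check it (this is just the standard birthday-bound approximation, nothing specific to this thread):

    import math

    bits = 160                 # SHA1 output width
    space = 2.0 ** bits
    # samples needed for a ~50% chance of any collision:
    n_half = math.sqrt(2 * math.log(2) * space)
    print(math.log2(n_half))   # ~80.2, i.e. roughly 2**80 samples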
In contrast, CRC is designed specifically against burst errors. CRC is "regular" and "tweaked" in such a way that a 160-bit CRC would catch 160-bit burst errors of any and all kinds!
So if you care about burst errors, then CRC is, in fact, better than crypto-level hashes. And burst errors are the primary errors that occur in practice (scratches on a CD-ROM, bad sectors on a hard drive, a lightning storm cutting out a few microseconds of WiFi, etc.).
That is: noise isn't random in the real world. Noise is "clustered" around bursty events in practice.
--------
If burst errors are king, you can do far, far better than random methodologies. CRC is proof of that. That's why error distributions matter.
The birthday attack only applies when you're comparing a large set of samples against each other. An example would be a "content-based indexing" system where a database primary key is the hash. Every insert then compares the hash against every entry that already exists. If there are 1 billion stored items, each single insert has a potential collision against all 1 billion.
For validation, you have 1 input being compared against 1 valid value (or its hash/crc). There's no "billion inputs" in this scenario... just 1 potentially corrupt vs 1 known good.
Hence, no birthday attack.
It's the difference between two random people meeting and having the same birthday, versus any two people in a room full of people having the same birthday. Not the same scenario!
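You can put numbers on that difference with the standard approximations (160-bit hash; the billion-entry store is illustrative):

    import math

    bits = 160
    space = 2.0 ** bits

    # Validation: one possibly-corrupt file vs one known-good hash.
    p_single = 1 / space

    # "Room full of people": a content-addressed store with a billion entries.
    n = 1_000_000_000
    p_store = -math.expm1(-n * (n - 1) / (2 * space))  # expm1 keeps precision

    print(p_single)  # ~6.8e-49
    print(p_store)   # ~3.4e-31: still tiny, but about n**2/2 times more likely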
In practice, cryptographic hashes are always superior to checksums once both have more than 128 bits. They're both strong enough, but the cryptographic hash is resistant to deliberate attacks. The CRC won't be.