And it can also give guaranteed distance properties: you can, for example, guarantee that every corruption of Hamming distance less than D within a message shorter than W will be detected, which is a massive improvement in detection performance when that is exactly what you need to catch.
It (and particularly truncated hashes made from it) is not _guaranteed_ to detect small corruptions: it will falsely accept about 1 in 2^bits of them. A well-matched CRC can give guaranteed detection even with a very short check value.
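As a quick illustration that this is a hard guarantee rather than a probabilistic one, here is a small Python sketch that exhaustively flips every 1- and 2-bit pattern in a short random message and confirms that zlib's CRC-32 notices all of them. The 512-bit message length is an arbitrary choice, well inside the range for which CRC-32's Hamming-distance guarantee is known to hold.

    # Exhaustively verify that CRC-32 catches every 1- and 2-bit error in a
    # short message. The message length is illustrative; the guarantee only
    # holds up to a polynomial-specific length bound.
    import itertools
    import os
    import zlib

    MSG_BITS = 512  # 64-byte message, well inside CRC-32's guaranteed range

    def flip_bits(data: bytes, positions) -> bytes:
        buf = bytearray(data)
        for pos in positions:
            buf[pos // 8] ^= 1 << (pos % 8)
        return bytes(buf)

    msg = os.urandom(MSG_BITS // 8)
    good = zlib.crc32(msg)

    undetected = 0
    for k in (1, 2):
        for positions in itertools.combinations(range(MSG_BITS), k):
            if zlib.crc32(flip_bits(msg, positions)) == good:
                undetected += 1

    print("undetected 1- and 2-bit errors:", undetected)  # expect 0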
True, but one must not forget that the good error detection properties of CRCs are guaranteed only for limited data lengths.
CRCs of the usual sizes, up to 32 bits, are perfectly fine for data sizes in the kilobyte range, but they become poor at megabyte sizes and completely useless at gigabyte sizes.
CRCs have a domain of application in which they are excellent, but sometimes they are misused, e.g. in various archive file formats where archive integrity is checked with a CRC that is too short instead of a hash of appropriate length, such as SHA-1 or SHA-256. The longer hashes may be truncated, e.g. to 128 bits, which is likely enough for current file and file-system sizes, but 64 bits or less is not enough for the kind of files that are typically encountered now.
The size needed for an error detection code depends on the size of the data from which it is computed, on the error rate, and on the time that passes between when the error detection code is generated and when it is verified.
With data sent through communication channels, the error rate may be high, but the data reaches its destination after a few seconds. On an archival storage medium, the error rate may be very low, but you might check for corruption only some years later.
If you want to decide what error-detection length is acceptable, the best way is to run a simulation: generate random data of the expected length, apply random errors of the expected number and kind, and then test the alternative error detection codes to see whether they detect the errors or not.
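A minimal sketch of such a simulation in Python, using only the standard library: the data size, the burst-style error model and the trial count below are placeholders to be replaced with whatever the real application expects, and the two candidates compared (zlib's CRC-32 and SHA-256 truncated to 64 bits) are only examples.

    # Monte-Carlo sketch: corrupt random data with random byte bursts and count
    # how often each candidate check value fails to notice. All parameters
    # (sizes, burst model, trial count) are placeholders for the real workload.
    import hashlib
    import os
    import random
    import zlib

    DATA_SIZE = 1 << 20      # 1 MiB per trial (assumed)
    TRIALS = 1000
    MAX_BURST = 16           # corrupt up to 16 consecutive bytes (assumed model)

    def crc32_check(data: bytes) -> bytes:
        return zlib.crc32(data).to_bytes(4, "little")

    def sha256_64_check(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()[:8]    # truncated to 64 bits

    def corrupt(data: bytes) -> bytes:
        buf = bytearray(data)
        start = random.randrange(len(buf))
        length = random.randint(1, MAX_BURST)
        for i in range(start, min(start + length, len(buf))):
            buf[i] ^= random.randint(1, 255)        # guaranteed to change the byte
        return bytes(buf)

    missed = {"crc32": 0, "sha256/64": 0}
    for _ in range(TRIALS):
        data = os.urandom(DATA_SIZE)
        bad = corrupt(data)
        if crc32_check(bad) == crc32_check(data):
            missed["crc32"] += 1
        if sha256_64_check(bad) == sha256_64_check(data):
            missed["sha256/64"] += 1

    print(missed)  # with this burst model both counts should be 0; adjust the
                   # error model and trial count to match the corruption you expect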
I expect that for data sizes of a few megabytes a 64-bit CRC should be OK.
From my experiments with gigabyte-sized files, 32-bit CRCs were completely inadequate, 128-bit hashes were OK and 64-bit CRCs were inconclusive.
(By OK and inconclusive, I mean that 128-bit hashes detected all the errors generated, but I have seen 2 cases not detected by CRC-64 in several tens of millions of tests at a 5 GB file size, which was more than expected but statistically insignificant.)
Because 128-bit to 256-bit non-CRC hashes, e.g. Poly1305, SHA-1, SHA-256 or BLAKE3, are very fast on modern CPUs (much faster than any CRC implementation that does not use PCLMUL or equivalent instructions, and of similar speed to a 64-bit CRC with PCLMUL once the overhead for data movement is accounted for), and because there is very little published information about the properties of the few known 64-bit CRCs, I have not made any extra experiments to confirm whether a 64-bit CRC might be adequate. Using any 128-bit hash was more certain to give good results, without any disadvantages.
On the other hand, for a hardware implementation it would make sense to determine whether a 64-bit CRC is good enough for a certain application.
The "expensive" trick, if you really care about the data, is to go to full Reed-Solomon error correction (aka: RAID5, RAID6, etc.)... which is built up from the same theory as CRCs, except more comprehensive about errors, providing error correction as well as detection.
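The simplest member of that family is the single-parity RAID5 case, which is enough to rebuild any one block whose location is known (an erasure rather than a silent error). The Python sketch below shows only that degenerate XOR case; full Reed-Solomon, as used for RAID6 and beyond, generalizes it with Galois-field arithmetic so that multiple lost blocks can be recovered.

    # RAID5-style sketch: one XOR parity block lets you reconstruct any single
    # lost data block, provided you know which one is missing. Block count and
    # size are illustrative; Reed-Solomon generalizes this to multiple losses.
    import os
    from functools import reduce

    def xor_blocks(blocks):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data_blocks = [os.urandom(4096) for _ in range(4)]
    parity = xor_blocks(data_blocks)

    # "Lose" block 2, then rebuild it from the survivors plus the parity.
    lost_index = 2
    survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
    rebuilt = xor_blocks(survivors + [parity])

    assert rebuilt == data_blocks[lost_index]
    print("block", lost_index, "reconstructed from parity")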
> If you want to decide what error-detection length is acceptable, the best way is to run a simulation: generate random data of the expected length, apply random errors of the expected number and kind, and then test the alternative error detection codes to see whether they detect the errors or not.
If that error is uniformly random, then a 128-bit sum is optimal and no different from a 128-bit hash.
The reason we have this debate at all is that different groups of people make different assumptions about the distribution of errors.
EDIT: I think it is reasonable to assume that burst errors would occur on a storage medium: perhaps per sector (hard drive) or per block (SSD). Reed-Solomon as applied to RAID assumes, for example, that an entire hard drive fails all at once.
As such: the key when discussing checksums / error detection / error correction is figuring out exactly how errors occur, and what to do about them (aka: detect vs correct).
The more modern schools of thought force errors to look as random as possible (hashes and LDPC), while the older schools of thought analyze the nature of the errors and focus on them (CRC and Reed-Solomon).
> If that error is uniformly random, then a 128-bit sum is optimal and no different from a 128-bit hash.
If the data you read is uniformly random, then what you say is correct. But if there is a single error (bit/byte/word, whatever) that is placed uniformly at random and takes a uniformly random value, then other codes are vastly stronger than a hash.
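To make that difference visible with a feasible number of trials, here is a Python sketch that uses deliberately tiny 8-bit check values (the CRC-8 polynomial 0x07 and the message size are illustrative choices): a single corrupted byte is a burst no wider than the CRC's degree, so the CRC-8 catches every one, while an 8-bit truncation of SHA-256 misses about 1 in 256.

    # Single uniformly random byte errors: a CRC whose degree covers the burst
    # detects all of them, while a truncated hash of the same width misses
    # about 1 in 2^bits. Tiny 8-bit check values are used only so the misses
    # become visible in a modest number of trials.
    import hashlib
    import os
    import random

    def crc8(data: bytes, poly: int = 0x07) -> int:
        """Bitwise CRC-8 (MSB first, zero init) of the data."""
        crc = 0
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
        return crc

    def hash8(data: bytes) -> int:
        return hashlib.sha256(data).digest()[0]   # hash truncated to 8 bits

    TRIALS = 50_000
    data = os.urandom(64)
    good_crc, good_hash = crc8(data), hash8(data)

    missed_crc = missed_hash = 0
    for _ in range(TRIALS):
        bad = bytearray(data)
        bad[random.randrange(len(bad))] ^= random.randint(1, 255)  # one bad byte
        bad = bytes(bad)
        missed_crc += crc8(bad) == good_crc
        missed_hash += hash8(bad) == good_hash

    print(f"CRC-8 missed {missed_crc}, truncated hash missed {missed_hash} of {TRIALS}")
    # expected: the CRC misses none; the truncated hash misses roughly TRIALS / 256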
While what you say is technically correct, in that a 'uniformly random error' means replacing the input with an entirely independent input, that isn't what most people would call an 'error', and it isn't the kind of error that most applications of a check value are meant to detect.
Also, it must be kept in mind that CRCs, i.e. the remainders of polynomial division, are not the only simple way to obtain predictable error properties.
A simple alternative is to evaluate the polynomial corresponding to the data at a certain point and use the resulting value as the error detection code. The guarantees for short bursts are identical.
Polynomial evaluation has an advantage over a CRC at large hash sizes, because it matters very little at which point you choose to evaluate the polynomial, as long as it is neither zero nor one.
With CRCs, the choice of the primitive polynomial to be used as divisor is critical and you cannot choose just a random 128-bit sequence and hope that it happens to correspond to a primitive polynomial and also to one with good error detection properties.
For example, if you want guaranteed error-burst detection for a data size where the known 32-bit or 64-bit CRCs are inadequate, you may use the Poly1305 evaluation functions that most cryptographic libraries provide for data authentication.
When used in cryptography, the point chosen for the polynomial evaluation must be secret, but for error detection it may be fixed and public, and you get error detection properties equivalent to those of a 128-bit CRC.
An optimized 128-bit CRC might be a little better, but good luck finding the right 128-bit primitive polynomial.
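A minimal sketch of the idea in Python, using only the standard library: the data is split into 16-byte blocks that become the coefficients of a polynomial, which is evaluated by Horner's rule at a fixed public point modulo the Poly1305 prime 2^130 - 5. The block padding follows Poly1305, but the secret key and the final key addition are omitted, so this is an error-detection code rather than an authenticator; the evaluation point below is an arbitrary illustrative value.

    # Polynomial evaluation as an error-detection code: Poly1305's prime and
    # block encoding, but with a FIXED, PUBLIC evaluation point instead of a
    # secret key, so this is a checksum, not a MAC. R is an arbitrary choice.
    P = (1 << 130) - 5                        # the Poly1305 prime
    R = 0x0123456789abcdef0123456789abcdef    # fixed public evaluation point

    def poly_eval_check(data: bytes) -> bytes:
        """Evaluate the data polynomial at R modulo P; return 128 bits of it."""
        acc = 0
        for i in range(0, len(data), 16):
            block = data[i:i + 16] + b"\x01"          # Poly1305-style padding
            coeff = int.from_bytes(block, "little")   # block as a coefficient
            acc = (acc + coeff) * R % P               # Horner's rule
        return (acc % (1 << 128)).to_bytes(16, "little")

    if __name__ == "__main__":
        msg = b"some archived data" * 1000
        check = poly_eval_check(msg)

        corrupted = bytearray(msg)
        corrupted[5] ^= 0x40                          # a single-bit error
        assert poly_eval_check(bytes(corrupted)) != check
        print("corruption detected, check =", check.hex())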