Dunno, Mark Adler is a name that needs no introduction or 'flex'. Fun fact about him: with his family name, he comes out at the top of the author list of virtually any project he has participated in, once the authors are ordered alphabetically.
We used CPU intrinsics in version 9 of GNU coreutils, for the cksum utility. From the release notes:
"cksum [-a crc] is now up to 4 times faster by using a slice by 8 algorithm, and at least 8 times faster where pclmul instructions are supported."
Implementing that portably is a bit tricky, as one must consider:
- support various compilers which may not support intrinsics
- runtime checks to see whether the current CPU supports the instructions (see the dispatch sketch after this list)
- ensure compiler options enabling the instructions are restricted to their own library, so they don't leak into unprotected code
- automake requires using a separate lib for this rather than just a separate compilation unit
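To make the runtime-check point concrete, here is a minimal dispatch sketch (not the coreutils code), assuming GCC/Clang on x86-64. crc32_pclmul() is a hypothetical name for the accelerated routine that would live in its own library built with -mpclmul, and HAVE_PCLMUL_LIB is a hypothetical configure-time guard:

```c
#include <stdint.h>
#include <stddef.h>

/* Portable fallback: reflected CRC-32 (poly 0xEDB88320), bit by bit. */
static uint32_t crc32_bitwise(const unsigned char *buf, size_t len, uint32_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

uint32_t crc32_dispatch(const unsigned char *buf, size_t len, uint32_t crc)
{
#if defined(__GNUC__) && defined(__x86_64__) && defined(HAVE_PCLMUL_LIB)
    /* crc32_pclmul() is hypothetical: the PCLMUL path, compiled separately
     * with -mpclmul so the instructions never leak into generic code. */
    extern uint32_t crc32_pclmul(const unsigned char *, size_t, uint32_t);
    __builtin_cpu_init();
    if (__builtin_cpu_supports("pclmul"))
        return crc32_pclmul(buf, len, crc);
#endif
    return crc32_bitwise(buf, len, crc);
}
```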
A CRC is polynomial arithmetic over a Galois field, and today's CPUs have polynomial multiply (aka PMULL on ARM) or carryless multiply (aka PCLMULQDQ on x86). These instructions can implement the "tough" part of the CRC in just one clock tick.
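For the curious, the x86 building block looks roughly like this; a sketch assuming GCC/Clang and compilation with -mpclmul. A real PCLMUL-based CRC (like the coreutils cksum mentioned above) folds the buffer with repeated multiplications like this plus a final reduction:

```c
#include <stdint.h>
#include <immintrin.h>

/* Carryless (GF(2)) multiply of two 64-bit polynomials via PCLMULQDQ. */
static void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    __m128i r  = _mm_clmulepi64_si128(va, vb, 0x00);             /* low qword x low qword */
    *lo = (uint64_t)_mm_cvtsi128_si64(r);                        /* bits 0..63 of product  */
    *hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(r, r)); /* bits 64..127           */
}
```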
The C code looks really nice and very generic, and, it being a generator... would it be too difficult to make use of those CPU instructions during the code generation? (Then it could even create architecture-specific code, which looks like a plus to me.)
A similar code generator for CRC algorithms is https://pycrc.org/ . I don't know enough to compare the two, but one difference that I notice is that pycrc supports 4-bit lookup tables, whereas crcany does not seem to.
A 4-bit lookup table consumes only 16 elements instead of 256 elements, so for example, the CRC32 table consumes only 64 bytes instead of 1024 bytes. This can be very helpful in microcontroller environments with limited flash memory. On the 32-bit microcontrollers that I have tested (SAMD21, STM32, ESP8266, ESP32, Teensy 3.2), using the 4-bit lookup table is ~4X faster than the bit-by-bit algorithm, but only ~2X slower than the byte-by-byte algorithm, so it can be a good compromise between memory and speed.
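For illustration, a nibble-at-a-time CRC-32 of the kind described above; a sketch assuming the standard reflected polynomial 0xEDB88320 (the 16-entry table occupies 64 bytes):

```c
#include <stdint.h>
#include <stddef.h>

/* 16-entry table: entry i is four bit-by-bit steps applied to nibble i. */
static const uint32_t crc32_nibble_table[16] = {
    0x00000000, 0x1DB71064, 0x3B6E20C8, 0x26D930AC,
    0x76DC4190, 0x6B6B51F4, 0x4DB26158, 0x5005713C,
    0xEDB88320, 0xF00F9344, 0xD6D6A3E8, 0xCB61B38C,
    0x9B64C2B0, 0x86D3D2D4, 0xA00AE278, 0xBDBDF21C,
};

uint32_t crc32_nibble(const unsigned char *buf, size_t len, uint32_t crc)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        crc = (crc >> 4) ^ crc32_nibble_table[crc & 0x0F];  /* low nibble  */
        crc = (crc >> 4) ^ crc32_nibble_table[crc & 0x0F];  /* high nibble */
    }
    return ~crc;
}
```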
If you find yourself debugging the mysteries of hardware-accelerated CRC (for instance in some poorly documented embedded device), this calculator has saved me multiple times. It's pure gold.
I think most CRC libraries should have a rolling implementation built in, i.e. as you read bit by bit, or word by word, it calculates the CRC for the last N bits (a rolling CRC-32 would let you read bit by bit and get the CRC of the last 32 bits read). It's actually relatively straightforward to do, and surprisingly useful, but most libraries don't implement it.
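A rough sketch of one way to do it, assuming a byte-oriented window and a reflected CRC-32 with zero init and no final XOR so the CRC stays purely linear. remove_tab[b] is the raw CRC of byte b followed by WINDOW zero bytes, i.e. the contribution of the byte that just left the window; the caller keeps the window contents (e.g. in a ring buffer) to know the outgoing byte:

```c
#include <stdint.h>
#include <stddef.h>

#define WINDOW 32           /* window size in bytes, chosen for the sketch */
#define POLY   0xEDB88320u  /* reflected CRC-32 polynomial */

static uint32_t remove_tab[256];

static uint32_t crc_byte(uint32_t crc, unsigned char b)
{
    crc ^= b;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (POLY & -(crc & 1));
    return crc;
}

void rolling_crc_init(void)
{
    for (int b = 0; b < 256; b++) {
        uint32_t c = crc_byte(0, (unsigned char)b);
        for (int i = 0; i < WINDOW; i++)    /* byte b followed by WINDOW zero bytes */
            c = crc_byte(c, 0);
        remove_tab[b] = c;
    }
}

/* Slide the window one byte: add the new byte, cancel the oldest one. */
uint32_t rolling_crc_roll(uint32_t crc, unsigned char incoming, unsigned char outgoing)
{
    return crc_byte(crc, incoming) ^ remove_tab[outgoing];
}
```

The usual init/xorout constants only add a window-length-dependent constant on top of this, so the same idea carries over to the standard CRC-32 parameters.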
Generally, given the CPU acceleration available, one should prefer SHA-1 or SHA-256. CRC/MD5 is really only advisable if you're being forced into it for some legit backward compatibility reason. Most embedded CPUs these days can handle SHA variants just fine (I've used SHA-256 on an M3 and, even more crazily, on the Kalimba of a CSR8605, which is a 48-bit DSP with a char size of 24 bits, as well as on its coprocessor, which is a 32-bit CPU with a char size of 16 bits).
For perfectly uniform random errors, it doesn't matter how you hash / error-check. But if there's a certain class of errors you expect, you can do far better than treating them as random.
In particular: CRCs are theoretically perfect/ideal for burst errors. A CRC32 can detect all burst errors of length up to 32 bits (i.e. if the error is confined to a 32-bit span, the CRC32 will always catch it). A CRC extended out to 256-bits would catch all 256-bit bursts, which SHA1 cannot do.
EDIT: "Burst errors" are incredibly common in communications. If there's some kind of electrical noise (static shock or the like) would destroy maybe 20-bits all in a row. Aka: a 20-bit burst error.
Similarly, CD-ROMs were designed primarily to protect against burst errors, because people assumed that CDs would get scratched. I forget exactly how the calculations were done, but some number of "millimeters of damage" (maybe 4mm or 5mm) was assumed as typical, and then CD-ROMs were designed to be resilient even in the face of scratches of that size (or less).
You are right, but I have not seen any published data about the properties of some concrete 128-bit or 256-bit CRC's that could be chosen.
For 32-bit or smaller CRC's there are a large number of publications analyzing various useful polynomials and there is also some published data about a few 64-bit polynomials.
So if you want to use a CRC larger than 64-bit, you must first spend some time to learn and understand very well the theory of CRCs, then search for some suitable polynomials and test them.
It is unlikely that this would be worthwhile in most cases.
In practice, CRCs of up to 64-bit are very useful to detect errors in data packets of limited length, like in the data packets sent and received through some communication protocol, or in the headers or other components of limited size of some structured kind of data file.
CRCs of the available lengths are usually not suitable to check the integrity of entire data files, because in modern computers the files have become too large and too many.
Therefore a file utility like the traditional UNIX cksum has become obsolete when used to compute CRC values for files.
(The latest GNU cksum has become a generic utility, where you can specify other hash algorithms instead of the default CRC-32)
While this is expected due to the short length of CRC-32, I have actually verified it: when computing the CRC-32 values for all the files on my 2 TB SSD, I obtained about seven hundred CRC collisions, i.e. pairs of completely different files that had identical CRC-32 values.
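As a rough sanity check that this is the expected order of magnitude (assuming the files behave like random inputs to CRC-32): with n files, the expected number of colliding pairs is about n(n-1)/2 divided by 2^32, so ~700 collisions corresponds to n ≈ sqrt(2 · 700 · 2^32) ≈ 2.5 million files, which is a plausible file count for a 2 TB drive.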
Even with the obsolete 128-bit MD5 hash there were no collisions, but as another poster mentioned, on modern CPUs with hardware support MD5 is slower than SHA-1 or SHA-256, so there is no reason to ever use it.
While SHA-1 is insecure for digital signatures and other applications in which an opponent must not be able to find collisions, it remains faster than SHA-256, so it is still useful for error detection in many cases, e.g. for detecting whole-file corruption.
> You are right, but I have not seen any published data about the properties of some concrete 128-bit or 256-bit CRC's that could be chosen.
I mean, sure. But what published data do you have about SHA1's error detection properties?
SHA1 doesn't have a Hamming-distance analysis associated with it at all, nor is there any kind of "burst error" analysis (or the like) associated with it. The only analysis you get on SHA1 is differential cryptanalysis, and other such math that has little or no relevance to the kinds of errors we expect to see in data.
It's not too hard to pick a burst-error-detecting CRC-128. In fact, every CRC-128 would detect 128-bit burst errors. The "high quality CRC" part is about the harder questions of Hamming distances and other such properties.
-----
IMO: picking SHA1 is basically saying you don't care about those details. Which is a fine choice to make. But it seems odd to criticize CRC for "not having enough analysis" when SHA1 barely has any at all.
It's been a while, but is it not the other way around?
CRCs can detect up to x error bits in a w-bit message (or correct y of them, for y smaller than x). If there are more errors in a burst, the CRC will not catch it. SHA will detect any error of any size, but you can't correct it, so for many applications, such as CDs, it's not very useful.
IIRC in CDs they solved this by simply spreading the RS correction codes and data around (interleaving), so bursts aren't bursts anymore.
> If there are more errors in a burst, the CRC will not catch it.
*Might* not catch it. Just like how SHA1 might not catch a 3-bit error if you find a collision in just the right spots.
Consider a 1024-bit message in which 1023 bits are in error. The CRC would trivially, and obviously, catch the error, because this is an odd number of bit errors, and any CRC whose generator polynomial includes (x+1) as a factor (as many standard CRCs do) acts as a simple 1-bit parity check against odd-weight errors.
In fact, there are all sorts of errors a CRC is guaranteed to catch. The "burst error" is the one most people care about, but odd-weight errors (ex: 511-bit or 4095-bit errors, thanks to that parity factor) are guaranteed to be caught as well.
---------
CRC is "so regular", it is very easy to prove the specific kinds of errors that can, or cannot, be caught. According to classical theory of communications, there are really only two errors to care about:
1. Random errors -- your methodology doesn't matter. A simple checksum or even XOR is provably optimal against random errors (!!), and indistinguishable from SHA1.
2. Burst errors -- your connection goes dead for a time, and this leads to, say, 10 bits in a row being in error (an electrical storm, a static shock, or some other phenomenon causes the antenna / line / whatever to be dead for 10 microseconds on a 1 Mbaud connection: that's 10 bits gone, aka a 10-bit burst error).
Because #1 doesn't care which methodology you use, it's all about optimizing against #2.
> A CRC extended out to 256-bits would catch all 256-bit bursts, which SHA1 cannot do.
This is technically true, but misses the practical point. The chance of a contiguous 256-bit burst causing a SHA1 hash collision with the original unmodified byte string is 1 in 2^160, which is about 1 per 10^48 for a given input value. This is 1 in ( 1 trillion )^4. Or one per a trillion-trillion-trillion-trillion.
(Remember that the "birthday paradox" doesn't apply: we're not comparing a large set of values against all of the others, we're comparing a value with its may-be-corrupted copy only.)
For reference, typical computer hardware has 1 bit error per hour for each 10^11 to 10^30 bits stored or processed.[1] Just cosmic rays or background radiation alone will cause bit flips in all hardware at a rate comparable to this. While ECC memory and ECC processor caches reduce the error rate, they don't eliminate it. Typical improvements for contiguous errors in ECC DRAM are a factor of 256 (8-bit parity).
This means that even if you had a perfect checksum, the boolean returned by the "crc256()" function itself is a physical bit in memory subject to this corruption! Hence, there's a "noise floor" that no checksum can improve upon, because the output storage is also noisy. It might correctly detect the error, but then its output will be flipped back to "no error" by an errant relativistic particle!
The SHA1 hash collision rate is 10^16 times better than even the optimistic 10^30 error rate for reliable components even with an 8-bit ECC thrown on top! In other words, even on really, really good hardware, you'll get a thousand trillion times more false detections caused by bit flips than missed detections caused by the choice of hash algorithm. That's with SHA1. If you're paranoid, SHA512 is in the league of "not if every atom in the universe was a computer" levels of assurance.
As long as the hash is significantly better than this "system-wide" bit error rate, it is effectively perfect when implemented by practical, physical computers. Further software improvements are pointless without accompanying improvements to the hardware.
PS: This kind of "analog" thinking related to digital computer reliability becomes important at large scales. For example, filesystems like ZFS are sensitive to bit flips. It has a known failure mode where a corruption of certain regions of its Merkle tree will corrupt the whole disk array. Similarly, I've heard of algorithms being redesigned to be tolerant of bit flips, returning at least partially valid data. E.g.: instead of maintaining a count and incrementing/decrementing it over time, prefer to measure the count from scratch every time. That way it'll be correct most of the time, instead of becoming permanently corrupt and staying that way.
And my point: assuming a random error, your SHA1 is no better than a simple checksum.
So what's your error model? You make a big post about the nature of errors but you fail to specify the distribution of errors. If your distribution of errors is uniformly random, then a 128-bit checksum is optimal.
The reason why a checksum isn't used is that, in practice, there are long strings of 0s in data and in errors / communications. A checksum fails to distinguish between 20 zeros in a row and 200 zeros in a row. Is that a problem you expect to find in files?
If your distribution of errors is bursty, then 128-bit CRC is optimal.
What distribution of errors are you assuming for SHA1? What kind of file errors are you expecting that makes SHA1 better than other methodologies?
---------
If all errors are random and you also want to guard against the "multiple 0s" problem, then Adler32-like checks are sufficient to fix both of those problems, and the Adler32 methodology obviously extends out to 128 bits (just have two 64-bit sums; see the sketch below).
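For concreteness, a minimal sketch of such a widened Adler/Fletcher-style check: two 64-bit running sums reduced modulo the Mersenne prime 2^61 - 1 (the modulus choice and layout are illustrative assumptions, not a standardized checksum). The second, position-weighted sum is what makes runs of zeros of different lengths distinguishable:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t a, b; } wide_sum;

/* Adler-32-style check widened to two 64-bit sums (illustrative only). */
wide_sum adler_like_wide(const unsigned char *buf, size_t len)
{
    const uint64_t MOD = (1ull << 61) - 1;   /* Mersenne prime, fits comfortably in 64 bits */
    uint64_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % MOD;   /* plain byte sum */
        b = (b + a) % MOD;        /* weighted sum: grows even across zero bytes */
    }
    return (wide_sum){ a, b };
}
```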
CRC, specifically, is designed to optimally check for burst errors up to its full length (a 32-bit CRC finds all 32-bit burst errors; a 128-bit CRC would find all 128-bit burst errors). That's all CRC is, and that's all CRC is designed / math'd out to do.
The key thing about cryptographic hashes (not mere checksums) like SHA1 is that the distribution of the errors doesn't matter. Effectively, they're all the same. That's the point. If mere "runs" of bits were sufficient to trigger a collision, then the hash wouldn't be strong enough for cryptography!
This means that you can simply throw out any such modelling, as it is no longer relevant. You "care" only about bit-error rates and hash collision rates, but even mere SHA1 is so thoroughly past the BER that it is essentially perfect. That is, it is indistinguishable from perfect on all practical computers, for all intents and purposes.
CRC codes have their uses, but if you just need to detect any corruption of a large-ish file (over 10KB), then cryptographic hashes are both fast and "perfect" in this physical sense. You will never get a collision with SHA256 or SHA512, even including adversarial, crafted inputs. The same "strength" attribute is not valid for CRC codes, they're vulnerable to deliberate corruption by an attacker.
So in that sense, SHA hashes are stronger than CRC checksums.
The birthday bound says that a 160-bit perfect cryptographic hash is expected to produce a collision somewhere among roughly 2^80 random inputs. This means that an 80-bit burst error would probabilistically contain a potential SHA1 collision. (An 80-bit burst error doesn't mean that all the bits are flipped, btw: it means that 80 bits have been randomized.)
In contrast, CRC is designed specifically against burst errors. CRC is "regular" and "tweaked" in such a way that a 160-bit CRC would be immune to 160-bit burst errors of any and all kinds!
So if you care about burst errors, then CRC is in fact, better, than crypto-level hashes. And in practice, burst errors are the primary error that occurs in practice (scratches on a CD-ROM, bad sectors on a hard drive, lightning storm cuts out a few microseconds of WiFi, etc. etc.)
That is: noise isn't random in the real world. Noise is "clustered" around bursty events in practice.
--------
If burst errors are king, you can do far, far better than random methodologies. CRC is proof of that. That's why error distributions matter.
It only applies if you're comparing a large set of samples against each other. An example would be a "content-based indexing" system where a database primary key is the hash. Every insert then compares the hash against every entry that already exists. If there are 1 billion stored items, each 1 insert can have a potential collision with all 1 billion.
For validation, you have 1 input being compared against 1 valid value (or its hash/crc). There's no "billion inputs" in this scenario... just 1 potentially corrupt vs 1 known good.
Hence, no birthday attack.
It's the difference between two random people meeting and having the same birthday, versus any two people in a room full of people having the same birthday. Not the same scenario!
In practice, cryptographic hashes are always superior to checksums once both have more than 128 bits. They're both strong enough, but the cryptographic hash is resistant to deliberate attacks. The CRC won't be.
And a CRC can also give guaranteed distance properties -- e.g. you can guarantee that all corruption of Hamming distance < D within a message of size < W will be detected -- which is a massive improvement in performance when that's what you need to catch.
It (and particularly truncated hashes made from it) is not _guaranteed_ to detect small corruptions. It will falsely admit 1 in 2^bits. A well-matched CRC can give guaranteed detection even with a very short check value.
True, but one must not forget that the good error detection properties of CRCs are guaranteed only for limited data lengths.
The CRCs of usual sizes, up to 32 bits, are perfectly fine for data sizes in the kilobyte range, but they become poor at megabyte sizes and completely useless at gigabyte sizes.
The CRCs have their domain of application, for which they are excellent, but sometimes they are misused, e.g. in various archive file formats, where the archive integrity is checked with a CRC that is too short instead of a hash of appropriate length, e.g. SHA-1 or SHA-256. The longer hashes may be truncated, e.g. to 128 bits, which is likely enough for the current sizes of files and file systems, but 64 bits or less is not enough for the kind of files that are typically encountered now.
The size needed for an error detection code depends on the size of the data from which it is computed, on the error rate, and on the time that passes between when the error detection code is generated and when it is verified.
With data sent through communication channels, the error rate may be high, but the data reaches its destination after a few seconds. On an archival storage medium, the error rate may be very low, but you might check for corruption only some years later.
If you want to decide which error detection length is acceptable, the best way would be to run a simulation: take random data of the expected length, apply various random errors in the expected numbers, and then test the alternative error detection codes to see whether they detect the errors or not.
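A rough sketch of that kind of simulation, where CRC-32 stands in for whichever detection code is being evaluated and a handful of random bit flips stand in for the assumed error model (swap in burst errors, other counts, etc. as needed):

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define LEN    4096     /* data size under test (bytes) */
#define FLIPS  8        /* random bit errors injected per trial */
#define TRIALS 10000

/* Placeholder detection code: standard CRC-32, bit by bit. */
static uint32_t crc32_check(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

int main(void)
{
    static unsigned char buf[LEN];
    long undetected = 0;

    for (int t = 0; t < TRIALS; t++) {
        for (size_t i = 0; i < LEN; i++)
            buf[i] = (unsigned char)rand();
        uint32_t before = crc32_check(buf, LEN);

        for (int f = 0; f < FLIPS; f++)        /* inject random bit errors   */
            buf[rand() % LEN] ^= (unsigned char)(1u << (rand() % 8));

        if (crc32_check(buf, LEN) == before)   /* note: flips may cancel out */
            undetected++;
    }
    printf("%ld undetected corruptions in %d trials\n", undetected, TRIALS);
    return 0;
}
```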
I expect that for data sizes of a few megabytes a 64-bit CRC should be OK.
From my experiments with gigabyte-sized files, 32-bit CRCs were completely inadequate, 128-bit hashes were OK and 64-bit CRCs were inconclusive.
(By OK and inconclusive, I mean that 128-bit hashes detected all the errors generated, but I have seen 2 cases not detected by CRC-64 in several tens of millions of tests at a 5 GB file size, which was more than expected but statistically insignificant.)
The speed of 128-bit to 256-bit non-CRC hashes, e.g. Poly1305, SHA-1, SHA-256 or BLAKE3, is very high on modern CPUs: much faster than any CRC implementation that does not use PCLMUL or equivalent instructions, and of similar speed to a 64-bit CRC with PCLMUL once the overhead for data movement is accounted for. Because of that, and because there is very little published information about the properties of the few known 64-bit CRCs, I have not made any extra experiments to confirm whether a 64-bit CRC might be adequate; using any 128-bit hash was more certain to give good results, without any disadvantages.
On the other hand, for a hardware implementation it would make sense to determine whether a 64-bit CRC is good enough for a certain application.
The "expensive" trick is to go full Reed Solomon error correction if you really care about data (aka: RAID5, RAID6, etc. etc.)... which is built up from the same theory as CRCs except more comprehensive at errors (fulfilling error correction as well).
> If you want to decide which error detection length is acceptable, the best way would be to run a simulation: take random data of the expected length, apply various random errors in the expected numbers, and then test the alternative error detection codes to see whether they detect the errors or not.
If that error is uniformly random, then a 128-bit sum is optimal and no different than 128-bit hash.
The reason we have this debate at all, is because different groups of people make different assumptions about the distribution of errors.
EDIT: I think it is reasonable to assume that burst-errors would occur on a storage medium: perhaps per-sector (Hard Drive) or per-block (SSD). Reed Solomon when applied to RAID assumes that an entire hard drive fails all at once and all together, for example.
As such: the key when discussing checksums / error detection / error correction is figuring out exactly how errors occur, and what to do about them (aka: detect vs correct).
The more modern schools of thought are to force errors to look as random as possible (hash and LDPC), while older schools of thought are to analyze the nature of errors and focus on them (CRC and Reed Solomon)
> If that error is uniformly random, then a 128-bit sum is optimal and no different than 128-bit hash.
If the data you read is uniformly random, then what you say is correct. But if there is a single error (bit/byte/word, whatever) which is placed uniformly at random and takes a uniformly random value, then other codes are vastly stronger than using a hash.
While what you say is technically correct-- that a 'uniformly random error' means replacing the input with an entirely independent input, that isn't what most people would call an 'error' and isn't the kind of error that most applications of a check value would detect.
Also, it must be kept in mind that CRCs, i.e. the remainders of polynomial division, are not the only simple way to obtain predictable error properties.
A simple alternative is to evaluate the polynomial corresponding to the data at a certain point and use the value as the error detection code. The properties about short bursts are identical.
Polynomial evaluation is advantageous over a CRC at large hash sizes, because it matters very little at which point you choose to evaluate the polynomial, as long as that point is neither zero nor one.
With CRCs, the choice of the primitive polynomial to be used as divisor is critical and you cannot choose just a random 128-bit sequence and hope that it happens to correspond to a primitive polynomial and also to one with good error detection properties.
For example, if you want guaranteed error-burst detection for a data size where the known 32-bit or 64-bit CRCs are inadequate, you may use the Poly1305 evaluation functions available in most cryptographic libraries for data authentication.
When used in cryptography, the point chosen for the polynomial evaluation must be secret, but for error detection it may be fixed and public, and you get error detection properties equivalent to those of a 128-bit CRC.
An optimized 128-bit CRC might be a little better, but good luck finding the right 128-bit primitive polynomial.
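To make the idea concrete, here is a small sketch of a polynomial-evaluation check in the Poly1305 spirit, but over a smaller field: the data bytes are treated as coefficients and the polynomial is evaluated with Horner's rule at a fixed public point R modulo the Mersenne prime 2^61 - 1. The modulus, the point R, and the +1 offset are illustrative assumptions (Poly1305 itself works modulo 2^130 - 5 and produces 128-bit tags), and GCC/Clang's unsigned __int128 is assumed for the 64x64 multiply:

```c
#include <stdint.h>
#include <stddef.h>

#define MOD ((1ull << 61) - 1)                 /* Mersenne prime 2^61 - 1 */
#define R   (0x1234567890ABCDEFull % MOD)      /* fixed public evaluation point (arbitrary) */

/* Multiply modulo 2^61 - 1, using the fast Mersenne reduction 2^61 == 1. */
static uint64_t mulmod(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    uint64_t r = ((uint64_t)p & MOD) + (uint64_t)(p >> 61);
    return r >= MOD ? r - MOD : r;
}

/* Horner evaluation of the data polynomial at R; the +1 makes every byte
 * (including trailing zeros) and the message length affect the result. */
uint64_t polyeval_check(const unsigned char *buf, size_t len)
{
    uint64_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = (mulmod(h, R) + buf[i] + 1) % MOD;
    return h;
}
```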
CRC's are many times faster when implemented in hardware, e.g. in a network interface card that checks the CRC of Ethernet data packets.
When computing a CRC in software, in a CPU that has instructions for both CRC and for SHA-1/SHA-256, e.g. a modern Intel/AMD CPU or a 64-bit ARM CPU, there remains a speed difference, but in most cases reading the data from the memory or from a SSD/HDD takes so much time that the difference in speed is negligible.
Moreover, until very recently most utilities that compute CRC's over files, e.g. UNIX cksum, did not use the special instructions for polynomial multiplication, but they used generic algorithms with precomputed tables.
The consequence was that for example most versions of cksum are much slower than the corresponding versions of sha1sum.