We used CPU intrinsics in version 9 of GNU coreutils, for the cksum utility. From the release notes:
"cksum [-a crc] is now up to 4 times faster by using a slice by 8 algorithm, and at least 8 times faster where pclmul instructions are supported."
Implementing that portably is a bit tricky, as one must consider:
- supporting various compilers, some of which may not support intrinsics
- runtime checks to see whether the current CPU supports the instructions
- ensuring that the compiler options enabling the instructions are restricted to their own lib, so they don't leak into unprotected code
- automake requires using a separate lib for this rather than just a separate compilation unit
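To make the runtime-check point concrete, here is a minimal sketch of the dispatch pattern, not the actual coreutils code. It assumes GCC or Clang on x86 for `__builtin_cpu_supports`; the names `crc_update_portable`, `have_pclmul` and `select_crc_update` are made up for illustration, and the accelerated routine is left out, so both branches return the portable one.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable bit-at-a-time CRC-32 update (polynomial 0x04C11DB7, MSB first),
   the polynomial POSIX cksum uses. The length suffix and final complement
   that cksum applies are deliberately omitted from this sketch. */
static uint32_t crc_update_portable(uint32_t crc, const unsigned char *buf,
                                    size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)buf[i] << 24;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
    return crc;
}

/* Returns nonzero if the running CPU advertises PCLMUL.
   Assumption: GCC/Clang builtins on x86; everywhere else report "no". */
static int have_pclmul(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("pclmul") != 0;
#else
    return 0;
#endif
}

typedef uint32_t (*crc_update_fn)(uint32_t, const unsigned char *, size_t);

/* Pick an implementation once, at startup. */
static crc_update_fn select_crc_update(void)
{
    if (have_pclmul()) {
        /* A real build would return a PCLMUL routine here, compiled in its
           own library with -mpclmul. This sketch has none, so it falls
           back to the portable code. */
        return crc_update_portable;
    }
    return crc_update_portable;
}
```

A caller would do `crc_update_fn update = select_crc_update();` once and then call `update(crc, buf, len)` per buffer. Since only the selected function is ever executed, code built with the special flags never runs on a CPU that lacks the instructions.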
A CRC is polynomial arithmetic over a Galois field, GF(2), and today's CPUs have polynomial multiply (aka PMULL on ARM) or carry-less multiply (aka PCLMULQDQ on x86). These instructions can implement the "tough" part of the CRC in a single instruction.
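For anyone unfamiliar with what a carry-less multiply computes, here is a portable sketch in plain C. It is only an illustration of the operation (the hardware instructions work on 64-bit operands and produce a 128-bit product), and `clmul32` is a made-up name.

```c
#include <stdint.h>

/* Carry-less multiply of two 32-bit polynomials over GF(2).
   Each set bit i of b contributes (a << i), and the partial products are
   combined with XOR instead of addition, so no carries propagate. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t product = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1u)
            product ^= (uint64_t)a << i;
    return product;
}
```

For example, `clmul32(3, 3)` is 5 rather than 9, because (x + 1)(x + 1) = x^2 + 1 when coefficients are taken mod 2. The loop above needs 32 iterations per product, which is the work the hardware instruction collapses.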
The C code looks really nice and very generic, and since it is a generator, would it be too difficult to make use of those CPU instructions during code generation? It could then even create architecture-specific code, which looks like a plus to me.