So this misses a few aspects of why the method works:
- You can't actually get a speedup from the proposed approach. You'd need a lookup table with 2^(2b) entries to multiply two b-bit numbers, which will be much slower than just doing the multiplication (see the first sketch below).
- Most of our speedup comes from performing fewer operations. We're actually slower per op than multiply-adds, because those have much better hardware support; we just compress the matrices by so much that it's worth it. Relatedly, with sufficiently few codebooks we're actually sublinear in the input size--i.e., we don't even read all of it (second sketch below).
- We have some lightweight machine learning in there to make the sublinearity and other approximations not suck (third sketch below). Getting that ML fast enough to beat a BLAS GEMM is far from trivial.
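To make the first point concrete, here's a toy illustration (mine for this comment, not part of the method): a lookup table that replaces a b-bit multiply has to cover every operand pair, so it needs 2^b * 2^b = 2^(2b) entries, and every "multiply" becomes a memory access that usually loses to the hardware multiplier.

```python
# Toy LUT-based multiply for b-bit operands. The table has 2^(2b)
# entries: 65,536 for b=8, ~4 billion for b=16. A hardware multiply
# takes a few cycles; a table this size mostly lives outside cache.
b = 8
n = 1 << b
mul_table = [[x * y for y in range(n)] for x in range(n)]

def lut_mul(x: int, y: int) -> int:
    return mul_table[x][y]

assert lut_mul(13, 21) == 13 * 21
```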
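For the flavor of the second point, here's a much-simplified product-quantization-style sketch (not our actual code; the sizes and the nearest-prototype encoder are made up for illustration). The payoff is the last two lines: once a vector is encoded as C codebook ids, an approximate dot product is C table lookups and adds instead of D multiply-adds.

```python
import numpy as np

D, C, K = 512, 8, 16                      # dims, codebooks, prototypes
rng = np.random.default_rng(0)
protos = rng.normal(size=(C, K, D // C))  # learned offline in reality

def encode(x):
    # Nearest prototype per subspace -> C small integers.
    subs = x.reshape(C, D // C)
    return ((subs[:, None, :] - protos) ** 2).sum(-1).argmin(1)

def build_luts(q):
    # Partial dot products of q with every prototype: C*K scalars,
    # amortized across all encoded vectors q gets compared against.
    return (protos * q.reshape(C, 1, D // C)).sum(-1)

x, q = rng.normal(size=D), rng.normal(size=D)
codes, luts = encode(x), build_luts(q)

# C lookups + adds (8 here) instead of D multiply-adds (512). With a
# hashing-style encoder that inspects only a few features per codebook,
# you don't even read all of x -- hence the sublinearity claim.
approx = luts[np.arange(C), codes].sum()
print(approx, x @ q)
```

With random prototypes like these the approximation is bad; making it not bad is exactly what the learning in the next point is for.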
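As for the "lightweight ML": a generic stand-in for it (vanilla per-subspace k-means, not what we actually ship) looks like the sketch below. The offline training is the easy part; the hard part the bullet is pointing at is making the runtime encoding cheap enough that the whole pipeline beats a tuned GEMM.

```python
import numpy as np

def learn_protos(X, C, K, iters=10):
    # Vanilla per-subspace k-means: X is (N, D); returns (C, K, D//C).
    N, D = X.shape
    subs = X.reshape(N, C, D // C).transpose(1, 0, 2)  # (C, N, D//C)
    protos = subs[:, :K].copy()                        # crude init
    for _ in range(iters):
        for c in range(C):
            d = ((subs[c][:, None, :] - protos[c]) ** 2).sum(-1)
            assign = d.argmin(1)                       # (N,) cluster ids
            for k in range(K):
                members = subs[c][assign == k]
                if len(members):
                    protos[c, k] = members.mean(0)
    return protos

rng = np.random.default_rng(0)
protos = learn_protos(rng.normal(size=(1000, 512)), C=8, K=16)
```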