The paper cited is about hardware, where there is no accuracy tradeoff because you control the numerical precision completely and use fixed point. In a software implementation, neither of those holds. There is no chance you will get exactly the same values out of this method as you get out of other floating-point matmuls.
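A minimal sketch of that point, assuming the alternative algorithm is something like Winograd's 1968 inner-product rearrangement (my assumption; the repo may use a different variant): the rearranged sum accumulates rounding error differently from the conventional dot product, so in float32 the two are algebraically equal but generally not bit-identical.

```python
import numpy as np

# Illustrative only: conventional float32 dot product vs. a Winograd-style
# rearranged one (assumed stand-in for the algorithm under discussion).
rng = np.random.default_rng(0)
n = 4096
a = rng.standard_normal(n).astype(np.float32)
b = rng.standard_normal(n).astype(np.float32)

conv = np.float32(0.0)
for i in range(n):
    conv += a[i] * b[i]                              # n multiplications

rearr = np.float32(0.0)
for j in range(0, n, 2):                             # ~n/2 "cross" multiplications
    rearr += (a[j] + b[j + 1]) * (a[j + 1] + b[j])
    rearr -= a[j] * a[j + 1] + b[j] * b[j + 1]       # correction terms

print(conv, rearr)    # typically close, but not bit-identical
print(conv == rearr)  # usually False in float32
```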
This repository contains the source code for ML hardware architectures that
require nearly half the number of multiplier units to achieve the same
performance, by executing alternative inner-product algorithms that trade
nearly half the multiplications for cheap low-bitwidth additions, while still
producing output identical to that of the conventional inner product.
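For context, the best-known algorithm in this family is Winograd's 1968 inner-product formula; whether the repo uses exactly this variant is my assumption, but it shows how the trade works: input elements are paired and added first (these are the cheap low-bitwidth additions, since they act on inputs rather than full-width products), the paired sums are multiplied, and per-operand correction terms are subtracted. The corrections depend on only one operand each, so in a matrix multiply they can be precomputed per row/column and amortized, leaving roughly half the multiplications. In integer or fixed-point arithmetic the result matches the conventional inner product exactly.

```python
# Sketch of a multiplication-saving inner product (Winograd-style pairing),
# assumed here as representative of the "alternative inner-product algorithms"
# the README describes. With integer / fixed-point inputs the output is
# identical to the conventional inner product.
def conventional_dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))       # n multiplications

def paired_dot(x, y):
    assert len(x) == len(y) and len(x) % 2 == 0
    half = len(x) // 2
    cross = sum((x[2*j] + y[2*j + 1]) * (x[2*j + 1] + y[2*j])
                for j in range(half))                  # n/2 multiplications
    # Correction terms depend on only one operand each; in a matrix multiply
    # they are precomputed once per row/column and reused.
    corr_x = sum(x[2*j] * x[2*j + 1] for j in range(half))
    corr_y = sum(y[2*j] * y[2*j + 1] for j in range(half))
    return cross - corr_x - corr_y

x = [3, -1, 4, 1, 5, -9]
y = [2, 6, -5, 3, 5, 8]
assert paired_dot(x, y) == conventional_dot(x, y)      # exact match on integers
```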