
Not a stupid question at all.

Early quantization approaches just quantized a high-bit-precision pre-trained model after training, so there was no need to calculate gradients through the quantized weights. BitNet[1] changed the game by training a low-precision model from scratch. It achieves this by keeping high precision in the gradients, the optimizer state, and in "latent weights", which are quantized on the fly during the forward pass. I don't really understand the finer details of how this works, so check out the paper if you're interested.
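
To make the latent-weight idea concrete, here's a minimal PyTorch-style sketch (my own simplification, not BitNet's actual code): the full-precision weight is what the optimizer updates, the forward pass uses a binarized copy, and a straight-through estimator routes the gradient back to the latent weight as if the quantizer were the identity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BitLinearSketch(nn.Module):
        # High-precision "latent" weights, binarized on the fly each forward pass.
        def __init__(self, in_features, out_features):
            super().__init__()
            # this full-precision tensor is what the optimizer actually updates
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

        def forward(self, x):
            w = self.weight
            alpha = w.abs().mean()                    # per-tensor scale
            w_bin = torch.sign(w - w.mean()) * alpha  # 1-bit weights, values +/- alpha
            # straight-through estimator: forward uses w_bin,
            # backward sends the gradient straight to the latent weight w
            w_q = w + (w_bin - w).detach()
            return F.linear(x, w_q)

    # usage: gradients land on the full-precision latent weights
    layer = BitLinearSketch(16, 8)
    layer(torch.randn(4, 16)).sum().backward()
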

This article's approach is interesting. They start by quantizing a pre-trained high-precision model, and then fine-tune the quantized model with LoRA, which dramatically recovers the quantized model's performance. They don't state the bit depth of the values in the LoRA matrices, so those may well be kept at higher precision.
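
A hedged sketch of what "frozen quantized base plus higher-precision LoRA matrices" could look like; the class, rank, and alpha here are illustrative assumptions, not the article's implementation:

    import torch
    import torch.nn as nn

    class LoRAOnFrozenBase(nn.Module):
        # Frozen (already-quantized) base layer plus small trainable LoRA matrices.
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base                      # assumed quantized elsewhere
            for p in self.base.parameters():
                p.requires_grad = False           # only the adapters get trained
            in_f, out_f = base.in_features, base.out_features
            # LoRA matrices kept in full precision (fp32) in this sketch
            self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
            self.B = nn.Parameter(torch.zeros(out_f, rank))  # zero init: starts as a no-op
            self.scale = alpha / rank

        def forward(self, x):
            # quantized base output plus the low-rank, higher-precision correction
            return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
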

[1] https://arxiv.org/pdf/2310.11453.pdf


