
It seems the trick here is that they first quantize the model to 1- or 2-bit, and then fine-tune the quantization bias parameters (the parameters that dequantize from 1-2 bit back to 16 bit) via LoRA. Then they have specialized kernels to do the matrix multiplication at the bit level.
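A rough sketch of what that structure could look like (the class name, group size, and shapes here are my own illustrative assumptions, not the actual implementation): the packed low-bit codes stay frozen, while the per-group scale and zero-point used to dequantize them are ordinary trainable parameters, so a LoRA-style fine-tuning pass can adjust them.

    import torch
    import torch.nn as nn

    class DequantLinear(nn.Module):
        """Frozen low-bit codes; only the dequantization parameters get gradients."""
        def __init__(self, q_codes, scale, zero, group_size=64):
            super().__init__()
            self.register_buffer("q_codes", q_codes)  # int codes in {0..3}, frozen
            self.scale = nn.Parameter(scale)          # per-group scale, trainable
            self.zero = nn.Parameter(zero)            # per-group zero-point, trainable
            self.group_size = group_size

        def dequant(self):
            out_f, in_f = self.q_codes.shape
            g = self.q_codes.float().reshape(out_f, -1, self.group_size)
            w = (g - self.zero.unsqueeze(-1)) * self.scale.unsqueeze(-1)
            return w.reshape(out_f, in_f)

        def forward(self, x):
            # A real kernel would multiply against the packed bits directly;
            # this reference version dequantizes to float first.
            return x @ self.dequant().t()

The point is just that the codes themselves never change; only the small set of dequantization parameters (plus any LoRA adapters) does.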

Also, the 2-bit model seems much better than the 1-bit model. They use [-1, 0, 1, 2] as the 2-bit levels, and I wonder whether the '2' is needed in light of the 1.58-bit paper (which claims -1 is definitely needed).
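With four levels that is exactly 2 bits per weight, so four weights pack into one byte. A toy pack/unpack routine (the code-to-level mapping [-1, 0, 1, 2] is taken from the comment above; everything else is just for illustration):

    import numpy as np

    LEVELS = np.array([-1, 0, 1, 2], dtype=np.int8)  # code 0..3 -> weight level

    def pack_2bit(codes):
        # codes: uint8 array of values in {0,1,2,3}, length divisible by 4
        c = codes.reshape(-1, 4)
        return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

    def unpack_2bit(packed):
        shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
        codes = (packed[:, None] >> shifts) & 0b11
        return LEVELS[codes.reshape(-1)]

    codes = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=np.uint8)
    assert (unpack_2bit(pack_2bit(codes)) == LEVELS[codes]).all()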



Interesting, and it kinda makes sense. You quantize, which invariably means you lose some precision, but then you can finetune post-quantization to recover at least some of it. Neat idea.
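That recovery step can be pictured as minimizing each quantized layer's output error against the full-precision original on a bit of calibration data, with gradients flowing only into the dequantization parameters (or LoRA adapters). A toy version of that loop, reusing the hypothetical DequantLinear sketch above (names and hyperparameters are illustrative assumptions):

    import torch
    import torch.nn.functional as F

    def recover(layer_fp, layer_q, calib_x, steps=200, lr=1e-3):
        # layer_fp: frozen full-precision nn.Linear used as the reference
        # layer_q:  DequantLinear with frozen codes, trainable scale/zero
        opt = torch.optim.Adam(layer_q.parameters(), lr=lr)
        with torch.no_grad():
            target = layer_fp(calib_x)
        for _ in range(steps):
            loss = F.mse_loss(layer_q(calib_x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return loss.item()  # residual error after calibration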


Which is itself a little counterintuitive, as the arXiv papers they cite say models need to be pretrained from the ground up at 1- or 2-bit (or 1.58-bit) precision. It definitely adds some interesting data points for the open-source community, which is experimenting in every possible direction.



