
For me, the most interesting part is the statistics on all the models. They show that 8-bit quantization is basically as good as the full model, and 4-bit is very close. This is the first time I've seen such a table covering a large number of models in one place.
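If you want to sanity-check what "basically as good" looks like numerically, here is a minimal round-to-nearest sketch (the random weight matrix, group size, and error metric are my own assumptions, not from the linked table) showing the gap in reconstruction error between 8-bit and 4-bit rounding. The absolute numbers depend on the weight distribution; the point is the order-of-magnitude gap between the two bit widths.

    # Minimal round-to-nearest quantization sketch (illustrative only;
    # schemes like GPTQ add error compensation on top of plain rounding).
    import numpy as np

    def quantize_rtn(w, bits, group_size=128):
        # Symmetric per-group round-to-nearest quantization.
        qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit, 127 for 8-bit
        w = w.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / qmax
        q = np.clip(np.round(w / scale), -qmax, qmax)  # integer codes
        return (q * scale).reshape(-1)                 # dequantized weights

    rng = np.random.default_rng(0)
    w = rng.normal(size=4096 * 4096).astype(np.float32)  # made-up weight matrix

    for bits in (8, 4):
        w_hat = quantize_rtn(w, bits)
        rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
        print(f"{bits}-bit relative error: {rel_err:.4f}")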



Pretty much.

Llama specific:

https://github.com/qwopqwop200/GPTQ-for-LLaMa

> According to the GPTQ paper, as the size of the model increases, the difference in performance between FP16 and GPTQ decreases.

https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...

https://docs.google.com/document/d/1wZ0g9rHI-6s7ctNlykuK4W5T...

Expect to get away with a factor of 4-5 reduction in memory usage for a minimal loss of quality. :)
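The 4-5x figure follows from the storage arithmetic alone. A rough back-of-the-envelope (my own assumptions: FP16 baseline, one FP16 scale per group of 128 weights, KV cache and activations ignored; the 3-bit row just covers the upper end of the range):

    # Back-of-the-envelope weight-memory estimate.
    def weight_gib(n_params, bits_per_weight, group_size=None):
        bits = n_params * bits_per_weight
        if group_size:                              # per-group FP16 scales
            bits += (n_params / group_size) * 16
        return bits / 8 / 1024**3

    for n_params in (7e9, 13e9, 65e9):              # LLaMA-ish sizes
        fp16 = weight_gib(n_params, 16)
        for bits in (4, 3):
            q = weight_gib(n_params, bits, group_size=128)
            print(f"{n_params / 1e9:>4.0f}B @ {bits}-bit: {q:5.1f} GiB "
                  f"(vs {fp16:.1f} GiB fp16, ~{fp16 / q:.1f}x smaller)")

For a 7B model that works out to roughly 13 GiB in FP16 versus about 3.4 GiB at 4-bit (~3.9x) and about 2.6 GiB at 3-bit (~5.1x), which is where the 4-5x range comes from.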



