
Err, you are just restating what I’m saying, without addressing the concerns.

1 - Is it fair to use RAM in two places and report only one of them without any asterisk? (If you think this is fair - oh boy, wait till you hear about my 0GB-HBM inference algorithm.)

2 - I know how sub-channel quantization works (roughly the rescale step sketched below). Are they hitting those reported latency numbers with per-layer CPU ping-pong to rescale?
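For anyone following along, this is roughly the rescale in question - a minimal sketch of group-wise (sub-channel) dequantization with per-group scale/zero metadata. Names and shapes are illustrative, not the actual HQQ code:

```python
import torch

def dequantize_groupwise(w_q, scale, zero, group_size=64):
    # w_q:   low-bit weight codes,   shape (out_features, in_features)
    # scale: per-group scale,        shape (out_features, in_features // group_size)
    # zero:  per-group zero point,   shape (out_features, in_features // group_size)
    out_f, in_f = w_q.shape
    w_q = w_q.reshape(out_f, in_f // group_size, group_size).float()
    # The "rescale": each small group of weights needs its own scale/zero metadata,
    # which is exactly the data that has to live somewhere (GPU or CPU).
    w = (w_q - zero.unsqueeze(-1)) * scale.unsqueeze(-1)
    return w.reshape(out_f, in_f)
```

If that metadata sits in CPU RAM, every dequant needs a host-to-device copy, hence the ping-pong question.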



Oh sorry - did not mean to restate what you said, whoops - lost my train of thought!

You can see from https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_1bi... that the model takes 3GB of disk space plus 100MB of LoRA weights. I also uploaded a 4-bit one to https://huggingface.co/unsloth/llama-2-7b-bnb-4bit/tree/main which uses 3.87GB.

So because of the offloading trick, the reported GPU VRAM is lower, but in actuality the full 3GB is still needed somewhere.

Unsure about the latency, sadly.


Hey Daniel! The VRAM is still the same as what a pure n-bit model would take. Because we only need the meta-data for a single nn.Linear at a time, you only need an additional (3GB - 1.7GB)/224 ≈ 5.8MB. If we compress the meta-data as well, that would become much lower.
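Roughly what that looks like, assuming the quantized codes stay in VRAM while the scale/zero metadata lives in pinned CPU memory and is copied over one nn.Linear at a time (illustrative sketch, not the actual HQQ implementation):

```python
import torch
import torch.nn as nn

class OffloadedQuantLinear(nn.Module):
    """Quantized linear layer whose scale/zero metadata stays on the CPU.

    Only the metadata of the layer currently executing is copied to the GPU,
    so the steady-state VRAM overhead is ~one layer's metadata
    ((3GB - 1.7GB) / 224 layers ≈ 5.8MB in the numbers above), not 1.3GB.
    """
    def __init__(self, w_q, scale, zero, group_size=64):
        super().__init__()
        self.register_buffer("w_q", w_q.cuda())  # low-bit codes stay in VRAM
        self.scale = scale.pin_memory()          # metadata stays in CPU RAM
        self.zero = zero.pin_memory()
        self.group_size = group_size

    def forward(self, x):
        # Copy just this layer's metadata (a few MB) to the GPU for the dequant.
        scale = self.scale.to(x.device, non_blocking=True)
        zero = self.zero.to(x.device, non_blocking=True)
        out_f, in_f = self.w_q.shape
        w = self.w_q.reshape(out_f, -1, self.group_size).float()
        w = ((w - zero.unsqueeze(-1)) * scale.unsqueeze(-1)).reshape(out_f, in_f)
        return x @ w.t()
```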


Hey :)) Oh I love the idea of async movement from CPU to GPU ahead of time - ingenious! Prefetching a small amount of metadata seems reasonable and very smart!
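A rough sketch of what that ahead-of-time prefetch could look like, assuming pinned CPU metadata and a dedicated copy stream so the next layer's copy overlaps with the current layer's compute (hypothetical attribute names, not the actual implementation):

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to metadata copies

def prefetch_metadata(layer):
    # Hypothetical attributes: `scale`/`zero` are pinned CPU tensors,
    # `scale_gpu`/`zero_gpu` are what the layer's forward() would read.
    with torch.cuda.stream(copy_stream):
        layer.scale_gpu = layer.scale.to("cuda", non_blocking=True)
        layer.zero_gpu = layer.zero.to("cuda", non_blocking=True)

def forward_with_prefetch(layers, x):
    events = [torch.cuda.Event() for _ in layers]
    prefetch_metadata(layers[0])
    events[0].record(copy_stream)
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            # Start copying the *next* layer's metadata while this one computes.
            prefetch_metadata(layers[i + 1])
            events[i + 1].record(copy_stream)
        # Wait only for this layer's metadata copy before using it.
        torch.cuda.current_stream().wait_event(events[i])
        x = layer(x)
    return x
```

Since each layer's metadata is only ~5.8MB, the copy should hide comfortably behind the matmul on most hardware.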



