
Err, you are just restating what I’m saying, without addressing the concerns.

1 - Is it fair to use RAM in two places and report only one of them without any asterisk? (If you think this is fair - oh boy, wait till you hear about my 0GB-HBM inference algorithm.)

2 - I know how sub-channel quantization works (roughly the rescale step sketched below). Are they hitting those reported latency numbers with per-layer CPU ping-pong to rescale?
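For anyone following along, this is roughly the rescale in question - a minimal sketch of group-wise (sub-channel) dequantization with per-group scale/zero metadata. Names and shapes are illustrative, not the actual HQQ code:

```python
import torch

def dequantize_groupwise(w_q, scale, zero, group_size=64):
    # w_q:   low-bit weight codes,   shape (out_features, in_features)
    # scale: per-group scale,        shape (out_features, in_features // group_size)
    # zero:  per-group zero point,   shape (out_features, in_features // group_size)
    out_f, in_f = w_q.shape
    w_q = w_q.reshape(out_f, in_f // group_size, group_size).float()
    # The "rescale": each small group of weights needs its own scale/zero metadata,
    # which is exactly the data that has to live somewhere (GPU or CPU).
    w = (w_q - zero.unsqueeze(-1)) * scale.unsqueeze(-1)
    return w.reshape(out_f, in_f)
```

If that metadata sits in CPU RAM, every dequant needs a host-to-device copy, hence the ping-pong question.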



Oh sorry - did not mean to restate what you said, whoops - lost my train of thought!

You can see from https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_1bi... that the model takes 3GB of disk space plus 100MB of LoRA weights. I also uploaded a 4-bit one to https://huggingface.co/unsloth/llama-2-7b-bnb-4bit/tree/main which uses 3.87GB.

So because of the offloading trick, the reported GPU VRAM is lower, but in actuality the full 3GB is still needed somewhere.

Unsure about the latency, sadly.


Hey Daniel! The VRAM is still the same as what a pure n-bit model would take. Because we only need the meta-data for a single nn.Linear at a time, you only need an additional (3GB - 1.7GB)/224 ≈ 5.8MB. If we compress the meta-data as well, that would become much lower.
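Roughly what that looks like, assuming the quantized codes stay in VRAM while the scale/zero metadata lives in pinned CPU memory and is copied over one nn.Linear at a time (illustrative sketch, not the actual HQQ implementation):

```python
import torch
import torch.nn as nn

class OffloadedQuantLinear(nn.Module):
    """Quantized linear layer whose scale/zero metadata stays on the CPU.

    Only the metadata of the layer currently executing is copied to the GPU,
    so the steady-state VRAM overhead is ~one layer's metadata
    ((3GB - 1.7GB) / 224 layers ≈ 5.8MB in the numbers above), not 1.3GB.
    """
    def __init__(self, w_q, scale, zero, group_size=64):
        super().__init__()
        self.register_buffer("w_q", w_q.cuda())  # low-bit codes stay in VRAM
        self.scale = scale.pin_memory()          # metadata stays in CPU RAM
        self.zero = zero.pin_memory()
        self.group_size = group_size

    def forward(self, x):
        # Copy just this layer's metadata (a few MB) to the GPU for the dequant.
        scale = self.scale.to(x.device, non_blocking=True)
        zero = self.zero.to(x.device, non_blocking=True)
        out_f, in_f = self.w_q.shape
        w = self.w_q.reshape(out_f, -1, self.group_size).float()
        w = ((w - zero.unsqueeze(-1)) * scale.unsqueeze(-1)).reshape(out_f, in_f)
        return x @ w.t()
```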


Hey :)) Oh I love the idea of async movement from CPU to GPU ahead of time - ingenious! Prefetching a small amount of metadata seems reasonable and very smart!
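A rough sketch of what that ahead-of-time prefetch could look like, assuming pinned CPU metadata and a dedicated copy stream so the next layer's copy overlaps with the current layer's compute (hypothetical attribute names, not the actual implementation):

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream dedicated to metadata copies

def prefetch_metadata(layer):
    # Hypothetical attributes: `scale`/`zero` are pinned CPU tensors,
    # `scale_gpu`/`zero_gpu` are what the layer's forward() would read.
    with torch.cuda.stream(copy_stream):
        layer.scale_gpu = layer.scale.to("cuda", non_blocking=True)
        layer.zero_gpu = layer.zero.to("cuda", non_blocking=True)

def forward_with_prefetch(layers, x):
    events = [torch.cuda.Event() for _ in layers]
    prefetch_metadata(layers[0])
    events[0].record(copy_stream)
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            # Start copying the *next* layer's metadata while this one computes.
            prefetch_metadata(layers[i + 1])
            events[i + 1].record(copy_stream)
        # Wait only for this layer's metadata copy before using it.
        torch.cuda.current_stream().wait_event(events[i])
        x = layer(x)
    return x
```

Since each layer's metadata is only ~5.8MB, the copy should hide comfortably behind the matmul on most hardware.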



