Err, you are just restating what I’m saying, without addressing the concerns.
1 - is it fair to use ram in two places and report only one of them without any asterisk? (If you think this is fair-oh boy wait till you hear about my 0GB hbm use inference algorithm)
2 - i know how subchannel quantization works. Are they hitting those reported latency numbers with per layer cpu pingpong to rescale?
Hey Daniel! The VRAM is still the same as a pure n-bit model would take. Because we only need meta-data for a single nn.Linear at a time, you only need an additional (3GB-1.7GB)/224 = 5.8MB. If we compress the meta-data as well that would become much lower.
Hey :)) Oh I love the idea of async movement from CPU to GPU ahead of time - ingenious! Prefetching a small amount of metadata seems reasonable and very smart!
1 - is it fair to use ram in two places and report only one of them without any asterisk? (If you think this is fair-oh boy wait till you hear about my 0GB hbm use inference algorithm)
2 - i know how subchannel quantization works. Are they hitting those reported latency numbers with per layer cpu pingpong to rescale?