But those people usually have more system RAM than VRAM.
At those scales, most people end up bandwidth- and compute-constrained doing CPU inference rather than using multiple GPUs. In those cases, an MoE with a low number of active parameters is the fastest.
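Rough sketch of why (the bandwidth and model sizes below are illustrative assumptions, not benchmarks): decode on CPU is mostly memory-bandwidth-bound, so tokens/sec is roughly bandwidth divided by the bytes of weights you have to stream per token, which is exactly where a low active-parameter count pays off.

```python
# Toy estimate of memory-bound decode speed on CPU.
# Assumed numbers: ~100 GB/s usable DRAM bandwidth, int4 weights,
# a 70B dense model vs. an MoE with ~17B active params per token.
BANDWIDTH_GBPS = 100      # assumed server DRAM bandwidth
BYTES_PER_PARAM = 0.5     # int4 weights

def tokens_per_sec(params_read_per_token_billions: float) -> float:
    # Billions of params * bytes/param = GB streamed per decoded token.
    gb_per_token = params_read_per_token_billions * BYTES_PER_PARAM
    return BANDWIDTH_GBPS / gb_per_token

print(f"dense 70B       : ~{tokens_per_sec(70):.1f} tok/s")
print(f"MoE, 17B active : ~{tokens_per_sec(17):.1f} tok/s")
```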
Arctic dev here. Yes, keeping all experts in memory is the recommendation here, and understandably that is a barrier for some. But once you have 1 H100 node or two (GPU middle class, I guess...?), a few things to note:
1. FP6/FP8 inference is pretty good. How-to on a single node: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main... (vLLM support coming soon; see the sketch after this list!)
2. A small number of activated parameters shines in the batch-inference case for cloud providers.
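Since the vLLM path isn't released yet, the following is only a sketch of what single-node FP8 serving might look like once support lands; treat the quantization flag, tensor_parallel_size, and the Snowflake/snowflake-arctic-instruct checkpoint name as placeholders rather than a confirmed recipe.

```python
# Hypothetical single-node (8x H100) serving sketch, not a confirmed recipe.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Snowflake/snowflake-arctic-instruct",  # assumed target checkpoint
    quantization="fp8",        # assumes vLLM's FP8 weight quantization applies here
    tensor_parallel_size=8,    # shard across the 8 GPUs in one node
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Write a SQL query that returns the top 10 customers by revenue."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Until that lands, the GitHub repo linked above is the working how-to.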
> 2. A small number of activated parameters shines in the batch-inference case for cloud providers
Could you elaborate more, please? Batch inference activates pretty much all the experts, since each token in every sequence in a batch could hit a different expert. So at BS=128 you're not really getting a sparsity win.
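A quick sanity check of that intuition (assumptions: 128 experts per MoE layer with top-2 routing, as Snowflake describes, plus uniform routing, which real routers aren't; uniform is roughly the optimistic case for coverage):

```python
# Expected number of distinct experts touched after a given number of routed tokens.
E, TOP_K = 128, 2  # experts per MoE layer, experts chosen per token

def expected_active_experts(tokens: int) -> float:
    # Probability a given expert is never picked, assuming uniform top-2 routing.
    p_missed = (1 - TOP_K / E) ** tokens
    return E * (1 - p_missed)

for tokens in (1, 16, 128, 128 * 512):
    print(f"{tokens:>7} tokens -> ~{expected_active_experts(tokens):6.1f} / {E} experts")
```

Even a single decode step across 128 sequences already touches most of the 128 experts per layer, so the per-layer weight reads look close to dense at that batch size.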
This is essentially 400B params. With FP8, and comparing to Grok's 320B model, which requires 320GB of VRAM in int4, I think what the OP meant is actually 8 H100s.
Which is... a lot, to say the least.
And all this optimization is for latency, not throughput, because with 8 H100s you could easily host 4 replicas of a 70B model instead.
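Rough weight-footprint math behind both of those numbers (480B total is Snowflake's stated size for Arctic; the precisions and the 8x80GB node are assumptions about the setup):

```python
# Back-of-the-envelope weight footprints vs. one 8x H100 (80GB) node.
NODE_HBM_GB = 8 * 80                  # 640 GB of HBM per node

def weights_gb(params_billions: float, bits_per_param: int) -> float:
    # Billions of params * bits / 8 = GB of weights.
    return params_billions * bits_per_param / 8

print(f"Arctic 480B @ FP8 : {weights_gb(480, 8):.0f} GB")
print(f"Arctic 480B @ FP6 : {weights_gb(480, 6):.0f} GB")
print(f"4x 70B @ FP16     : {4 * weights_gb(70, 16):.0f} GB")
print(f"node HBM          : {NODE_HBM_GB} GB")
```

Weights alone fit either way, though FP8 Arctic leaves comparatively little headroom for KV cache on a single node.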
I believe the main draw of the MoE model is that the experts don't all need to be in memory at once. They can be swapped based on context. In aggregate you get the performance of a much larger model (384B parameters) while using much less memory than such a model would require. If you had enough memory it could all be loaded, but it doesn't need to be.
"Expert" in MoE has no bearing on what you might think of as a human expert.
It's not like there is one expert that is proficient at science, and one that is proficient in history.
For a given inference request, you're likely to activate all the experts at various points. But for each individual forward pass (e.g. each token), you are only activating a few.
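A toy top-2 router to illustrate that last point (this is not Arctic's actual code; the 128-expert / top-2 shape matches Snowflake's description, everything else is made up):

```python
# Toy top-2 router: each token picks its own pair of experts.
import torch

n_experts, top_k, d_model = 128, 2, 64
router = torch.nn.Linear(d_model, n_experts, bias=False)

tokens = torch.randn(10, d_model)        # 10 tokens from one request
logits = router(tokens)                  # [10, 128] routing scores
weights, expert_ids = torch.topk(logits.softmax(dim=-1), k=top_k, dim=-1)

print(expert_ids)  # each row: the 2 experts used for that token's forward pass
print(torch.unique(expert_ids).numel(), "distinct experts across 10 tokens")
```

Per token only 2 of 128 expert MLPs run, but across even a short request the union of selected experts grows quickly.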
Wrong. MoE models like this one usually choose a different and unpredictable mix of experts for each token, so you need all the parameters in memory at once.
It reduces the number of parameters that need to be moved from memory to the compute chip for each token, not from disk to memory.
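To put that distinction in numbers (480B total / 17B active are Snowflake's figures for Arctic; FP8 weights are an assumption):

```python
# What must sit resident in (V)RAM vs. what gets streamed to the compute units per token.
TOTAL_PARAMS_B, ACTIVE_PARAMS_B = 480, 17  # Arctic: total vs. active (top-2 of 128 experts)
BYTES_PER_PARAM = 1                        # assume FP8 weights

resident_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM   # all experts stay loaded
streamed_gb = ACTIVE_PARAMS_B * BYTES_PER_PARAM  # read per decoded token

print(f"resident weights : ~{resident_gb} GB")
print(f"moved per token  : ~{streamed_gb} GB")
```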
You'd think SQL would be the one thing they'd be sure to smoke other models on.
0 - https://www.snowflake.com/blog/arctic-open-efficient-foundat...