
Llama 3 narrowly beats Arctic at SQL generation (80.2 vs. 79.0), and Mixtral 8x22B scored 79.2.

You'd think SQL would be the one thing they'd be sure to smoke other models on.

0 - https://www.snowflake.com/blog/arctic-open-efficient-foundat...



Actually, Snowflake doesn’t use Arctic for SQL codegen internally. They use a different model chained with mistral-large… and they do smoke the competition. https://medium.com/snowflake/1-1-3-how-snowflake-and-mistral...



Yeah but that's a 70B model. You can see on the Inference Efficiency chart that it takes more than 3x as much compute to run it compared to this one.


Most people are VRAM constrained, not compute constrained.


But those people usually have more system RAM than VRAM.

At those scales, most people become bandwidth and compute constrained using CPU inference instead of multiple GPUs. In those cases, an MoE with a low number of active parameters is the fastest.
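
A rough sketch of why that is: CPU decoding is roughly bounded by memory bandwidth divided by the bytes of weights read per token, so fewer active parameters means more tokens per second. The bandwidth figure and 4-bit quantization below are assumptions, not measurements; 17B is Arctic's published active-parameter count.

    # Back-of-the-envelope decode speed under a memory-bandwidth bound.
    # Assumed numbers: ~80 GB/s for dual-channel DDR5, 4-bit quantized weights.
    BANDWIDTH_GB_S = 80      # assumed system RAM bandwidth
    BYTES_PER_PARAM = 0.5    # assumed 4-bit quantization

    def max_tokens_per_s(active_params_b):
        """Upper bound: every active weight is read once per decoded token."""
        weight_gb_per_token = active_params_b * BYTES_PER_PARAM
        return BANDWIDTH_GB_S / weight_gb_per_token

    print(f"dense 70B:        ~{max_tokens_per_s(70):.1f} tok/s")
    print(f"MoE, 17B active:  ~{max_tokens_per_s(17):.1f} tok/s")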


Cloud providers aren’t though.


But you do need to hold all 128 experts in memory? Or not?

Or they simply measure inference efficiency as latency.


Arctic dev here. Yes, keeping all experts in memory is the recommendation here, and understandably that is a barrier for some. But once you have 1 H100 node or two (GPU middle class, I guess...?), a few things to note:

1. FP6/FP8 inference is pretty good. How-to on a single node: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main... (vLLM support coming soon!)

2. Small number of activated parameters shine in batch inference case for cloud providers.
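
A back-of-the-envelope check on why one 8x H100 node (640GB of HBM) is roughly the floor. 480B total parameters is the published figure; KV cache and activation overheads are ignored here.

    # Rough weight-memory footprint of a 480B-parameter model at different precisions.
    TOTAL_PARAMS_B = 480
    H100_HBM_GB = 80

    for bits in (16, 8, 6):
        weights_gb = TOTAL_PARAMS_B * bits / 8
        gpus = weights_gb / H100_HBM_GB
        print(f"FP{bits:>2}: ~{weights_gb:.0f} GB of weights -> "
              f"at least {gpus:.1f} H100s before KV cache and activations")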


> 2. Small number of activated parameters shine in batch inference case for cloud providers

Could you elaborate, please? Batch inference activates pretty much all the experts, since each token in every sequence in the batch can hit a different expert. So at BS=128 you're not really getting a sparsity win.
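
A quick simulation of that intuition. Uniform random top-2 routing over 128 experts is an assumption here; real routers are learned and less uniform, but the coverage picture is similar.

    import random

    # How many of 128 experts get hit in one forward pass of a batch,
    # assuming top-2 routing and (simplistically) uniform random gating.
    NUM_EXPERTS, TOP_K = 128, 2
    random.seed(0)

    def experts_hit(num_tokens):
        hit = set()
        for _ in range(num_tokens):
            hit.update(random.sample(range(NUM_EXPERTS), TOP_K))
        return len(hit)

    for batch, seq_len in [(1, 1), (1, 512), (128, 512)]:
        tokens = batch * seq_len
        print(f"batch={batch:<3} seq={seq_len:<4} "
              f"distinct experts hit: {experts_hit(tokens)}/{NUM_EXPERTS}")

Already at a few hundred tokens essentially every expert gets touched, so the memory-side sparsity does vanish at batch; whatever win remains is on the compute side, since FLOPs per token still scale with the active parameter count.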


That is my reading too: if you treat latency as the primary inference metric, then you need the whole model in memory all the time.

What is your 70B configuration? Did you try TP=8 for the 70B model for a fair comparison?


1 H100 is only 80GB of HBM. I guess you mean 1 node is a server with 4x H100?


This is essentially a ~480B-parameter model. At FP8, and comparing to Grok's 320B model, which requires 320GB of VRAM in int4, I think what the OP meant is actually 8x H100.

Which is ... a lot to say the least.

And all that optimization is for latency, not throughput, because with 8 H100s you can easily host 4 replicas of a 70B model.


Thanks for the correction; it is indeed 8x H100 per node. https://developer.nvidia.com/blog/introducing-nvidia-hgx-h10...


I believe the main draw of an MoE model is that the experts don't all need to be in memory at once. They can be swapped based on context. In aggregate you get the performance of a much larger model (384B parameters) while using much less memory than such a model would require. If you had enough memory it could all be loaded, but it doesn't need to be.


Technically you could, but it would take much longer to do all that swapping.


"Expert" in MoE has no bearing on what you might think of as a human expert.

It's not like there is one expert that is proficient at science, and one that is proficient in history.

For a given inference request, you're likely to activate all the experts at various points. But for each individual forward pass (e.g. each token), you are only activating a few.
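
A minimal sketch of that per-token routing (the hidden size and top-k value are illustrative, not Arctic's actual configuration):

    import torch

    # Per-token top-k gating: each token picks its own small subset of experts,
    # so a long request ends up touching most experts, but any single forward
    # pass only runs a few of them per token.
    num_experts, top_k, d_model = 128, 2, 16
    tokens = torch.randn(4, d_model)               # 4 tokens
    router = torch.nn.Linear(d_model, num_experts)

    logits = router(tokens)                        # (4, 128) routing scores
    weights, chosen = logits.softmax(-1).topk(top_k, dim=-1)
    for t, experts in enumerate(chosen.tolist()):
        print(f"token {t}: routed to experts {experts}")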


Wrong. MoE models like this one usually choose a different and unpredictable mix of experts for each token, so you need all parameters in memory at once.

It reduces the number of parameters that need to be moved from memory to the compute chip for each token, not from disk to memory.
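
Putting rough numbers on that, assuming FP8 weights (1 byte per parameter) and the published 480B-total / 17B-active split:

    # Weights that must stay resident vs. weights streamed to the compute
    # units per decoded token.
    TOTAL_PARAMS, ACTIVE_PARAMS = 480e9, 17e9
    BYTES_PER_PARAM = 1  # FP8

    print(f"resident in memory: ~{TOTAL_PARAMS * BYTES_PER_PARAM / 1e9:.0f} GB")
    print(f"read per token:     ~{ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9:.0f} GB")
    print(f"per-token reduction vs dense: {TOTAL_PARAMS / ACTIVE_PARAMS:.0f}x")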



