
Llama 3 narrowly beats Arctic at SQL generation (80.2 vs. 79.0), and Mixtral 8x22B scored 79.2.

You'd think SQL would be the one thing they'd be sure to smoke other models on.

0 - https://www.snowflake.com/blog/arctic-open-efficient-foundat...



Actually, Snowflake doesn’t use Arctic for SQL codegen internally. They use a different model chained with mistral-large… and they do smoke the competition. https://medium.com/snowflake/1-1-3-how-snowflake-and-mistral...



Yeah but that's a 70B model. You can see on the Inference Efficiency chart that it takes more than 3x as much compute to run it compared to this one.


Most people are VRAM constrained, not compute constrained.


But those people usually have more system RAM than VRAM.

At those scales, most people become bandwidth and compute constrained using CPU inference instead of multiple GPUs. In those cases, an MoE with a low number of active parameters is the fastest.
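
A rough sketch of why that is: CPU decoding is roughly bounded by memory bandwidth divided by the bytes of weights read per token, so fewer active parameters means more tokens per second. The bandwidth figure and 4-bit quantization below are assumptions, not measurements; 17B is Arctic's published active-parameter count.

    # Back-of-the-envelope decode speed under a memory-bandwidth bound.
    # Assumed numbers: ~80 GB/s for dual-channel DDR5, 4-bit quantized weights.
    BANDWIDTH_GB_S = 80      # assumed system RAM bandwidth
    BYTES_PER_PARAM = 0.5    # assumed 4-bit quantization

    def max_tokens_per_s(active_params_b):
        """Upper bound: every active weight is read once per decoded token."""
        weight_gb_per_token = active_params_b * BYTES_PER_PARAM
        return BANDWIDTH_GB_S / weight_gb_per_token

    print(f"dense 70B:        ~{max_tokens_per_s(70):.1f} tok/s")
    print(f"MoE, 17B active:  ~{max_tokens_per_s(17):.1f} tok/s")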


Cloud providers aren’t though.


But you do need to hold all 128 experts in memory? Or not?

Or they simply measure inference efficiency as latency.


Arctic dev here. Yes, keeping all experts in memory is the recommendation here, and understandably that is a barrier for some. But once you have 1 H100 node or two (GPU middle class, I guess...?), a few things to note:

1. FP6/FP8 inference is pretty good. How-to on a single node: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main... (vLLM support coming soon!)

2. Small number of activated parameters shine in batch inference case for cloud providers.
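
A back-of-the-envelope check on why one 8x H100 node (640GB of HBM) is roughly the floor. 480B total parameters is the published figure; KV cache and activation overheads are ignored here.

    # Rough weight-memory footprint of a 480B-parameter model at different precisions.
    TOTAL_PARAMS_B = 480
    H100_HBM_GB = 80

    for bits in (16, 8, 6):
        weights_gb = TOTAL_PARAMS_B * bits / 8
        gpus = weights_gb / H100_HBM_GB
        print(f"FP{bits:>2}: ~{weights_gb:.0f} GB of weights -> "
              f"at least {gpus:.1f} H100s before KV cache and activations")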


> 2. Small number of activated parameters shine in batch inference case for cloud providers

Could you elaborate, please? Batch inference activates pretty much all the experts, since each token in every sequence in the batch can hit a different expert. So at BS=128 you're not really getting a sparsity win.
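
A quick simulation of that intuition. Uniform random top-2 routing over 128 experts is an assumption here; real routers are learned and less uniform, but the coverage picture is similar.

    import random

    # How many of 128 experts get hit in one forward pass of a batch,
    # assuming top-2 routing and (simplistically) uniform random gating.
    NUM_EXPERTS, TOP_K = 128, 2
    random.seed(0)

    def experts_hit(num_tokens):
        hit = set()
        for _ in range(num_tokens):
            hit.update(random.sample(range(NUM_EXPERTS), TOP_K))
        return len(hit)

    for batch, seq_len in [(1, 1), (1, 512), (128, 512)]:
        tokens = batch * seq_len
        print(f"batch={batch:<3} seq={seq_len:<4} "
              f"distinct experts hit: {experts_hit(tokens)}/{NUM_EXPERTS}")

Already at a few hundred tokens essentially every expert gets touched, so the memory-side sparsity does vanish at batch; whatever win remains is on the compute side, since FLOPs per token still scale with the active parameter count.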


That is my reading too: if you treat latency as the primary inference metric, then you need the whole model in memory all the time.

What is your 70B configuration? Did you try TP=8 for the 70B model for a fair comparison?


1 H100 is only 80GB of HBM. I guess you mean 1 node is a server with 4x H100?


This is essentially a ~480B-parameter model. At FP8, and comparing to Grok's 320B model, which requires 320GB of VRAM in int4, I think what the OP meant is actually 8x H100.

Which is ... a lot to say the least.

And all that optimization is for latency, not throughput, because with 8 H100s you can easily host 4 replicas of a 70B model.


Thanks for the correction; it is indeed 8x H100 per node. https://developer.nvidia.com/blog/introducing-nvidia-hgx-h10...


I believe the main draw of an MoE model is that the experts don't all need to be in memory at once. They can be swapped based on context. In aggregate you get the performance of a much larger model (384B parameters) while using much less memory than such a model would require. If you had enough memory it could all be loaded, but it doesn't need to be.


Technically you could, but it would take much longer to do all that swapping.


"Expert" in MoE has no bearing on what you might think of as a human expert.

It's not like there is one expert that is proficient at science, and one that is proficient in history.

For a given inference request, you're likely to activate all the experts at various points. But for each individual forward pass (e.g. each token), you are only activating a few.
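
A minimal sketch of that per-token routing (the hidden size and top-k value are illustrative, not Arctic's actual configuration):

    import torch

    # Per-token top-k gating: each token picks its own small subset of experts,
    # so a long request ends up touching most experts, but any single forward
    # pass only runs a few of them per token.
    num_experts, top_k, d_model = 128, 2, 16
    tokens = torch.randn(4, d_model)               # 4 tokens
    router = torch.nn.Linear(d_model, num_experts)

    logits = router(tokens)                        # (4, 128) routing scores
    weights, chosen = logits.softmax(-1).topk(top_k, dim=-1)
    for t, experts in enumerate(chosen.tolist()):
        print(f"token {t}: routed to experts {experts}")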


Wrong. MoE models like this one usually choose a different and unpredictable mix of experts for each token, so you need all parameters in memory at once.

It reduces the number of parameters that need to be moved from memory to the compute chip for each token, not from disk to memory.
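
Putting rough numbers on that, assuming FP8 weights (1 byte per parameter) and the published 480B-total / 17B-active split:

    # Weights that must stay resident vs. weights streamed to the compute
    # units per decoded token.
    TOTAL_PARAMS, ACTIVE_PARAMS = 480e9, 17e9
    BYTES_PER_PARAM = 1  # FP8

    print(f"resident in memory: ~{TOTAL_PARAMS * BYTES_PER_PARAM / 1e9:.0f} GB")
    print(f"read per token:     ~{ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9:.0f} GB")
    print(f"per-token reduction vs dense: {TOTAL_PARAMS / ACTIVE_PARAMS:.0f}x")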



