
Can you share at a high level how you run this model?

We know it’s 671B params total, with ~37B active per token via MoE…

If the GPUs have, say, 141GB each for an H200, then do you just load up as many experts as will fit onto a GPU?

How much do interconnects hurt performance vs being able to load the model into a single GPU?

Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size, so you have to think about running the model as a whole.
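
To make that concrete, here's a toy back-of-the-envelope simulation (uniform random routing, which real routers aren't; the expert counts are DeepSeek-V3's 256 routed experts with 8 active per token):

    # Toy simulation of why large batches touch every expert: DeepSeek-V3
    # routes each token to 8 of 256 routed experts per MoE layer, so the
    # set of distinct experts hit grows quickly with batch size.
    # (Uniform-random routing is an assumption; real routers are not uniform.)
    import random

    NUM_EXPERTS = 256   # routed experts per MoE layer in DeepSeek-V3
    TOP_K = 8           # experts activated per token

    def distinct_experts_hit(batch_tokens: int) -> int:
        hit = set()
        for _ in range(batch_tokens):
            hit.update(random.sample(range(NUM_EXPERTS), TOP_K))
        return len(hit)

    for batch in (1, 8, 64, 512):
        print(batch, distinct_experts_hit(batch))
    # roughly: 1 -> 8, 8 -> ~57, 64 -> ~220, 512 -> 256 (every expert is needed)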

There are two ways we can run it:

- 8xH200 GPU == 8x141GB == 1128 GB VRAM

- 16xH100 GPU == 16x80GB == 1280 GB VRAM
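
Rough memory math behind those numbers (just a sketch -- the weights are ~671 GB in the FP8 format DeepSeek-V3 ships in, and real serving also needs KV cache, activations, and runtime overhead):

    # Back-of-the-envelope VRAM check (a sketch: real serving also needs
    # KV cache, activations, and runtime overhead, so the headroom matters
    # more than the exact numbers).
    PARAMS_B = 671        # total parameters, in billions
    BYTES_PER_PARAM = 1   # assuming the FP8 weights DeepSeek-V3 ships

    weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~671 GB of weights

    for name, gpus, vram_per_gpu in [("8xH200", 8, 141), ("16xH100", 16, 80)]:
        total = gpus * vram_per_gpu
        print(f"{name}: {total} GB total, ~{total - weights_gb} GB left for KV cache")
    # 8xH200: 1128 GB total, ~457 GB left for KV cache
    # 16xH100: 1280 GB total, ~609 GB left for KV cache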

Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.

More than that (e.g. 16xH100) requires multi-node inference, which very few places have solved at a production-ready level -- but it matters a lot, because there are far more H100s out there than H200s.
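
The topology decision looks roughly like this (a hypothetical helper, not any particular framework's API): keep tensor parallelism inside a node where GPUs talk over NVLink, and only span nodes when one node's VRAM isn't enough:

    # Hypothetical helper (not any framework's real API) for the topology
    # choice described above: tensor parallelism stays inside a node where
    # GPUs communicate over NVLink; only when one node's VRAM isn't enough
    # do you add more nodes, e.g. via pipeline parallelism over the slower
    # inter-node interconnect.
    import math

    def parallelism_plan(model_gb: float, gpus_per_node: int, vram_per_gpu: int) -> dict:
        node_vram_gb = gpus_per_node * vram_per_gpu
        nodes_needed = math.ceil(model_gb / node_vram_gb)
        return {
            "tensor_parallel": gpus_per_node,    # fast intra-node links
            "pipeline_parallel": nodes_needed,   # slow inter-node links
        }

    print(parallelism_plan(671, gpus_per_node=8, vram_per_gpu=141))
    # {'tensor_parallel': 8, 'pipeline_parallel': 1}   -> single 8xH200 node
    print(parallelism_plan(671, gpus_per_node=8, vram_per_gpu=80))
    # {'tensor_parallel': 8, 'pipeline_parallel': 2}   -> two 8xH100 nodes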


> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size

In their V3 paper, DeepSeek talk about keeping redundant copies of some "experts" when deploying with expert parallelism, to account for the uneven load the experts receive. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
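
My rough reading of that trick, as a toy sketch (the loads and replica budget below are made up for illustration): count how many tokens each expert gets, then hand extra replica slots to whichever expert currently has the highest load per copy:

    # Toy version of the redundant-experts idea (my simplified reading of
    # the V3 paper; the loads and replica budget below are made up):
    # give extra replica slots to whichever expert has the highest
    # routed load per existing copy.
    from collections import Counter

    def plan_replicas(expert_load: Counter, extra_slots: int) -> Counter:
        replicas = Counter({e: 1 for e in expert_load})   # start with one copy each
        for _ in range(extra_slots):
            hottest = max(expert_load, key=lambda e: expert_load[e] / replicas[e])
            replicas[hottest] += 1
        return replicas

    load = Counter({"e0": 900, "e1": 500, "e2": 100, "e3": 50})  # tokens routed per expert
    print(plan_replicas(load, extra_slots=2))
    # Counter({'e0': 2, 'e1': 2, 'e2': 1, 'e3': 1})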
