
Can you share at a high level how you run this model?

We know it’s 671B params total, with ~37B active per token via MoE…

If the GPUs have, say, 141GB each for an H200, then do you just load up as many experts as will fit onto a GPU?

How much do interconnects hurt performance vs being able to load the model into a single GPU?

Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size, so you have to think about running the model as a whole.
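
To make that concrete, here's a toy back-of-the-envelope simulation (uniform random routing, which real routers aren't; the expert counts are DeepSeek-V3's 256 routed experts with 8 active per token):

    # Toy simulation of why large batches touch every expert: DeepSeek-V3
    # routes each token to 8 of 256 routed experts per MoE layer, so the
    # set of distinct experts hit grows quickly with batch size.
    # (Uniform-random routing is an assumption; real routers are not uniform.)
    import random

    NUM_EXPERTS = 256   # routed experts per MoE layer in DeepSeek-V3
    TOP_K = 8           # experts activated per token

    def distinct_experts_hit(batch_tokens: int) -> int:
        hit = set()
        for _ in range(batch_tokens):
            hit.update(random.sample(range(NUM_EXPERTS), TOP_K))
        return len(hit)

    for batch in (1, 8, 64, 512):
        print(batch, distinct_experts_hit(batch))
    # roughly: 1 -> 8, 8 -> ~57, 64 -> ~220, 512 -> 256 (every expert is needed)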

There are two ways we can run it:

- 8xH200 GPU == 8x141GB == 1128 GB VRAM

- 16xH100 GPU == 16x80GB == 1280 GB VRAM
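
Rough memory math behind those numbers (just a sketch -- the weights are ~671 GB in the FP8 format DeepSeek-V3 ships in, and real serving also needs KV cache, activations, and runtime overhead):

    # Back-of-the-envelope VRAM check (a sketch: real serving also needs
    # KV cache, activations, and runtime overhead, so the headroom matters
    # more than the exact numbers).
    PARAMS_B = 671        # total parameters, in billions
    BYTES_PER_PARAM = 1   # assuming the FP8 weights DeepSeek-V3 ships

    weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~671 GB of weights

    for name, gpus, vram_per_gpu in [("8xH200", 8, 141), ("16xH100", 16, 80)]:
        total = gpus * vram_per_gpu
        print(f"{name}: {total} GB total, ~{total - weights_gb} GB left for KV cache")
    # 8xH200: 1128 GB total, ~457 GB left for KV cache
    # 16xH100: 1280 GB total, ~609 GB left for KV cache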

Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.

More than that (e.g. 16xH100) requires multi-node inference, which very few places have solved at a production-ready level -- but it matters a lot, because there are far more H100s out there than H200s.
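
The topology decision looks roughly like this (a hypothetical helper, not any particular framework's API): keep tensor parallelism inside a node where GPUs talk over NVLink, and only span nodes when one node's VRAM isn't enough:

    # Hypothetical helper (not any framework's real API) for the topology
    # choice described above: tensor parallelism stays inside a node where
    # GPUs communicate over NVLink; only when one node's VRAM isn't enough
    # do you add more nodes, e.g. via pipeline parallelism over the slower
    # inter-node interconnect.
    import math

    def parallelism_plan(model_gb: float, gpus_per_node: int, vram_per_gpu: int) -> dict:
        node_vram_gb = gpus_per_node * vram_per_gpu
        nodes_needed = math.ceil(model_gb / node_vram_gb)
        return {
            "tensor_parallel": gpus_per_node,    # fast intra-node links
            "pipeline_parallel": nodes_needed,   # slow inter-node links
        }

    print(parallelism_plan(671, gpus_per_node=8, vram_per_gpu=141))
    # {'tensor_parallel': 8, 'pipeline_parallel': 1}   -> single 8xH200 node
    print(parallelism_plan(671, gpus_per_node=8, vram_per_gpu=80))
    # {'tensor_parallel': 8, 'pipeline_parallel': 2}   -> two 8xH100 nodes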


> Yeah so MoE doesn't really come into play for production serving -- once you are batching your requests you hit every expert at a large enough batch size

In their V3 paper, DeepSeek talk about keeping redundant copies of some "experts" when deploying with expert parallelism, to account for the uneven load the experts receive. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
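
My rough reading of that trick, as a toy sketch (the loads and replica budget below are made up for illustration): count how many tokens each expert gets, then hand extra replica slots to whichever expert currently has the highest load per copy:

    # Toy version of the redundant-experts idea (my simplified reading of
    # the V3 paper; the loads and replica budget below are made up):
    # give extra replica slots to whichever expert has the highest
    # routed load per existing copy.
    from collections import Counter

    def plan_replicas(expert_load: Counter, extra_slots: int) -> Counter:
        replicas = Counter({e: 1 for e in expert_load})   # start with one copy each
        for _ in range(extra_slots):
            hottest = max(expert_load, key=lambda e: expert_load[e] / replicas[e])
            replicas[hottest] += 1
        return replicas

    load = Counter({"e0": 900, "e1": 500, "e2": 100, "e3": 50})  # tokens routed per expert
    print(plan_replicas(load, extra_slots=2))
    # Counter({'e0': 2, 'e1': 2, 'e2': 1, 'e3': 1})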
