phire 10 months ago | on: Run DeepSeek R1 Dynamic 1.58-bit

Interesting. I had assumed the performance advantage for MoE came from minimising traffic between GPUs. But if routing is per layer, then it's going to massively increase inter-GPU traffic compared to vertical slicing.

I guess that means the performance advantage actually comes when batching thousands of queries? The MoE routing would mean that on each MoE layer, each GPU shard gets a batch of queries that all hit roughly the same subset of experts (and read the same weights from memory). The batches are then reshuffled between MoE layers to re-optimise.

It's kind of like GPU raytracing, where you get large performance gains by running coherency sorting on rays and batching similar rays together.
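A minimal sketch of the regrouping idea described above, assuming a hypothetical top-2 router over 4 experts. The `assignments` table and the `group_by_expert` helper are made up for illustration and are not taken from any real inference stack:

```python
from collections import defaultdict

def group_by_expert(assignments):
    """Group token indices by the expert they were routed to.

    assignments[i] is the list of expert ids chosen for token i
    (hypothetical router output for a single MoE layer).
    """
    groups = defaultdict(list)
    for token_id, experts in enumerate(assignments):
        for expert_id in experts:
            groups[expert_id].append(token_id)
    return dict(groups)

# Hypothetical routing decisions: 6 tokens, top-2 of 4 experts.
assignments = [[0, 2], [0, 1], [2, 3], [0, 2], [1, 3], [0, 2]]
groups = group_by_expert(assignments)

# Each expert now processes its tokens as one batch, so its weights
# are read from memory once per batch rather than once per token.
for expert_id in sorted(groups):
    print(expert_id, groups[expert_id])
```

The routing differs at every MoE layer, so this grouping would have to be redone per layer, which is the reshuffling step.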

yorwba 10 months ago

The performance advantage comes from doing 1/32 of the floating point operations compared to a dense layer with the same number of parameters.
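As a rough check on the 1/32 figure, here is the arithmetic under the assumption that each MoE layer has 256 routed experts of equal size and activates 8 of them per token. Those two numbers are an assumption chosen to reproduce the ratio, and any shared experts or attention layers are ignored:

```python
def active_fraction(num_experts, experts_per_token):
    """Fraction of a MoE layer's expert parameters used for one token.

    Assumes all experts have the same size, so the FLOP count scales
    with the number of experts a token is routed to.
    """
    return experts_per_token / num_experts

# Assumed configuration: 256 routed experts, 8 active per token.
frac = active_fraction(256, 8)
print(frac)  # 0.03125, i.e. 1/32
```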

iamnotagenius 10 months ago

The performance comes mostly from needing only a fraction of the memory bandwidth, as LLMs are mostly memory-bound. Compute matters too, but usually far less than memory.
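A back-of-the-envelope sketch of the bandwidth argument, assuming single-query decoding where every active weight is read from memory once per generated token. The parameter count, bytes per parameter, and bandwidth below are illustrative round numbers, not measurements of any real model or GPU:

```python
def max_tokens_per_second(active_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on decode speed when weight reads are the bottleneck."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token

# Illustrative assumptions: 100e9 parameters, 1 byte each, 1 TB/s of bandwidth.
total_params = 100e9
bandwidth = 1e12

dense = max_tokens_per_second(total_params, 1, bandwidth)
moe = max_tokens_per_second(total_params / 32, 1, bandwidth)  # 1/32 of weights active

print(dense, moe)
```

With these numbers the bound rises by the same factor of 32 as the reduction in weights read. The estimate only holds for small batches: once a large batch touches most experts, nearly all weights are read anyway.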
