
I was confused about this too, so maybe I can help. I used to think that a mixture-of-experts model meant you had, say, 8 separate parallel models, and at inference time you decided which one to route each request to. That's not the case; the mixing happens at a much smaller scale.

Instead, the mixture of experts lives inside individual layers. Suppose we want a big feed-forward layer that takes a 1024-element vector as input, has a hidden size of 8192, and an output size of 1024. We carve that 8192-wide hidden layer into eight 1024-sized chunks (the chunk size doesn't have to match the input size). Whenever an input arrives at this layer, a routing function decides which of those 1024-sized chunks serves as the hidden layer. Every token within a single prompt/response can choose a different chunk when it passes through this layer, and every layer makes its own routing decision. So with 100 layers, each with 8 experts, there are 8^100 possible paths an individual token could take through the network.
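To make that concrete, here's a rough Python/numpy sketch of a top-1 routed MoE feed-forward layer. All the names (moe_layer, W_router, W_in, W_out) and the argmax router are just illustrative assumptions; real implementations usually take a softmax over the router logits and often route each token to its top-2 experts with a weighted sum, but the "each token picks one chunk" version matches the description above:

    import numpy as np

    d_model = 1024    # input/output width from the example
    d_hidden = 1024   # each expert's hidden width (one 1024-sized chunk)
    n_experts = 8

    rng = np.random.default_rng(0)

    # Router: one score per expert for each token.
    W_router = rng.normal(size=(d_model, n_experts))

    # Each expert is its own small feed-forward block: d_model -> d_hidden -> d_model.
    W_in  = rng.normal(size=(n_experts, d_model, d_hidden)) * 0.02
    W_out = rng.normal(size=(n_experts, d_hidden, d_model)) * 0.02

    def moe_layer(x):
        # x: (n_tokens, d_model). Each token is routed independently.
        scores = x @ W_router                  # (n_tokens, n_experts)
        chosen = scores.argmax(axis=-1)        # top-1 expert per token
        y = np.empty_like(x)
        for e in range(n_experts):
            mask = chosen == e
            if not mask.any():
                continue
            h = np.maximum(x[mask] @ W_in[e], 0.0)   # ReLU hidden activations
            y[mask] = h @ W_out[e]                   # project back to d_model
        return y

    tokens = rng.normal(size=(5, d_model))     # 5 tokens from one prompt
    out = moe_layer(tokens)                    # each token only used 1 of the 8 experts

The point is that only one 1024-wide chunk of weights is multiplied per token, so the layer has the parameter count of an 8192-wide hidden layer but roughly the per-token compute of a 1024-wide one; stacking 100 such layers is what gives the 8^100 possible paths.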


