Sounds interesting, can you explain the conceptual differences between your appr...

danlenton · on May 23, 2024

Sure! Basically traditional MoE has several linear layers, and the network learns to route down those paths, based on the training loss (similar to how CNNs learn through max-pooling, which is also non-differentiable). However, MoEs have been shown to specialiaze on tokens, not high-level semantics. This was eloquently explained by Fuzhao Xue, author of OpenMoE, in one of our reading groups: https://www.youtube.com/watch?v=k3QOpJA0A0Q&t=1547s

In contrast, our router sits at a higher level of the stack, sending prompts to different models and providers based on quality on the prompt distribution, speed and cost. Happy to clarify further if helpful!