Definition
Mixture of Experts (MoE) is an architecture paradigm where a model contains multiple “expert” sub-networks, and a routing mechanism selects a sparse subset of experts to process each input. This allows the total number of parameters to scale independently from the per-example computation cost.
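The routing described above can be sketched in a few lines of NumPy. This is a minimal, hypothetical illustration (dense matrices as "experts", softmax router, top-k selection), not any particular library's implementation:

```python
import numpy as np

def moe_layer(x, expert_weights, router_weights, k=2):
    """Minimal top-k MoE sketch (hypothetical, for illustration only).

    x: (tokens, d_model) inputs
    expert_weights: list of (d_model, d_model) expert matrices
    router_weights: (d_model, n_experts) router projection
    """
    logits = x @ router_weights                  # router score per (token, expert)
    top_k = np.argsort(logits, axis=-1)[:, -k:]  # indices of each token's k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = top_k[t]
        # softmax over only the selected experts' logits to get gate values
        g = np.exp(logits[t, sel] - logits[t, sel].max())
        gates = g / g.sum()
        # each token runs through k experts, not all n_experts
        for gate, e in zip(gates, sel):
            out[t] += gate * (x[t] @ expert_weights[e])
    return out
```

Note that per-token compute scales with k, while total parameters scale with the number of experts, which is the decoupling the definition refers to.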
Key Intuition
Not all inputs require the same processing. By maintaining a large pool of specialized experts and routing each input to only the most relevant ones, MoE achieves the capacity of a much larger dense model while using only a fraction of the compute per forward pass. This decouples model capacity from inference cost.
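The arithmetic behind this decoupling can be made concrete. All sizes below are hypothetical, chosen only to show how total and active parameter counts diverge:

```python
# Hypothetical sizes illustrating capacity/compute decoupling in an MoE model.
n_experts, top_k = 8, 2
expert_params = 7e9   # parameters per expert (hypothetical)
shared_params = 1e9   # attention/embedding params every token uses (hypothetical)

total = n_experts * expert_params + shared_params   # what must be stored
active = top_k * expert_params + shared_params      # what each token computes with

print(f"total: {total/1e9:.0f}B, active per token: {active/1e9:.0f}B")
# Adding more experts grows `total` but leaves `active` unchanged.
```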
History/Origin
Jacobs et al. (1991) introduced the mixture-of-experts concept. Shazeer et al. (2017) revived it at Google, inserting sparsely gated MoE layers into LSTM-based models and scaling to 137B parameters. The Switch Transformer (Fedus et al., 2021) simplified routing to a single expert per token, reducing communication costs and scaling to over 1 trillion parameters. Subsequent models such as GLaM, ST-MoE, and Mixtral (Mistral AI, 2024) refined the approach for practical deployment.
Relationship to Other Concepts
MoE is most often deployed as a transformer variant, replacing the feed-forward block of each layer with a routed set of experts; see mixture-of-experts-architecture for the general design pattern. It interacts with scaling-laws by allowing parameter count to grow without proportional compute cost, though this sparsity complicates standard scaling analysis. MoE models benefit from flash-attention and other efficiency techniques, and expert routing is conceptually related to attention as a form of learned, content-dependent selection.
Notable Results
Switch Transformer scaled to 1.6T parameters while reaching the quality of dense T5-Base/Large baselines with substantially faster pre-training. Mixtral 8x7B (2024) demonstrated that a 47B-parameter MoE model could match or exceed llama-2 70B on most benchmarks while activating only about 13B parameters per token. GShard scaled MoE to 600B parameters for multilingual machine translation.
Open Questions
- Load balancing: ensuring all experts receive roughly equal utilization without sacrificing quality.
- Expert specialization: whether experts learn meaningful specializations or remain largely interchangeable.
- Serving efficiency: MoE models require all expert parameters in memory, creating deployment challenges despite lower compute per token.