High-Level Design
The Mixture of Experts (MoE) architecture is a sparse model design where each input token is routed to a small subset of specialized “expert” sub-networks rather than passing through the entire model. This decouples total parameter count from per-example compute: a model can have hundreds of billions of parameters while activating only a fraction for each token. The approach builds on mixture-of-experts concepts dating back to Jacobs et al. (1991) but has been adapted for modern transformer-based language models.
Key Components
- Expert networks. Each expert is typically a feed-forward network with the same architecture as the standard transformer FFN. In a given layer, multiple experts exist in parallel, each potentially specializing in different types of inputs.
- Router (gating network). A learned routing function that assigns each token to one or more experts. The router produces a probability distribution over experts and selects the top-k for each token.
- Capacity factor. Controls how many tokens each expert can process per batch. Setting the capacity factor too low causes token dropping; too high wastes compute on padding.
- Load balancing loss. An auxiliary loss term that encourages the router to distribute tokens evenly across experts, preventing collapse where a few experts handle most traffic.
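The components above can be sketched together in a toy router. This is a minimal NumPy sketch of top-k routing (all shapes and names here are illustrative, not from any specific implementation): the router projects each token to expert logits, takes a softmax, and keeps the top-k experts with renormalized gate weights.

```python
import numpy as np

def route_tokens(hidden, w_router, k=2):
    """Toy top-k router sketch (hypothetical shapes, not from any paper).

    hidden:   (num_tokens, d_model) token representations
    w_router: (d_model, num_experts) learned routing weights
    Returns per-token expert indices and their normalized gate weights.
    """
    logits = hidden @ w_router                      # (tokens, experts)
    # Softmax over experts -> probability distribution per token.
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # Select the top-k experts for each token.
    topk_idx = np.argsort(probs, axis=-1)[:, -k:]   # (tokens, k)
    topk_p = np.take_along_axis(probs, topk_idx, axis=-1)
    # Renormalize the selected gates so they sum to 1 per token.
    topk_p /= topk_p.sum(-1, keepdims=True)
    return topk_idx, topk_p

rng = np.random.default_rng(0)
idx, gates = route_tokens(rng.normal(size=(4, 16)), rng.normal(size=(16, 8)), k=2)
print(idx.shape, gates.shape)  # (4, 2) (4, 2)
```

A real implementation would additionally enforce the capacity factor (dropping or overflowing tokens once an expert's buffer fills) and add the auxiliary load-balancing loss during training; this sketch shows only the selection step.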
Variants
- Sparsely-gated MoE (Shazeer et al., 2017): noisy top-k gating applied to LSTM language models, an early large-scale demonstration of conditional computation.
- Switch Transformer (Fedus et al., 2021): simplified routing to a single expert per token (top-1), reducing communication costs and improving training stability. Achieved 4-7x pre-training speedups over dense T5 baselines at equivalent compute.
- GShard (Lepikhin et al., 2020, Google): scaled MoE to 600B parameters for multilingual machine translation using top-2 routing.
- Mixtral (Mistral AI, 2024): 8 experts per FFN layer with top-2 routing, 46.7B total parameters with 12.9B active per token. Matched or exceeded Llama 2 70B while using far less inference compute.
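The total-versus-active split in configurations like Mixtral's follows directly from the layer arithmetic. A back-of-envelope sketch (simplified to two weight matrices per expert FFN and counting FFN weights only; real models add attention, embedding, router, and gated-FFN parameters, so the figures below are illustrative, not the released config):

```python
def moe_param_counts(num_layers, d_model, d_ff, num_experts, k):
    """FFN-only parameter count for a simplified MoE stack.

    Assumes each expert FFN has two matrices, (d_model x d_ff) and
    (d_ff x d_model); attention, embeddings, and router are ignored.
    """
    per_expert = 2 * d_model * d_ff
    total = num_layers * num_experts * per_expert   # stored parameters
    active = num_layers * k * per_expert            # used per token (top-k)
    return total, active

# Hypothetical Mixtral-like shapes, chosen for illustration.
total, active = moe_param_counts(num_layers=32, d_model=4096, d_ff=14336,
                                 num_experts=8, k=2)
print(f"total FFN params: {total/1e9:.1f}B, active per token: {active/1e9:.1f}B")
# total FFN params: 30.1B, active per token: 7.5B
```

With top-2 routing over 8 experts, only a quarter of the expert parameters participate in any single token's forward pass, which is the decoupling of parameter count from per-token compute described above.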
Training Details
MoE models are trained similarly to dense transformers but require careful tuning of the load balancing loss coefficient, capacity factor, and expert count. Training instability is a known challenge: experts can collapse to handling similar inputs, or the router can degenerate. The Switch Transformer addressed instability with selective precision (float32 for the router, bfloat16 elsewhere) and simplified single-expert routing.
Strengths and Weaknesses
Strengths. Dramatically better scaling efficiency: total parameters grow without a proportional compute increase. Enables training larger models within fixed compute budgets. Follows favorable scaling laws when measured against compute-matched dense baselines.
Weaknesses. Higher memory requirements since all expert parameters must be loaded even though only a subset is active. Communication overhead in distributed settings when tokens must be routed across devices. Training instability and load imbalance remain practical challenges. Expert specialization is not well understood.
Notable Models
The Switch Transformer demonstrated MoE at scale in a transformer framework. Mixtral showed MoE can produce competitive open-weight models. GPT-4 is widely rumored to use an MoE architecture, though this has not been officially confirmed. The approach is increasingly common in frontier LLMs seeking to maximize capability per unit of inference compute.