Fedus, Zoph, and Shazeer (2022) of Google Brain introduce the Switch Transformer, a simplified mixture-of-experts architecture that routes each token to a single expert instead of the top-k experts used in prior MoE work. This simplification reduces routing complexity and communication costs while enabling models to scale to over a trillion parameters with constant computational cost per token.

Problem

Mixture-of-Experts models offer a path to scaling parameters without proportionally increasing compute, but prior approaches suffered from training instability, complex routing algorithms, and high communication overhead, hindering practical adoption.

Key Contribution

A simplified routing strategy where each token is sent to exactly one expert (rather than top-2 or more), combined with training stabilization techniques including selective precision (bfloat16 for most operations, float32 for the router), a load-balancing auxiliary loss, and improved initialization. These changes enable, for the first time, stable training of large sparse models in lower-precision formats.
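The load-balancing auxiliary loss described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's Mesh-TensorFlow implementation; the formula (loss = α · N · Σᵢ fᵢ · Pᵢ, with α = 10⁻²) follows the paper, while the function and variable names are our own:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_choice, n_experts, alpha=0.01):
    """Auxiliary load-balancing loss: alpha * N * sum_i f_i * P_i.

    f_i: fraction of tokens actually dispatched to expert i
    P_i: mean router probability assigned to expert i
    The loss reaches its minimum (equal to alpha) when routing is uniform.
    """
    n_tokens = router_probs.shape[0]
    f = np.bincount(expert_choice, minlength=n_experts) / n_tokens
    p = router_probs.mean(axis=0)
    return alpha * n_experts * float(np.sum(f * p))
```

Because fᵢ comes from a hard argmax while Pᵢ is a soft probability, gradients flow through Pᵢ, nudging the router toward balanced assignments.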

Method

The Switch Transformer replaces the feed-forward layers in a T5-based encoder-decoder transformer with MoE layers, where a learned router assigns each token to one of N experts. Each expert is a standard feed-forward network. Models follow the T5-Base and T5-Large configurations, scaling up to 1.6 trillion parameters through combinations of expert, model, and data parallelism, and are pre-trained on the C4 corpus.
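The single-expert routing step can be sketched as follows. This is an illustrative NumPy sketch under our own naming, not the paper's code; the two details it does take from the paper are top-1 selection and running the router in float32 (the selective-precision fix), plus scaling each expert's output by its gate probability so the router stays differentiable:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def switch_layer(tokens, w_router, experts):
    """Top-1 'switch' routing: each token is sent to exactly one expert.

    tokens:   (n_tokens, d_model) activations
    w_router: (d_model, n_experts) router weights
    experts:  list of callables, each standing in for a feed-forward net
    """
    # Router computed in float32 for stability (selective precision).
    logits = tokens.astype(np.float32) @ w_router.astype(np.float32)
    probs = softmax(logits)                           # (n_tokens, n_experts)
    choice = probs.argmax(axis=-1)                    # top-1 expert per token
    gate = probs[np.arange(len(tokens)), choice]      # prob of chosen expert

    out = np.empty_like(tokens)
    for e, ffn in enumerate(experts):
        mask = choice == e
        if mask.any():
            # Scale the expert output by the gate probability so the
            # routing decision receives a gradient signal.
            out[mask] = gate[mask, None] * ffn(tokens[mask])
    return out, choice, probs
```

In practice each expert lives on its own device and tokens are dispatched via all-to-all communication; the loop here is only a single-device stand-in for that dispatch.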

Main Results

Switch Transformers achieve up to 7x pre-training speedups over compute-matched T5 baselines on a per-step basis: a Switch-Base model with 128 experts matches T5-Base quality in one-seventh the training steps. At the largest scale, a 1.6 trillion parameter model achieves a 4x speedup over T5-XXL. On downstream tasks, Switch Transformers improve over T5 on SuperGLUE and on multilingual benchmarks across all 101 languages in mC4. Distillation preserves 37% of the sparse model's quality gain when compressed into a dense model the size of T5-Base.

Limitations

Sparse models have large memory footprints despite constant FLOPs, complicating deployment. Fine-tuning gains are smaller than pre-training gains. Token dropping under heavy load imbalance can degrade quality. The routing mechanism lacks guarantees on expert utilization.
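The token-dropping limitation comes from the expert-capacity mechanism: each expert processes at most capacity = ⌈capacity_factor · n_tokens / n_experts⌉ tokens per batch, and overflow tokens skip the expert, passing through the layer only via the residual connection. A rough sketch of that bookkeeping (the helper name and token-order priority are our own illustration):

```python
import numpy as np

def apply_capacity(choice, n_experts, capacity_factor=1.0):
    """Mark which tokens fit within their chosen expert's capacity.

    choice: (n_tokens,) array of per-token expert assignments.
    Tokens beyond an expert's capacity are 'dropped' (kept[i] == False),
    meaning they bypass the expert via the residual connection.
    """
    n_tokens = len(choice)
    capacity = int(np.ceil(capacity_factor * n_tokens / n_experts))
    kept = np.zeros(n_tokens, dtype=bool)
    counts = np.zeros(n_experts, dtype=int)
    for i, e in enumerate(choice):  # earlier tokens get priority here
        if counts[e] < capacity:
            counts[e] += 1
            kept[i] = True
    return kept
```

Raising the capacity factor above 1.0 trades memory and communication for fewer dropped tokens, which is why heavy load imbalance degrades quality.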

Impact

Switch Transformers demonstrated that MoE architectures could be practically scaled to trillions of parameters, influencing subsequent sparse models including Google’s GLaM and ST-MoE. The single-expert routing simplification became the standard approach in later MoE work and informed designs used in models like Mixtral.

Sources

  • Fedus, W., Zoph, B., and Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research, 23.