High-Level Design
The encoder-decoder (or seq2seq) architecture processes an input sequence through an encoder network that produces a sequence of hidden representations, then feeds those representations to a decoder network that generates an output sequence autoregressively. This design naturally handles variable-length input-output mappings, making it the standard architecture for machine translation, summarization, and other sequence transduction tasks.
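The flow described above — encoder runs once over the full input, decoder generates autoregressively while attending to the encoder output — can be sketched with a deliberately toy model. Everything here (the embedding-lookup "encoder", the single-query "decoder step", the greedy loop) is a hypothetical stand-in for a real network; only the control flow mirrors the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D = 12, 8
emb = rng.normal(size=(VOCAB, D))        # shared toy embedding table

def encode(src_ids):
    # Toy "encoder": one vector per input token (here a plain embedding
    # lookup; a real encoder would be an RNN or transformer stack).
    return emb[src_ids]                  # (src_len, D)

def decode_step(enc_states, prev_ids):
    # Toy "decoder step": use the last output token's embedding as a
    # query over the encoder states, then score every vocabulary entry.
    q = emb[prev_ids[-1]]
    scores = enc_states @ q
    w = np.exp(scores - scores.max())
    context = (w / w.sum()) @ enc_states # soft summary of the input
    return emb @ context                 # (VOCAB,) logits

def greedy_decode(src_ids, bos=0, eos=1, max_len=6):
    enc_states = encode(src_ids)         # encoder runs exactly once
    out = [bos]
    for _ in range(max_len):             # decoder runs token by token
        out.append(int(np.argmax(decode_step(enc_states, out))))
        if out[-1] == eos:               # stop on end-of-sequence
            break
    return out
```

The key structural point is that `encode` is called once while `decode_step` is called per output token, each call conditioned on both the encoder states and the tokens emitted so far.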
Key Components
- Encoder. Reads the full input sequence and produces contextualized representations. In RNN-based models, this is typically a bidirectional LSTM; in the transformer, it is a stack of self-attention and feed-forward layers.
- Decoder. Generates output tokens one at a time, conditioned on both its own previous outputs and the encoder representations. Uses causal masking to prevent attending to future output positions.
- Cross-attention. The mechanism connecting encoder and decoder. In the transformer, each decoder layer includes a cross-attention sub-layer where queries come from the decoder and keys/values come from the encoder output.
- Attention. Bahdanau et al. (2015) introduced additive attention over encoder hidden states, allowing the decoder to focus on relevant input positions at each generation step. This removed the information bottleneck of compressing the entire input into a single fixed-length vector.
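The decoder-side causal mask and the encoder-decoder cross-attention described above can both be expressed with the same scaled dot-product attention. A minimal NumPy sketch (single head, no projections, illustrative shapes only):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d, src_len, tgt_len = 8, 5, 4
enc_out = rng.normal(size=(src_len, d))        # encoder output
dec_h = rng.normal(size=(tgt_len, d))          # decoder hidden states

# Causal self-attention: position t may attend only to positions <= t.
causal = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
self_out = attention(dec_h, dec_h, dec_h, mask=causal)

# Cross-attention: queries from the decoder, keys/values from the encoder.
cross_out = attention(self_out, enc_out, enc_out)
```

Note that cross-attention needs no mask: every decoder position may see the entire input, since the full source sequence is available before generation begins.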
Variants
- RNN-based (seq2seq, Sutskever et al. 2014): deep unidirectional LSTM encoder reading the input in reverse, unidirectional LSTM decoder, fixed-length context vector. Limited by the information bottleneck.
- Attention-augmented RNN (bahdanau-attention, 2015): added soft alignment over encoder states, dramatically improving performance on long sequences.
- Transformer encoder-decoder (attention-is-all-you-need, 2017): replaced recurrence entirely with self-attention and cross-attention. The original transformer used this layout with 6 encoder and 6 decoder layers.
- t5 (2020, Google): cast all NLP tasks as text-to-text problems using a unified encoder-decoder transformer, pretrained with a span-corruption objective.
- BART (2019, Facebook AI): denoising autoencoder combining a bidirectional encoder with an autoregressive decoder.
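The additive attention that distinguishes the bahdanau-attention variant from plain seq2seq scores each encoder state against the current decoder state through a small feed-forward network. A minimal sketch with arbitrary toy dimensions (all parameter names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D_DEC, D_ENC, D_ATT, SRC_LEN = 6, 4, 5, 7
W1 = rng.normal(size=(D_DEC, D_ATT))     # projects the decoder state
W2 = rng.normal(size=(D_ENC, D_ATT))     # projects each encoder state
v = rng.normal(size=D_ATT)               # scoring vector

def additive_attention(dec_state, enc_states):
    # score(s, h_i) = v^T tanh(W1 s + W2 h_i), one score per input position
    scores = np.tanh(dec_state @ W1 + enc_states @ W2) @ v
    e = np.exp(scores - scores.max())
    weights = e / e.sum()                # soft alignment over the input
    return weights @ enc_states, weights # context vector + alignment

dec_state = rng.normal(size=D_DEC)
enc_states = rng.normal(size=(SRC_LEN, D_ENC))
context, align = additive_attention(dec_state, enc_states)
```

The `align` weights are recomputed at every decoder step, which is exactly what lets the model attend to different input positions as generation proceeds.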
Training Details
Encoder-decoder models for translation are typically trained with teacher forcing, where the decoder receives the ground-truth previous token at each step rather than its own prediction. The transformer variant uses the Adam optimizer with a learning rate that warms up linearly and then decays with the inverse square root of the step count. T5 was pretrained on the C4 corpus with a span-corruption objective where random text spans are replaced with sentinel tokens.
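The span-corruption objective can be made concrete with a small sketch. For clarity the corrupted spans are passed in explicitly here; in actual T5 pretraining the span locations and lengths are sampled randomly. Each span is replaced by a sentinel in the encoder input, and the decoder target reconstructs each sentinel's original tokens:

```python
def span_corrupt(tokens, spans):
    # Replace each (start, length) span with a sentinel token; the target
    # pairs each sentinel with the original tokens it replaced.
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans))]
    src, tgt, prev = [], [], 0
    for sent, (start, length) in zip(sentinels, spans):
        src += tokens[prev:start] + [sent]       # keep text, drop the span
        tgt += [sent] + tokens[start:start + length]
        prev = start + length
    return src + tokens[prev:], tgt

tokens = "the quick brown fox jumps over the lazy dog".split()
src, tgt = span_corrupt(tokens, spans=[(1, 2), (5, 1)])
# src: the <extra_id_0> fox jumps <extra_id_1> the lazy dog
# tgt: <extra_id_0> quick brown <extra_id_1> over
```

The model is then trained with ordinary teacher forcing on the (src, tgt) pair, so the same encoder-decoder machinery used for translation serves pretraining unchanged.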
Strengths and Weaknesses
Strengths. Clean separation of input understanding (encoder) and output generation (decoder). Cross-attention provides a principled mechanism for the decoder to selectively access input information. Well-suited for tasks with distinct input and output sequences.
Weaknesses. More complex than decoder-only models, requiring both encoder and decoder forward passes. For pure language modeling, decoder-only architectures (gpt) have proven more parameter-efficient. The encoder-decoder split is unnecessary when input and output share the same modality and format.
Notable Models
The encoder-decoder design underlies the original transformer, t5, BART, mBART, and many machine translation systems. While decoder-only models now dominate general-purpose language modeling, encoder-decoder architectures remain strong for structured generation tasks like translation and summarization.