Bahdanau, Cho, and Bengio (2015) introduce the attention mechanism for neural machine translation, allowing the decoder to dynamically focus on different parts of the source sentence at each generation step rather than relying on a single fixed-length vector. This addressed the key bottleneck of the seq2seq encoder-decoder architecture and became one of the most influential innovations in deep learning.

Problem

Existing encoder-decoder models for neural machine translation compress the entire source sentence into a single fixed-length vector. This bottleneck causes performance to degrade rapidly on longer sentences, as the model cannot preserve all necessary information in a fixed representation.

Key Contribution

A soft alignment mechanism (later called “attention”) that computes a weighted combination of encoder hidden states at each decoding step. Rather than encoding the source into one vector, the model encodes it as a sequence of annotation vectors (using a bidirectional RNN) and lets the decoder learn which annotations are most relevant for each target word.
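The annotation sequence can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the simple tanh RNN and all weight names (`Wf`, `Uf`, `Wb`, `Ub`) are assumptions standing in for the paper's gated (GRU-like) units.

```python
import numpy as np

def rnn_pass(xs, W, U, h0):
    # Run a simple tanh RNN over a sequence of input vectors,
    # returning the hidden state at every position.
    hs, h = [], h0
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        hs.append(h)
    return hs

def encode(xs, Wf, Uf, Wb, Ub, n):
    # Forward pass reads the sentence left-to-right,
    # backward pass reads it right-to-left (then re-reversed to align).
    h0 = np.zeros(n)
    fwd = rnn_pass(xs, Wf, Uf, h0)
    bwd = rnn_pass(xs[::-1], Wb, Ub, h0)[::-1]
    # Annotation h_j = [fwd_j ; bwd_j]: each position keeps a summary
    # of the words both before and after it.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Because each annotation concatenates both directions, it carries context from the whole sentence while remaining focused on the words around position j.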

Method

The encoder is a bidirectional RNN that produces an annotation vector for each source position by concatenating forward and backward hidden states. At each decoding step, the decoder computes alignment scores between its current hidden state and all encoder annotations using a learned feedforward network. These scores are normalized via softmax to produce attention weights, which are used to compute a context vector as a weighted sum of annotations. The context vector, combined with the previous target word and decoder state, conditions the prediction of the next target word.
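One decoding step of the additive ("Bahdanau") attention described above can be sketched as follows. This is a minimal NumPy illustration of the score-softmax-context pipeline; the weight names `W_a`, `U_a`, `v_a` follow the paper's alignment-model notation, but the shapes and the isolated function are assumptions for exposition.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, annotations, W_a, U_a, v_a):
    """One attention step: score every source annotation against
    the previous decoder state, then take the weighted sum.

    s_prev:      previous decoder hidden state, shape (n,)
    annotations: encoder annotation vectors, shape (T, 2n)
    """
    # Alignment scores e_j = v_a^T tanh(W_a s_prev + U_a h_j),
    # computed for all source positions j at once via broadcasting.
    scores = np.tanh(s_prev @ W_a.T + annotations @ U_a.T) @ v_a
    alpha = softmax(scores)           # attention weights, sum to 1
    context = alpha @ annotations     # weighted sum of annotations
    return context, alpha
```

The context vector changes at every decoding step because `s_prev` changes, which is exactly what lets the decoder re-focus on different source positions for each target word.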

Main Results

On English-to-French translation (WMT’14), the attention-based model RNNsearch-50 (trained on sentences of up to 50 words) achieved performance comparable to the state-of-the-art phrase-based SMT system, substantially outperforming the basic encoder-decoder (RNNencdec). The improvement was especially pronounced on longer sentences, where the basic model’s performance collapsed. Qualitative analysis showed that the learned alignments closely matched human-intuitive word correspondences.

Limitations

The additive attention mechanism has cost quadratic in sequence length, since each decoder step scores every encoder position (O(T·T′) score computations for a source of length T′ and a target of length T). The model still relies on RNNs, which limits parallelism across time steps during training, and the bidirectional encoder adds further computational cost.

Impact

This paper introduced the attention mechanism that became ubiquitous in NLP and beyond. It directly inspired self-attention and the transformer architecture (attention-is-all-you-need), which replaced the RNN components entirely with attention. The concept of soft alignment generalized to computer vision (image captioning, object detection), speech recognition, and virtually every area of deep learning.

Sources

  • Neural Machine Translation by Jointly Learning to Align and Translate