Sutskever, Vinyals, and Le (2014) demonstrate that a deep LSTM encoder-decoder architecture can directly map variable-length input sequences to variable-length output sequences, achieving competitive machine translation quality without task-specific engineering. This work established the sequence-to-sequence paradigm that became the foundation for neural machine translation and later influenced the transformer architecture.
Problem
Standard deep neural networks require fixed-dimensionality inputs and outputs, making them unable to handle sequence transduction tasks (e.g., machine translation) where input and output lengths vary and have complex, non-monotonic relationships.
Key Contribution
An end-to-end encoder-decoder framework using two separate deep LSTMs: one encodes the source sequence into a fixed-dimensional vector, and the other decodes that vector into the target sequence. A key technical insight is that reversing the source sentence order substantially improves performance by introducing short-term dependencies between corresponding source and target words.
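The reversal trick is simple to state in code. This is an illustrative sketch (a hypothetical helper, not the authors' code): reversing the source puts its first words closest to the decoder's first steps, shortening the dependency between corresponding source and target words at the start of the sentence.

```python
def prepare_pair(source_tokens, target_tokens):
    """Reverse the source sequence before encoding, as in Sutskever et al. (2014).

    Only the source side is reversed; the target is generated in its
    natural order.
    """
    return list(reversed(source_tokens)), target_tokens

# Toy example: source "a b c" aligned to target "A B C".
src, tgt = prepare_pair(["a", "b", "c"], ["A", "B", "C"])
print(src)  # ['c', 'b', 'a'] -- "a" is now read last, right before the
            # decoder must emit its translation "A"
```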
Method
The encoder LSTM reads the input sequence (in reversed word order) and produces a fixed-dimensional hidden state. The decoder LSTM, initialized with this state, generates the output sequence token by token using beam search. The model uses 4-layer deep LSTMs with 1,000-dimensional hidden states, a 160,000-word source vocabulary, and an 80,000-word target vocabulary. An ensemble of 5 LSTMs (384M parameters each) was used for best results.
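The beam-search decoding step can be sketched as follows. This is a minimal left-to-right beam search over a toy next-token model, not the paper's implementation; `step_logprobs` is a hypothetical stand-in for the decoder LSTM's output distribution.

```python
import math

def beam_search(step_logprobs, beam_size, max_len, bos="<s>", eos="</s>"):
    """Keep the `beam_size` highest-scoring partial hypotheses at each step;
    a hypothesis is finished when it emits the end-of-sentence token."""
    beams = [([bos], 0.0)]  # (token prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix).items():
                cand = (prefix + [tok], score + lp)
                (finished if tok == eos else candidates).append(cand)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy "decoder": strongly prefers "hello", then ends the sentence.
def toy_model(prefix):
    if prefix[-1] == "hello":
        return {"</s>": math.log(0.9), "hello": math.log(0.1)}
    return {"hello": math.log(0.8), "world": math.log(0.2)}

best, score = beam_search(toy_model, beam_size=2, max_len=5)
print(best)  # ['<s>', 'hello', '</s>']
```

The paper reports that even a small beam (size 2) captures most of the benefit, which this greedy-plus-backup structure makes plausible: wider beams only help when a locally worse token leads to a globally better sentence.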
Main Results
On WMT’14 English-to-French, the ensemble achieved 34.81 BLEU, outperforming a phrase-based SMT baseline (33.30 BLEU). When used to rescore the SMT system’s 1000-best list, BLEU rose to 36.5, approaching the previous best of 37.0. Reversing source sentences improved single-model test BLEU from 25.9 to 30.6. The model handled long sentences well despite the fixed-length bottleneck.
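The 1000-best rescoring step amounts to averaging each SMT hypothesis's score with the LSTM's log-probability and re-ranking. A minimal sketch (the field names and the toy scoring function are hypothetical, not from the paper):

```python
def rescore_nbest(nbest, lstm_logprob):
    """Re-rank an SMT n-best list by averaging each hypothesis's SMT score
    with the LSTM's log-probability for that hypothesis."""
    return max(nbest,
               key=lambda h: 0.5 * (h["smt_score"] + lstm_logprob(h["text"])))

# Toy n-best list: the SMT system slightly prefers the first hypothesis,
# but the (stand-in) LSTM strongly prefers the second.
nbest = [
    {"text": "the cat sat", "smt_score": -2.0},
    {"text": "the cat sits", "smt_score": -2.5},
]
best = rescore_nbest(nbest, lambda t: -1.0 if t == "the cat sits" else -3.0)
print(best["text"])  # the cat sits
```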
Limitations
Compressing the entire source sentence into a single fixed-length vector creates an information bottleneck, particularly for long sentences. The model uses a limited vocabulary (80k words), penalizing translations with out-of-vocabulary words. Decoding is sequential and slow.
Impact
This paper established the encoder-decoder paradigm for neural sequence transduction and directly motivated the Bahdanau attention mechanism, which addresses the fixed-length bottleneck. The seq2seq framework was later applied to summarization, dialogue, code generation, and many other tasks, ultimately evolving into the transformer-based architectures that dominate today.