The Road to Modern LLMs

This wiki traces the canonical research lineage from early neural language representations to today’s state-of-the-art large language models. The story unfolds across four eras.

Foundations (2013-2016)

The modern LLM era begins with word-embeddings. word2vec (2013) showed that a simple neural network could learn dense vector representations capturing semantic relationships; the famous example is that the vector arithmetic king - man + woman lands near queen. seq2seq (2014) introduced the encoder-decoder framework for sequence transduction with RNNs. bahdanau-attention (2015) relieved the fixed-length context-vector bottleneck by introducing the attention mechanism, letting the decoder dynamically focus on relevant input positions.

The Transformer Revolution (2017-2019)

attention-is-all-you-need (2017) replaced recurrence entirely with self-attention, creating the transformer architecture. This enabled massive parallelization during training. Two pretraining paradigms emerged: gpt-1 (2018) showed that autoregressive pretraining on unlabeled text followed by fine-tuning yielded strong NLU performance, while bert (2018) demonstrated bidirectional pretraining via masked language modeling. gpt-2 (2019) revealed that scaling the gpt architecture produced emergent zero-shot capabilities without any fine-tuning.
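The core operation, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, is compact enough to sketch. A minimal single-head NumPy illustration (names and dimensions are illustrative; no masking or multi-head machinery):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token representations.
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)       # each row is a distribution over positions
    return weights @ V                       # each output is a weighted mix of values

rng = np.random.default_rng(0)
d_model, d_k, n = 8, 4, 5
X = rng.normal(size=(n, d_model))
out = self_attention(X, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(out.shape)  # (5, 4)
```

Because every position attends to every other position in one matrix product, the whole sequence is processed in parallel, which is exactly what recurrence prevented.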

Scaling Era (2020-2021)

scaling-laws-neural-lm (2020) formalized the power-law relationship between compute, data, model size, and loss. gpt-3 (2020) demonstrated in-context-learning at 175B parameters, performing new tasks from a few examples supplied in the prompt, with no gradient updates. t5 (2020) recast every NLP task as text-to-text generation. The infrastructure for scaling advanced with switch-transformer (2021), which used mixture-of-experts to scale to trillions of parameters with constant per-example compute, and roformer-rope (2021), whose rotary positional-encoding became standard. lora (2021) made fine-tuning accessible by introducing low-rank-adaptation.
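The mechanism behind low-rank-adaptation is simple enough to sketch. In this toy NumPy illustration (the dimensions, rank, and alpha value are assumptions chosen for the example, not the paper's prescriptions), a frozen weight W is perturbed by a trainable rank-r product B @ A, so only r·(d + k) parameters are updated instead of d·k:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8               # full weight is d x k; adapter rank r << min(d, k)
alpha = 16                          # LoRA scaling hyperparameter

W = rng.normal(size=(d, k))         # pretrained weight, kept frozen
A = rng.normal(size=(r, k)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init so BA starts at 0

def adapted_forward(x):
    # Effective weight is W + (alpha / r) * B @ A, but the d x k update
    # is never materialized: we apply the two small factors in sequence.
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

x = rng.normal(size=(2, k))
# With B = 0, the adapter is a no-op: outputs match the frozen model exactly.
assert np.allclose(adapted_forward(x), x @ W.T)

full, lora = d * k, r * (d + k)
print(f"trainable params: {lora} vs {full} ({lora / full:.1%})")
# trainable params: 8192 vs 262144 (3.1%)
```

Zero-initializing B means training starts from the pretrained model's behavior and only gradually departs from it, which is part of what makes the method stable.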

Alignment and Deployment (2022-2023)

The focus shifted from raw capability to usability and safety. instructgpt (2022) introduced rlhf to align models with human intent, showing that human raters preferred outputs from a 1.3B aligned model over those of the 175B base GPT-3. chinchilla (2022) revised scaling-laws, arguing that parameters and training tokens should scale in equal proportion, which implied most large models of the time were undertrained. chain-of-thought-paper (2022) elicited multi-step reasoning via chain-of-thought prompting. flash-attention-paper (2022) attacked the transformer’s O(n^2) attention bottleneck with IO-aware flash-attention, computing exact attention without materializing the full attention matrix. constitutional-ai-paper (2022) proposed constitutional-ai as an alternative to human-labeled alignment data. llama (2023) democratized LLM research by releasing competitive open-weight models. gpt-4-technical-report (2023) pushed the frontier with multimodal capabilities and predictable scaling. dpo (2023) simplified alignment with direct-preference-optimization, replacing the explicit reward model and reinforcement-learning loop with a single classification-style loss on preference pairs.
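The direct-preference-optimization objective itself is a logistic loss on the scaled difference between how much the policy and a frozen reference model favor a preferred response over a rejected one. A minimal per-pair sketch (function name and the numbers in the checks are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_w, logp_l         : policy log-probs of the chosen (w) and rejected (l) responses
    ref_logp_w, ref_logp_l : the same quantities under the frozen reference model
    beta                   : strength of the implicit KL penalty toward the reference
    """
    # The implicit reward of a response is beta * (log pi - log pi_ref);
    # the margin is the chosen response's reward minus the rejected one's.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: minimized by widening the gap
    # between chosen and rejected beyond what the reference already has.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response more than the reference does:
# positive margin, loss below log(2).
assert dpo_loss(-10.0, -14.0, -12.0, -12.0) < math.log(2.0)
# Policy identical to the reference: zero margin, loss exactly log(2).
assert abs(dpo_loss(-12.0, -12.0, -12.0, -12.0) - math.log(2.0)) < 1e-12
```

Because the loss is an ordinary supervised objective over static preference pairs, it needs no reward model, no sampling during training, and no RL machinery.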

Key Themes

  • Scale as a driver of capability: Many abilities (in-context learning, chain-of-thought reasoning) emerge only at sufficient scale.
  • The pretraining paradigm: Unsupervised pretraining followed by task adaptation has been the dominant approach since 2018.
  • Alignment as a distinct challenge: Raw capability is insufficient; RLHF, Constitutional AI, and DPO represent successive approaches to making models helpful and safe.
  • Efficiency innovations: FlashAttention, MoE, LoRA, and RoPE made it practical to train and adapt ever-larger models.
  • Open vs. closed: The tension between open-weight releases (LLaMA) and proprietary models (GPT-4) shapes the field’s research dynamics.

Labs

The major contributors span openai (GPT series, scaling laws, RLHF), google-brain (Transformer, Word2Vec, T5, Switch Transformer), google-deepmind (Chinchilla), meta-ai (LLaMA), and anthropic (Constitutional AI).