High-Level Design

The GPT family uses a decoder-only transformer architecture trained autoregressively on next-token prediction. Each token attends only to preceding tokens via causal masking, making the model naturally suited for text generation. The core insight of GPT-1 was that unsupervised pretraining on a large corpus followed by supervised fine-tuning produces strong task performance, a paradigm that subsequent versions scaled dramatically.
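The causal mask described above can be sketched in a few lines of NumPy: scores for future positions are set to -inf before the softmax, so each row of the resulting attention matrix covers only the current and earlier tokens. This is a minimal illustration of the masking step only; scaling, multi-head logic, and value projection are omitted.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions in a (seq_len, seq_len) score matrix, then
    softmax each row so token i attends only to positions <= i."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Row-wise softmax; the -inf entries become exactly 0 after exp().
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

weights = causal_attention_weights(np.zeros((4, 4)))
# With uniform scores, row 0 is [1, 0, 0, 0]: the first token can only
# attend to itself, while the last row spreads weight over all 4 tokens.
```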

Key Components

  • Causal (masked) self-attention. Tokens can only attend to earlier positions, enforcing left-to-right generation order.
  • Learned positional embeddings. GPT-1, GPT-2, and GPT-3 use learned rather than sinusoidal positional embeddings, allowing the model to discover position representations during training.
  • Layer normalization. GPT-2 introduced pre-norm (layer norm before attention and feed-forward sub-layers), which became standard.
  • Byte-pair encoding tokenization. GPT-2 introduced a BPE tokenizer operating on raw bytes, so any string can be encoded without lossy preprocessing or unknown-token symbols.
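The pre-norm ordering in the list above can be sketched as follows. `attention` and `feed_forward` are placeholder callables standing in for the real sub-layers; the learnable layer-norm gain and bias are omitted, so this shows only the ordering of operations, not a full block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last dimension to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, attention, feed_forward):
    """GPT-2-style pre-norm: normalize *before* each sub-layer, then add
    the residual. (Post-norm instead applies the norm after the residual
    addition.)"""
    x = x + attention(layer_norm(x))
    x = x + feed_forward(layer_norm(x))
    return x
```

A useful property of this ordering is that with zero-output sub-layers the block reduces to the identity, which is part of why pre-norm stacks train more stably at depth.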

Variants

  • GPT-1 (2018, OpenAI): 117M parameters, 12 layers. Demonstrated that generative pretraining followed by discriminative fine-tuning transfers well across tasks.
  • GPT-2 (2019): 1.5B parameters, 48 layers. Showed that scaling alone enables zero-shot task performance. The full model was initially withheld over misuse concerns.
  • GPT-3 (2020): 175B parameters, 96 layers. Demonstrated in-context learning as an emergent capability: the model performs tasks from natural-language prompts without gradient updates.
  • GPT-4 (2023): multimodal (text and image input), frontier-level capabilities across reasoning, coding, and professional benchmarks. Architecture details not publicly disclosed.

Training Details

All GPT models are pretrained on large web-scraped text corpora using the language modeling objective (predict the next token). GPT-3 used a dataset blend of Common Crawl, WebText2, two book corpora (Books1 and Books2), and English Wikipedia totaling ~300B tokens. Training follows scaling laws, with larger models trained on proportionally more data.
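The next-token objective amounts to cross-entropy between each position's predicted distribution and the token that actually follows it. A minimal NumPy sketch, with batching, padding, and the engineering details of real training omitted:

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Average cross-entropy for next-token prediction.

    logits: (seq_len, vocab) model outputs at each position.
    tokens: (seq_len,) input token ids.
    Position t's logits are scored against the token at t+1, so the
    final position contributes no loss term.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = tokens[1:]                                  # shift left by one
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

# A model with uniform logits over a vocab of 5 incurs loss log(5) ≈ 1.609.
loss = next_token_loss(np.zeros((3, 5)), np.array([0, 1, 2]))
```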

Strengths and Weaknesses

Strengths. The decoder-only design is simple and scales efficiently. Autoregressive generation enables flexible, open-ended text production. In-context learning removes the need for task-specific fine-tuning in many settings.

Weaknesses. Causal attention means the model cannot condition on future context (unlike bidirectional encoders such as BERT). Generation is sequential at inference time, limiting throughput. Large models require enormous compute for both training and serving.
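The sequential bottleneck is visible in even the simplest decoding loop: each new token requires a fresh forward pass conditioned on everything generated so far. A greedy-decoding sketch, where `model` is a hypothetical callable returning next-token logits (real implementations cache attention keys and values rather than recomputing the whole prefix):

```python
def greedy_decode(model, prompt, n_new):
    """Append n_new tokens to prompt, one forward pass per token.

    model: callable mapping a token-id list to a list of next-token logits.
    The loop cannot be parallelized across output positions, since each
    step depends on the token chosen at the previous step.
    """
    tokens = list(prompt)
    for _ in range(n_new):
        logits = model(tokens)  # one sequential forward pass per token
        tokens.append(int(max(range(len(logits)), key=logits.__getitem__)))
    return tokens
```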

Notable Models

The GPT lineage directly includes GPT-1 through GPT-4 from OpenAI. The decoder-only autoregressive paradigm also underlies LLaMA, PaLM, Chinchilla, Claude, and most modern LLMs, making this the dominant architecture for generative language modeling.

Sources

  • Improving Language Understanding by Generative Pre-Training (Radford et al., 2018)
  • Language Models are Unsupervised Multitask Learners (Radford et al., 2019)
  • Language Models are Few-Shot Learners (Brown et al., 2020)
  • GPT-4 Technical Report (OpenAI, 2023)