High-Level Design
The GPT family uses a decoder-only transformer architecture trained autoregressively on next-token prediction. Each token attends only to preceding tokens via causal masking, making the model naturally suited for text generation. The core insight of GPT-1 was that unsupervised pretraining on a large corpus followed by supervised fine-tuning produces strong task performance, a paradigm that subsequent versions scaled dramatically.
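The next-token objective can be made concrete with a minimal sketch. This assumes the model has already produced a matrix of logits, one row per position; the function name and argument shapes are illustrative, not any particular library's API.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Average cross-entropy of predicting token t+1 from position t.

    logits: (T, V) scores over a V-token vocabulary (hypothetical model output).
    tokens: (T+1,) integer token ids; the targets are the inputs shifted by one.
    """
    targets = tokens[1:]  # position t is trained to predict token t+1
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # pick out the log-probability assigned to each correct next token
    return -log_probs[np.arange(len(targets)), targets].mean()
```

Shifting the targets by one position is the whole trick: the same sequence serves as both input and label, which is what makes pretraining "unsupervised."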
Key Components
- Causal (masked) self-attention. Tokens can only attend to earlier positions, enforcing left-to-right generation order.
- Learned positional embeddings. GPT-1, GPT-2, and GPT-3 use learned rather than sinusoidal positional encodings, allowing the model to discover position representations during training.
- Layer normalization. GPT-2 moved layer normalization to the input of each sub-block (pre-norm, applied before the attention and feed-forward sub-layers), a choice that became standard in later large models.
- Byte-pair encoding tokenization. GPT-2 introduced a BPE tokenizer operating on raw bytes, so any string can be encoded without out-of-vocabulary tokens or lossy preprocessing.
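The causal-masking component above can be sketched as a single attention head in NumPy. The weight matrices and function name are illustrative; real implementations are multi-headed, batched, and use per-head dimensions.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention (minimal sketch, illustrative names).

    x: (T, d) token representations; w_q, w_k, w_v: (d, d) projections.
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d)          # (T, T) scaled dot-product scores
    # causal mask: position t may only attend to positions <= t
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                 # masked scores get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v
```

Because row 0 of the mask hides every later position, the first token's output is exactly its own value projection, and no token's output changes if tokens after it change.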
Variants
- GPT-1 (2018, OpenAI): 117M parameters, 12 layers. Demonstrated that generative pretraining followed by discriminative fine-tuning transfers well across tasks.
- GPT-2 (2019): 1.5B parameters, 48 layers. Showed that scaling alone enables zero-shot task performance. The full model was initially withheld due to misuse concerns.
- GPT-3 (2020): 175B parameters, 96 layers. Demonstrated in-context learning as an emergent capability, where the model performs tasks from natural-language prompts without gradient updates.
- GPT-4 (2023): multimodal (text and image input), frontier-level capabilities across reasoning, coding, and professional benchmarks. Architecture details not publicly disclosed.
Training Details
All GPT models are pretrained on large web-scraped text corpora using the language-modeling objective (predict the next token). GPT-3 used a dataset blend of Common Crawl, WebText2, books corpora, and Wikipedia totaling ~300B tokens. Training follows empirical scaling laws, with larger models trained on proportionally more data.
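A common rule of thumb from the scaling-law literature (an assumption here, not stated in this document) estimates training compute as C ≈ 6·N·D FLOPs for N parameters and D training tokens, since a forward-plus-backward pass costs roughly 6 FLOPs per parameter per token. The helper below is a hypothetical sketch of that estimate.

```python
def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: C ~ 6 * N * D.

    Forward + backward together cost roughly 6 FLOPs per parameter
    per token; this ignores attention's quadratic term and is only
    an order-of-magnitude estimate.
    """
    return 6 * n_params * n_tokens

# At GPT-3 scale (175B parameters, ~300B tokens) this gives ~3.15e23 FLOPs.
```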
Strengths and Weaknesses
Strengths. The decoder-only design is simple and scales efficiently. Autoregressive generation enables flexible, open-ended text production. In-context learning removes the need for task-specific fine-tuning in many settings.
Weaknesses. Causal attention means the model cannot condition on future context (unlike bidirectional, BERT-style architectures). Generation is sequential at inference time, limiting throughput. Large models require enormous compute for both training and serving.
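The sequential-generation bottleneck follows from the decoding loop itself: each new token requires a full forward pass conditioned on everything generated so far. A minimal greedy-decoding sketch, where `logits_fn` is a stand-in for a trained model mapping a token-id sequence to next-token logits:

```python
import numpy as np

def generate_greedy(logits_fn, prompt, max_new_tokens, eos_id=None):
    """Sequential autoregressive decoding (illustrative sketch).

    logits_fn: callable taking a list of token ids and returning a
               (V,) array of next-token logits (hypothetical model).
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits_fn(tokens)))  # greedy: pick the argmax
        tokens.append(next_id)                       # feed it back in as context
        if eos_id is not None and next_id == eos_id:
            break                                    # stop at end-of-sequence
    return tokens
```

Each iteration depends on the previous one, so the loop cannot be parallelized across output positions; real systems mitigate (but do not remove) this cost with KV caching and batching.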
Notable Models
The GPT lineage directly includes GPT-1 through GPT-4 from OpenAI. The decoder-only autoregressive paradigm also underlies LLaMA, PaLM, Chinchilla, Claude, and most modern LLMs, making it the dominant architecture for generative language modeling.