Learning Paths
Two paths for building an understanding of LLM research. Both assume familiarity with data science fundamentals (statistics, ML, linear algebra, supervised learning) but assume no prior deep learning or NLP background. The lightweight path gets you to a working understanding in a weekend; the paper path provides full depth.
Lightweight Path
Videos, blog posts, and hands-on exercises. Roughly 10-15 hours total.
Step 1: Visual Intuition (~2 hours)
- 3Blue1Brown’s neural network series (YouTube) — builds visual intuition for how neural networks, attention, and transformers work from first principles. No jargon assumed. Covers the core ideas behind self-attention and the transformer.
- Andrej Karpathy’s “Intro to Large Language Models” (~1hr YouTube) — high-level tour of the full LLM picture: pretraining, fine-tuning, rlhf, and emergent capabilities. Good mental map before going deeper.
Step 2: Build a Transformer (~2 hours)
- Andrej Karpathy’s “Let’s build GPT from scratch” (~2hr YouTube) — codes a small GPT from an empty file. Covers tokenization, self-attention, the training loop, and text generation. For building intuition, this single video replaces reading seq2seq, bahdanau-attention, attention-is-all-you-need, gpt-1, bert, and gpt-2. Follow along in a notebook if you can.
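The tokenization step from the video fits in a few lines. This is a character-level toy in the same spirit (production models use subword schemes like BPE, but the core idea — text to integer ids and back — is identical):

```python
# Character-level tokenizer: each unique character gets an integer id.
text = "hello gpt"
vocab = sorted(set(text))                      # unique chars, sorted for determinism
stoi = {ch: i for i, ch in enumerate(vocab)}   # string -> int
itos = {i: ch for ch, i in stoi.items()}       # int -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("hello")
assert decode(ids) == "hello"                  # round-trip is lossless
```

Real tokenizers merge frequent character pairs into a larger vocabulary; the video walks through why that tradeoff matters.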
Step 3: Key Blog Posts (~4 hours reading)
- Jay Alammar’s “The Illustrated Transformer” — visual step-by-step walkthrough of the transformer architecture with diagrams for every component (embeddings, positional-encoding, self-attention, feed-forward layers, residual connections).
- Lilian Weng’s blog posts (lilianweng.github.io):
  - “The Transformer Family” — covers the architecture and its variants (gpt, bert-architecture, encoder-decoder)
  - “Prompt Engineering” — covers in-context-learning and chain-of-thought
  - “RLHF” — covers the full alignment pipeline (rlhf, reward modeling, PPO) and connects to instructgpt
- Chip Huyen’s writing on LLMOps — bridges research and production; covers evaluation, serving, and fine-tuning from a practical ML engineering perspective.
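One component from the Illustrated Transformer worth coding yourself is positional-encoding. A NumPy sketch of the sinusoidal scheme (even dimensions get sin, odd get cos, at geometrically spaced frequencies):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings as illustrated in the blog post.
    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) token positions
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims
    pe[:, 1::2] = np.cos(angles)             # odd dims
    return pe

pe = sinusoidal_positions(seq_len=8, d_model=16)
print(pe.shape)   # (8, 16): one d_model-sized vector added to each token embedding
```

Each position gets a unique fingerprint, and nearby positions get similar vectors — the property the diagrams in the post are illustrating.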
Step 4: Hands-On (~4 hours)
- Hugging Face NLP Course (free, huggingface.co/learn/nlp-course) — walks through tokenization, using pretrained transformers, and fine-tuning with code. Familiar notebook-based workflow.
- Run a LoRA fine-tune — use the Hugging Face PEFT library to fine-tune a small open model (e.g., Qwen-2.5 or LLaMA 3) on a task you care about. This makes low-rank-adaptation, pretraining, and fine-tuning concrete. ~2-3 hours to get a working result.
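Before reaching for PEFT, it helps to see the arithmetic LoRA relies on. A NumPy sketch of the low-rank update (dimensions here are made up for illustration; the library handles all of this for you):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8                        # model dim, LoRA rank (r << d)

W = rng.normal(size=(d, d))           # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01    # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init -> update starts at 0

def forward(x, alpha=16):
    # LoRA: y = x W^T + (alpha/r) * x A^T B^T ; only A and B are trained
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

full = W.size            # parameters a full fine-tune would update
lora = A.size + B.size   # parameters LoRA actually trains
print(full // lora)      # 64x fewer at these toy dims; the ratio grows with d/r
```

At real model scale (d in the thousands, applied across many layers) the ratio is what makes fine-tuning feasible on a single GPU.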
Step 5: Deepen Selectively
Use the wiki concept and paper pages as reference. When a topic from the lightweight path interests you, read the corresponding paper for full depth. The paper path below provides a suggested order.
Paper Path
Reading the canonical papers directly. Roughly 40-60 hours total. Organized in five phases; within each phase the order matters.
Phase 1: Core Architecture
Read closely. These papers are short and self-contained.
- seq2seq — encoder-decoder intuition with RNNs; the problem setup that motivates everything after
- bahdanau-attention — introduces the attention mechanism to solve the fixed-length bottleneck
- attention-is-all-you-need — the transformer; the most important single paper in the lineage. Spend time on Section 3 (model architecture)
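The core equation of Section 3 — softmax(QKᵀ/√d_k)V — fits in a few lines of NumPy. A single-head sketch with no masking or learned projections:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention from the transformer paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries, dim 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out, w = attention(Q, K, V)
print(out.shape)              # (4, 8): each query gets a weighted mix of values
```

Everything else in the paper — multiple heads, masking, the √d_k scaling rationale — is elaboration on this one operation.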
Phase 2: The Pretraining Paradigm
Read selectively. The architectural innovations are small; the key ideas are the training objectives and what they unlock.
- gpt-1 — short paper; pretrain + fine-tune on a transformer decoder. Connects to transfer-learning you already know
- bert — contrast with GPT: same architecture, bidirectional training objective (masked LM). Dominated NLU benchmarks
- gpt-2 — skim; the key insight is that scale unlocks zero-shot ability without any fine-tuning
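The contrast between the two pretraining objectives is easy to see on a toy sentence (token strings here stand in for token ids):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# GPT-style causal LM: predict each token from everything to its left.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the"], "cat"), (["the", "cat"], "sat"), ...

# BERT-style masked LM: hide some tokens, predict them from BOTH sides.
masked = list(tokens)
masked[2] = "[MASK]"          # model sees left and right context
target = (2, "sat")           # position and original token to recover
```

The causal objective yields a generator; the masked objective yields a representation learner — which is why GPT generates text and BERT dominated NLU benchmarks.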
Phase 3: Scaling
Focus on the ideas, skim the extensive experiments.
- scaling-laws-neural-lm — power-law relationships between compute, data, and loss. The statistical framing will feel natural. Key takeaway: predictable returns on compute
- chinchilla — revises the above; data matters as much as parameters. Changed how labs allocate training budgets
- gpt-3 — read the intro and Sections 1-2. The few-shot evaluation framework (in-context-learning) is the contribution, not the architecture
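The power-law framing is concrete enough to compute. A sketch of the functional form, plus Chinchilla's commonly cited data rule — the constants below are illustrative, not the papers' exact fitted values:

```python
def loss(n_params, n_c=8.8e13, alpha=0.076):
    """Power-law loss vs. parameter count, L(N) = (Nc / N)**alpha.
    Constants are illustrative stand-ins for the paper's fits."""
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10]:
    print(f"{n:.0e} params -> loss {loss(n):.3f}")
# Each 10x in parameters buys a predictable, diminishing drop in loss.

def chinchilla_tokens(n_params, tokens_per_param=20):
    """Chinchilla's revision: scale data with parameters for a
    compute-optimal run (~20 tokens/param is the commonly cited rule)."""
    return n_params * tokens_per_param

print(f"{chinchilla_tokens(70e9):.1e}")   # ~1.4e12 tokens for a 70B model
```

Chinchilla itself was a 70B model trained on 1.4T tokens — matching the rule of thumb above.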
Phase 4: Making Models Useful
Read closely. This is where “base model” becomes “ChatGPT.”
- instructgpt — the rlhf pipeline in three concrete steps: SFT, reward model, PPO. A 1.3B aligned model beats 175B base GPT-3
- dpo — simplifies RLHF to a classification loss via reparameterization. Short paper, clean math
- constitutional-ai-paper — replaces human labels with AI feedback (constitutional-ai); a different philosophy of alignment
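The "clean math" of DPO reduces to one line: a logistic loss on the difference of implicit rewards, where reward = β·log(π(y|x)/π_ref(y|x)). A plain-Python sketch per preference pair (the log-probabilities here are made-up numbers):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss for a (chosen, rejected) response pair.
    logp_* are log-probs under the policy; ref_logp_* under the frozen reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))   # -log sigmoid(margin)

# Policy prefers the chosen response more than the reference does:
# positive margin, small loss. No reward model, no PPO rollouts.
print(dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0))
```

Gradient descent on this loss over a preference dataset replaces the entire RL stage of the instructgpt pipeline.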
Phase 5: Efficiency
Skim for intuition. These are more specialized.
- lora — low-rank-adaptation of weight matrices; you’ll appreciate the linear algebra (rank decomposition of weight updates, ~10,000x fewer trainable parameters)
- flash-attention-paper — hardware-aware algorithm design for self-attention; the IO-complexity argument is the key insight
- roformer-rope — positional-encoding via rotation matrices; elegant mathematical formulation
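RoPE's "rotation matrices" are concrete: each consecutive pair of dimensions is rotated by an angle proportional to the token's position. A single-vector NumPy sketch (real implementations operate on batched query/key tensors):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply a rotary position embedding to one vector at position `pos`.
    Assumes an even-dimensional x; one frequency per dimension pair."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # geometric frequency spectrum
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.arange(8, dtype=float)
r = rope(q, pos=3)
print(np.allclose(np.linalg.norm(r), np.linalg.norm(q)))   # rotations preserve norm
```

Because rotations preserve norms and dot products between two rotated vectors depend only on their position *difference*, attention scores become a function of relative position — the paper's elegant payoff.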
Papers to Skip Initially
- word2vec — you likely already know embeddings from data science work
- t5 — comprehensive but very long; its contribution (text-to-text framing) is covered by the blog posts
- switch-transformer — mixture-of-experts is a specialization; come back if you encounter MoE models later