Learning Paths
Two paths for building an understanding of LLM research. Both assume familiarity with data science fundamentals (statistics, ML, linear algebra, supervised learning) but assume no prior deep learning or NLP background. The lightweight path gets you to a working understanding in a weekend; the paper path provides full depth.
Lightweight Path
Videos, blog posts, and hands-on exercises. Roughly 10-15 hours total.
Step 1: Visual Intuition (~2 hours)
- 3Blue1Brown’s neural network series (YouTube) — builds visual intuition for how neural networks, attention, and transformers work from first principles. No jargon assumed. Covers the core ideas behind self-attention and the transformer.
- Andrej Karpathy’s “Intro to Large Language Models” (~1hr YouTube) — high-level tour of the full LLM picture: pretraining, fine-tuning, rlhf, and emergent capabilities. Good mental map before going deeper.
Step 2: Build a Transformer (~2 hours)
- Andrej Karpathy’s “Let’s build GPT from scratch” (~2hr YouTube) — codes a small GPT from an empty file. Covers tokenization, self-attention, the training loop, and text generation. For building intuition, this single video replaces reading seq2seq, bahdanau-attention, attention-is-all-you-need, gpt-1, bert, and gpt-2. Follow along in a notebook if you can.
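The tokenization step from the video fits in a few lines. This is a character-level toy in the same spirit (production models use subword schemes like BPE, but the core idea — text to integer ids and back — is identical):

```python
# Character-level tokenizer: each unique character gets an integer id.
text = "hello gpt"
vocab = sorted(set(text))                      # unique chars, sorted for determinism
stoi = {ch: i for i, ch in enumerate(vocab)}   # string -> int
itos = {i: ch for ch, i in stoi.items()}       # int -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("hello")
assert decode(ids) == "hello"                  # round-trip is lossless
```

Real tokenizers merge frequent character pairs into a larger vocabulary; the video walks through why that tradeoff matters.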
Step 3: Key Blog Posts (~4 hours reading)
- Jay Alammar’s “The Illustrated Transformer” — visual step-by-step walkthrough of the transformer architecture with diagrams for every component (embeddings, positional-encoding, self-attention, feed-forward layers, residual connections).
- Lilian Weng’s blog posts (lilianweng.github.io):
  - “The Transformer Family” — covers the architecture and its variants (gpt, bert-architecture, encoder-decoder)
  - “Prompt Engineering” — covers in-context-learning and chain-of-thought
  - “RLHF” — covers the full alignment pipeline (rlhf, reward modeling, PPO) and connects to instructgpt
- Chip Huyen’s writing on LLMOps — bridges research and production; covers evaluation, serving, and fine-tuning from a practical ML engineering perspective.
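One component from the Illustrated Transformer worth coding yourself is positional-encoding. A NumPy sketch of the sinusoidal scheme (even dimensions get sin, odd get cos, at geometrically spaced frequencies):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings as illustrated in the blog post.
    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) token positions
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2) dimension pairs
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims
    pe[:, 1::2] = np.cos(angles)             # odd dims
    return pe

pe = sinusoidal_positions(seq_len=8, d_model=16)
print(pe.shape)   # (8, 16): one d_model-sized vector added to each token embedding
```

Each position gets a unique fingerprint, and nearby positions get similar vectors — the property the diagrams in the post are illustrating.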
Step 4: Hands-On (~4 hours)
- Hugging Face NLP Course (free, huggingface.co/learn/nlp-course) — walks through tokenization, using pretrained transformers, and fine-tuning with code. Familiar notebook-based workflow.
- Run a LoRA fine-tune — use the Hugging Face PEFT library to fine-tune a small open model (e.g., Qwen-2.5 or LLaMA 3) on a task you care about. This makes low-rank-adaptation, pretraining, and fine-tuning concrete. ~2-3 hours to get a working result.
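Before reaching for PEFT, it helps to see the arithmetic LoRA relies on. A NumPy sketch of the low-rank update (dimensions here are made up for illustration; the library handles all of this for you):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 8                        # model dim, LoRA rank (r << d)

W = rng.normal(size=(d, d))           # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01    # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init -> update starts at 0

def forward(x, alpha=16):
    # LoRA: y = x W^T + (alpha/r) * x A^T B^T ; only A and B are trained
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

full = W.size            # parameters a full fine-tune would update
lora = A.size + B.size   # parameters LoRA actually trains
print(full // lora)      # 64x fewer at these toy dims; the ratio grows with d/r
```

At real model scale (d in the thousands, applied across many layers) the ratio is what makes fine-tuning feasible on a single GPU.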
Step 5: Deepen Selectively
Use the wiki concept and paper pages as reference. When a topic from the lightweight path interests you, read the corresponding paper for full depth. The paper path below provides a suggested order.
Paper Path
Reading the canonical papers directly. Roughly 40-60 hours total. Organized in five phases; within each phase the order matters.
Phase 1: Core Architecture
Read closely. These papers are short and self-contained.
- seq2seq — encoder-decoder intuition with RNNs; the problem setup that motivates everything after
- bahdanau-attention — introduces the attention mechanism to solve the fixed-length bottleneck
- attention-is-all-you-need — the transformer; the most important single paper in the lineage. Spend time on Section 3 (model architecture)
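The core equation of Section 3 — softmax(QKᵀ/√d_k)V — fits in a few lines of NumPy. A single-head sketch with no masking or learned projections:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention from the transformer paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 queries, dim 8
K = rng.normal(size=(6, 8))   # 6 keys
V = rng.normal(size=(6, 8))   # 6 values
out, w = attention(Q, K, V)
print(out.shape)              # (4, 8): each query gets a weighted mix of values
```

Everything else in the paper — multiple heads, masking, the √d_k scaling rationale — is elaboration on this one operation.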
Phase 2: The Pretraining Paradigm
Read selectively. The architectural innovations are small; the key ideas are the training objectives and what they unlock.
- gpt-1 — short paper; pretrain + fine-tune on a transformer decoder. Connects to transfer-learning you already know
- bert — contrast with GPT: same architecture, bidirectional training objective (masked LM). Dominated NLU benchmarks
- gpt-2 — skim; the key insight is that scale unlocks zero-shot ability without any fine-tuning
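The contrast between the two pretraining objectives is easy to see on a toy sentence (token strings here stand in for token ids):

```python
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# GPT-style causal LM: predict each token from everything to its left.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the"], "cat"), (["the", "cat"], "sat"), ...

# BERT-style masked LM: hide some tokens, predict them from BOTH sides.
masked = list(tokens)
masked[2] = "[MASK]"          # model sees left and right context
target = (2, "sat")           # position and original token to recover
```

The causal objective yields a generator; the masked objective yields a representation learner — which is why GPT generates text and BERT dominated NLU benchmarks.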
Phase 3: Scaling
Focus on the ideas, skim the extensive experiments.
- scaling-laws-neural-lm — power-law relationships between compute, data, and loss. The statistical framing will feel natural. Key takeaway: predictable returns on compute
- chinchilla — revises the above; data matters as much as parameters. Changed how labs allocate training budgets
- gpt-3 — read the intro and Sections 1-2. The few-shot evaluation framework (in-context-learning) is the contribution, not the architecture
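The power-law framing is concrete enough to compute. A sketch of the functional form, plus Chinchilla's commonly cited data rule — the constants below are illustrative, not the papers' exact fitted values:

```python
def loss(n_params, n_c=8.8e13, alpha=0.076):
    """Power-law loss vs. parameter count, L(N) = (Nc / N)**alpha.
    Constants are illustrative stand-ins for the paper's fits."""
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10]:
    print(f"{n:.0e} params -> loss {loss(n):.3f}")
# Each 10x in parameters buys a predictable, diminishing drop in loss.

def chinchilla_tokens(n_params, tokens_per_param=20):
    """Chinchilla's revision: scale data with parameters for a
    compute-optimal run (~20 tokens/param is the commonly cited rule)."""
    return n_params * tokens_per_param

print(f"{chinchilla_tokens(70e9):.1e}")   # ~1.4e12 tokens for a 70B model
```

Chinchilla itself was a 70B model trained on 1.4T tokens — matching the rule of thumb above.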
Phase 4: Making Models Useful
Read closely. This is where “base model” becomes “ChatGPT.”
- instructgpt — the rlhf pipeline in three concrete steps: SFT, reward model, PPO. A 1.3B aligned model beats 175B base GPT-3
- dpo — simplifies RLHF to a classification loss via reparameterization. Short paper, clean math
- constitutional-ai-paper — replaces human labels with AI feedback (constitutional-ai); a different philosophy of alignment
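The "clean math" of DPO reduces to one line: a logistic loss on the difference of implicit rewards, where reward = β·log(π(y|x)/π_ref(y|x)). A plain-Python sketch per preference pair (the log-probabilities here are made-up numbers):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss for a (chosen, rejected) response pair.
    logp_* are log-probs under the policy; ref_logp_* under the frozen reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))   # -log sigmoid(margin)

# Policy prefers the chosen response more than the reference does:
# positive margin, small loss. No reward model, no PPO rollouts.
print(dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-6.0))
```

Gradient descent on this loss over a preference dataset replaces the entire RL stage of the instructgpt pipeline.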
Phase 5: Efficiency
Skim for intuition. These are more specialized.
- lora — low-rank-adaptation of weight matrices; you’ll appreciate the linear algebra (rank decomposition of weight updates, ~10,000x fewer trainable parameters)
- flash-attention-paper — hardware-aware algorithm design for self-attention; the IO-complexity argument is the key insight
- roformer-rope — positional-encoding via rotation matrices; elegant mathematical formulation
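RoPE's "rotation matrices" are concrete: each consecutive pair of dimensions is rotated by an angle proportional to the token's position. A single-vector NumPy sketch (real implementations operate on batched query/key tensors):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply a rotary position embedding to one vector at position `pos`.
    Assumes an even-dimensional x; one frequency per dimension pair."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # geometric frequency spectrum
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

q = np.arange(8, dtype=float)
r = rope(q, pos=3)
print(np.allclose(np.linalg.norm(r), np.linalg.norm(q)))   # rotations preserve norm
```

Because rotations preserve norms and dot products between two rotated vectors depend only on their position *difference*, attention scores become a function of relative position — the paper's elegant payoff.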
Papers to Skip Initially
- word2vec — you likely already know embeddings from data science work
- t5 — comprehensive but very long; its contribution (text-to-text framing) is covered by the blog posts
- switch-transformer — mixture-of-experts is a specialization; come back if you encounter MoE models later