AI Research Wiki Index

Guides

  • learning-paths — two approaches to learning LLM research: lightweight (videos/blogs/hands-on) and paper-based

Concepts

Representation and Attention

  • word-embeddings — dense vector representations of words (Word2Vec, GloVe)
  • attention — mechanism for dynamically focusing on relevant input positions
  • self-attention — attention within a single sequence; core of the transformer
  • positional-encoding — injecting sequence order into transformers (sinusoidal, RoPE, ALiBi)

Training Paradigms

  • transfer-learning — using knowledge from pretraining for downstream tasks
  • pretraining — self-supervised training on large corpora before task adaptation
  • fine-tuning — adapting pretrained models to specific tasks
  • low-rank-adaptation — parameter-efficient fine-tuning via low-rank weight decomposition (LoRA)

Scaling and Efficiency

  • scaling-laws — power-law relationships between compute, data, model size, and loss
  • flash-attention — IO-aware exact attention algorithm with O(N) memory
  • mixture-of-experts — sparse routing to expert sub-networks for efficient scaling

Emergent Capabilities

Alignment

Architectures

  • transformer — self-attention-based architecture replacing recurrence (Vaswani et al. 2017)
  • gpt — decoder-only autoregressive transformer family (GPT-1 through GPT-4)
  • bert-architecture — encoder-only bidirectional transformer with masked LM pretraining
  • encoder-decoder — sequence-to-sequence architecture from RNNs to transformers
  • mixture-of-experts-architecture — sparse expert routing for scaling parameters without proportional compute

Entities

  • openai — GPT series, scaling laws, RLHF/InstructGPT
  • google-brain — Transformer, Word2Vec, T5, Switch Transformer (merged into Google DeepMind in 2023)
  • google-deepmind — Chinchilla scaling laws, Gemini, AlphaFold
  • meta-ai — LLaMA open-weight model family, PyTorch
  • anthropic — Constitutional AI, Claude, alignment research

Papers

  • word2vec — Efficient Estimation of Word Representations in Vector Space (Mikolov et al. 2013)
  • seq2seq — Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
  • bahdanau-attention — Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al. 2015)
  • attention-is-all-you-need — Attention Is All You Need (Vaswani et al. 2017)
  • gpt-1 — Improving Language Understanding by Generative Pre-Training (Radford et al. 2018)
  • bert — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2018)
  • gpt-2 — Language Models are Unsupervised Multitask Learners (Radford et al. 2019)
  • t5 — Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al. 2020)
  • scaling-laws-neural-lm — Scaling Laws for Neural Language Models (Kaplan et al. 2020)
  • gpt-3 — Language Models are Few-Shot Learners (Brown et al. 2020)
  • switch-transformer — Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Fedus et al. 2021)
  • roformer-rope — RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al. 2021)
  • lora — LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
  • chain-of-thought-paper — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022)
  • instructgpt — Training Language Models to Follow Instructions with Human Feedback (Ouyang et al. 2022)
  • chinchilla — Training Compute-Optimal Large Language Models (Hoffmann et al. 2022)
  • flash-attention-paper — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al. 2022)
  • constitutional-ai-paper — Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022)
  • llama — LLaMA: Open and Efficient Foundation Language Models (Touvron et al. 2023)
  • gpt-4-technical-report — GPT-4 Technical Report (OpenAI 2023)
  • dpo — Direct Preference Optimization: Your Language Model Is Secretly a Reward Model (Rafailov et al. 2023)

Applications