AI Research Wiki Index
Guides
- learning-paths — two approaches to learning LLM research: lightweight (videos/blogs/hands-on) and paper-based
Concepts
Representation and Attention
- word-embeddings — dense vector representations of words (Word2Vec, GloVe)
- attention — mechanism for dynamically focusing on relevant input positions
- self-attention — attention within a single sequence; core of the transformer
- positional-encoding — injecting sequence order into transformers (sinusoidal, RoPE, ALiBi)
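The attention entries above can be illustrated with a minimal sketch of scaled dot-product self-attention. This is a toy version in pure Python with identity Q/K/V projections (real transformers learn separate projection matrices per head); the function and variable names are illustrative, not from any library.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(X):
    # Scaled dot-product self-attention with identity projections:
    # every position attends over all positions in the same sequence.
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)  # attention weights sum to 1
        out.append([sum(wj * v[i] for wj, v in zip(w, X)) for i in range(d)])
    return out

# Three positions, two dimensions.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)
```

Because the weights form a convex combination, each output row is a weighted average of the input rows, which is exactly the "dynamic focusing" described in the attention entry.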
Training Paradigms
- transfer-learning — using knowledge from pretraining for downstream tasks
- pretraining — self-supervised training on large unlabeled corpora before task adaptation
- fine-tuning — adapting pretrained models to specific tasks
- low-rank-adaptation — parameter-efficient fine-tuning via low-rank weight decomposition (LoRA)
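The low-rank-adaptation entry can be sketched in a few lines. Assuming the standard LoRA formulation from Hu et al. 2021, the frozen weight W is augmented with a trainable low-rank update (alpha / r) * B @ A, where B is initialized to zero so the adapted model starts out identical to the base model; all names here are illustrative.

```python
import random

def matmul(A, B):
    # Plain nested-loop matrix multiply over lists of lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_weight(W, A, B, alpha, r):
    # Effective weight W + (alpha / r) * (B @ A); only A and B are trained,
    # so the number of trainable parameters is d*r + r*d instead of d*d.
    delta = matmul(B, A)
    return [[w + (alpha / r) * dv for w, dv in zip(wr, dr)] for wr, dr in zip(W, delta)]

d, r = 4, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weight
A = [[random.gauss(0.0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, small random init
B = [[0.0] * r for _ in range(d)]                                    # d x r, zero init
W_eff = lora_weight(W, A, B, alpha=16, r=r)
```

With B at zero the update vanishes, so before any training W_eff equals W exactly; gradient steps on A and B then move the effective weight away from the base.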
Scaling and Efficiency
- scaling-laws — power-law relationships between compute, data, model size, and loss
- flash-attention — IO-aware exact attention algorithm with O(N) memory
- mixture-of-experts — sparse routing to expert sub-networks for efficient scaling
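The scaling-laws entry can be made concrete with the Chinchilla rule of thumb. This is a hedged back-of-the-envelope sketch, not the papers' fitted formulas: it assumes training compute C ≈ 6·N·D FLOPs and the compute-optimal regime of roughly 20 training tokens per parameter, with both N_opt and D_opt scaling as about C^0.5.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # Rough compute-optimal sizing: C ~= 6 * N * D with D = 20 * N
    # gives C = 120 * N**2, so N_opt = sqrt(C / 120).
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's training budget (~5.8e23 FLOPs).
n, d = chinchilla_optimal(5.76e23)
```

Plugging in that budget recovers roughly a 70B-parameter model trained on about 1.4T tokens, matching the Chinchilla configuration.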
Emergent Capabilities
Alignment
Architectures
- transformer — self-attention-based architecture replacing recurrence (Vaswani et al. 2017)
- gpt — decoder-only autoregressive transformer family (GPT-1 through GPT-4)
- bert-architecture — encoder-only bidirectional transformer with masked LM pretraining
- encoder-decoder — sequence-to-sequence architecture from RNNs to transformers
- mixture-of-experts-architecture — sparse expert routing for scaling parameters without proportional compute
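The mixture-of-experts entry describes sparse routing, which a toy top-1 (Switch-style) router makes concrete. This is a minimal sketch with made-up gate and expert functions: only the selected expert runs per input, so per-token compute stays flat as the expert count (and total parameter count) grows.

```python
def top1_route(scores):
    # Pick the index of the expert with the highest gating score.
    return max(range(len(scores)), key=lambda i: scores[i])

def moe_layer(x, experts, gate):
    # Sparse MoE layer: score all experts, run only the winner.
    scores = gate(x)
    chosen = top1_route(scores)
    return experts[chosen](x)

# Four toy experts: expert k scales its input by (k + 1).
experts = [lambda x, k=k: [v * (k + 1) for v in x] for k in range(4)]
# Toy gate: for positive inputs, the last expert always scores highest.
gate = lambda x: [sum(x) * (i + 1) for i in range(4)]

y = moe_layer([1.0, 2.0], experts, gate)
```

Real routers are learned linear layers with load-balancing losses; the structural point here is only that one expert's forward pass is executed per token.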
Entities
- openai — GPT series, scaling laws, RLHF/InstructGPT
- google-brain — Transformer, Word2Vec, T5, Switch Transformer (merged with DeepMind in 2023 to form Google DeepMind)
- google-deepmind — Chinchilla scaling laws, Gemini, AlphaFold
- meta-ai — LLaMA open-weight model family, PyTorch
- anthropic — Constitutional AI, Claude, alignment research
Papers
- word2vec — Efficient Estimation of Word Representations in Vector Space (Mikolov et al. 2013)
- seq2seq — Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
- bahdanau-attention — Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al. 2015)
- attention-is-all-you-need — Attention Is All You Need (Vaswani et al. 2017)
- gpt-1 — Improving Language Understanding by Generative Pre-Training (Radford et al. 2018)
- bert — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2018)
- gpt-2 — Language Models are Unsupervised Multitask Learners (Radford et al. 2019)
- t5 — Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al. 2020)
- scaling-laws-neural-lm — Scaling Laws for Neural Language Models (Kaplan et al. 2020)
- gpt-3 — Language Models are Few-Shot Learners (Brown et al. 2020)
- switch-transformer — Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Fedus et al. 2021)
- roformer-rope — RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al. 2021)
- lora — LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
- chain-of-thought-paper — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022)
- instructgpt — Training Language Models to Follow Instructions with Human Feedback (Ouyang et al. 2022)
- chinchilla — Training Compute-Optimal Large Language Models (Hoffmann et al. 2022)
- flash-attention-paper — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al. 2022)
- constitutional-ai-paper — Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022)
- llama — LLaMA: Open and Efficient Foundation Language Models (Touvron et al. 2023)
- gpt-4-technical-report — GPT-4 Technical Report (OpenAI 2023)
- dpo — Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al. 2023)
Applications