Wei et al. (2022) from Google Brain demonstrate that chain-of-thought prompting, in which few-shot exemplars include intermediate reasoning steps, dramatically improves the ability of large language models to perform arithmetic, commonsense, and symbolic reasoning tasks. The technique requires no fine-tuning or architectural changes and emerges as an ability in sufficiently large models.
Problem
Scaling model size alone does not solve reasoning tasks. Standard in-context learning with input-output exemplars fails on multi-step problems such as math word problems and symbolic manipulation, even at the scale of 540B parameters. Prior approaches to eliciting reasoning required task-specific fine-tuning on curated rationale datasets.
Key Contribution
A simple prompting technique: augment each few-shot exemplar with a natural language “chain of thought” showing intermediate reasoning steps before the final answer. This decomposes multi-step problems, allocates additional computation to harder problems, and provides an interpretable window into model reasoning. The method works off the shelf: no training is needed, only a change to the prompt.
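The difference between standard and chain-of-thought prompting can be sketched as plain string construction. The tennis-ball exemplar below follows the paper's Figure 1; `build_prompt` is a hypothetical helper, not the authors' code.

```python
# Standard few-shot exemplar: question maps directly to the answer.
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11.\n\n"
)

# Chain-of-thought exemplar: the same question, but the answer is preceded
# by the intermediate reasoning steps. Nothing else in the prompt changes.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)

def build_prompt(exemplars, question):
    """Concatenate few-shot exemplars, then append the question to solve.

    The model is expected to continue the text after the trailing "A:",
    imitating the exemplars' format (reasoning steps, then the answer).
    """
    return "".join(exemplars) + f"Q: {question}\nA:"

prompt = build_prompt([cot_exemplar], "A jug holds 4 liters. How many liters do 3 jugs hold?")
```

Because the change is confined to the exemplar text, the same technique applies unchanged to any model that accepts a text prompt.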
Method
The authors manually write 8 chain-of-thought exemplars per task, with no prompt engineering. They evaluate three model families, GPT-3 (OpenAI), LaMDA, and PaLM, across arithmetic benchmarks (GSM8K, SVAMP, ASDiv, AQuA, MAWPS), commonsense benchmarks (CSQA, StrategyQA), and symbolic reasoning tasks (last-letter concatenation, coin flip). Standard prompting (answer only) serves as the baseline.
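The two symbolic tasks have programmatic ground truth, which makes the evaluation targets easy to state. A minimal sketch (function names are illustrative, not from the paper):

```python
def last_letter_concat(words):
    """Ground truth for last-letter concatenation: take the last letter
    of each word and join them, e.g. ["Elon", "Musk"] -> "nk"."""
    return "".join(w[-1] for w in words)

def coin_still_heads(flips):
    """Ground truth for the coin-flip task: the coin starts heads up and
    each True in `flips` means someone flipped it. Returns whether the
    coin is still heads up at the end (i.e. an even number of flips)."""
    heads = True
    for flipped in flips:
        if flipped:
            heads = not heads
    return heads
```

Both tasks require tracking state across multiple steps, which is exactly where answer-only prompting breaks down and chain-of-thought exemplars help.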
Main Results
PaLM 540B with chain-of-thought prompting achieves a 57% solve rate on GSM8K, surpassing fine-tuned GPT-3 with a verifier (55%) and the prior best (33%). On SVAMP, PaLM 540B + CoT reaches 79.0%. Chain-of-thought prompting is an emergent ability: it provides little or no benefit for models below ~100B parameters but yields large gains at 540B. On commonsense tasks, CoT improves accuracy on StrategyQA to 73.9% (prior SoTA 69.4%). The gains are robust to different annotators writing the exemplar chains.
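Scoring a chain-of-thought generation requires pulling the final answer out of the free-form reasoning text. A common convention, assumed here rather than taken from the paper's exact parser, is to match a trailing "The answer is N" phrase:

```python
import re

def extract_final_answer(generation):
    """Extract the numeric final answer from a generated chain of thought.

    Assumes the model ends its reasoning with a phrase like
    "The answer is 11." (the format set by the few-shot exemplars).
    Returns the number as a string with commas stripped, or None if no
    such phrase is found.
    """
    m = re.search(r"[Tt]he answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", generation)
    if m is None:
        return None
    return m.group(1).replace(",", "")
```

Accuracy is then just exact match between the extracted string and the benchmark's gold answer, so a correct chain with a wrong final number still scores zero.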
Limitations
Chain-of-thought prompting helps only sufficiently large models (~100B+ parameters), limiting practical applicability. The chain of thought is not guaranteed to be faithful to the model's actual computation. The method does not help on tasks where reasoning steps are unclear or unnecessary. Correct chains of thought do not guarantee correct final answers, and incorrect chains can still produce correct answers by coincidence.
Impact
This paper launched a major research direction in chain-of-thought reasoning, leading to self-consistency (Wang et al., 2022), tree-of-thought, and zero-shot CoT (“Let’s think step by step”). Chain-of-thought became a standard technique in prompt engineering and a key motivation for reasoning-focused models. The work reframed scaling discussions from raw performance to emergent capabilities, influencing how the field thinks about what abilities arise with scale.