OpenAI (2023) presents GPT-4, a large-scale multimodal transformer that accepts both image and text inputs and produces text outputs. GPT-4 achieves human-level performance on many professional and academic benchmarks, including scoring in the top 10% on a simulated bar exam, a dramatic improvement over GPT-3.5 (bottom 10%).
Problem
Prior GPT models were text-only, limited in reasoning, and unreliable on complex professional tasks. A key engineering challenge was building training infrastructure whose behavior is predictable across many orders of magnitude of compute scale.
Key Contribution
A frontier multimodal model with predictable scaling: the team developed methods to accurately predict GPT-4's final loss and capability metrics (e.g., HumanEval pass rate) from models trained with up to 10,000x less compute. The model was aligned with reinforcement learning from human feedback (RLHF) during post-training.
Method
GPT-4 is a Transformer-based model pre-trained on next-token prediction over public and licensed data, then fine-tuned with RLHF. The report deliberately omits architecture details, model size, training compute, and dataset specifics, citing competitive and safety considerations. A key technical contribution is the predictable-scaling infrastructure: final loss follows a power law in compute, L(C) = aC^b + c (with b < 0 and c the irreducible loss), fitted on small runs and verified against the final run.
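The predictable-scaling idea can be sketched as a simple curve fit. This is an illustrative reconstruction, not OpenAI's code: the "runs" below are synthetic data generated from a known power law, and the compute units are arbitrary.

```python
# Sketch of predictable scaling: fit L(C) = a*C^b + c on cheap runs,
# then extrapolate to the full-compute run. Illustrative only; the
# data points here are synthetic, not real training results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(C, a, b, c):
    # b < 0: loss decays toward the irreducible floor c as compute grows
    return a * np.power(C, b) + c

# Hypothetical small-scale runs (compute normalized so the smallest run = 1),
# generated from a known power law so the fit can be checked.
C_small = np.logspace(0, 3, 8)                 # spans 1,000x in compute
L_small = power_law(C_small, 0.6, -0.3, 1.2)   # synthetic "measured" losses

(a, b, c), _ = curve_fit(power_law, C_small, L_small, p0=(1.0, -0.1, 1.0))

# Extrapolate four more orders of magnitude to the "final run" scale
C_final = 1e7
L_pred = power_law(C_final, a, b, c)
```

Here the fit recovers the parameters exactly because the synthetic data is noiseless; in practice each point would come from a real small-scale training run, and the extrapolated `L_pred` would be compared against the measured loss of the final model.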
Main Results
- Bar exam: ~90th percentile (vs. ~10th for GPT-3.5).
- LSAT: ~88th percentile; SAT Math: ~89th percentile.
- MMLU: outperforms prior models and most SOTA systems in English; surpasses English SOTA in 24 of 26 languages on translated variants.
- USABO Semifinal: 99th-100th percentile (vs. 31st-33rd for GPT-3.5).
- HumanEval coding: pass rate accurately predicted via power-law extrapolation from models trained with ~1,000x less compute.
- Final pre-training loss: predicted from models trained with up to 10,000x less compute, matching GPT-4's actual final loss with high accuracy.
Limitations
GPT-4 still hallucinates, has a limited context window, and does not learn from experience. The report withholds nearly all training details, limiting scientific reproducibility. Competitive-programming performance remains low (Codeforces rating below the 5th percentile of human competitors).
Impact
GPT-4 set a new frontier for LLM capability and demonstrated that scaling laws extend to predicting downstream task performance, not just loss. Its multimodal capabilities accelerated work on vision-language models. The opacity of the technical report intensified the open-source response, motivating projects like LLaMA and others to close the gap.