LLaMA: Open and Efficient Foundation Language Models

Touvron et al. (2023) at Meta AI introduce LLaMA, a family of foundation language models (7B to 65B parameters) trained exclusively on publicly available data. Motivated by the Chinchilla scaling laws, LLaMA trains smaller models on far more tokens than is typical, and LLaMA-13B outperforms GPT-3 (175B) on most benchmarks while being 10x smaller.

Problem

State-of-the-art LLMs relied on proprietary datasets and prioritized training the largest possible models, making them inaccessible to the research community. The Chinchilla insight that models should be trained on more tokens was underexploited: compute-optimal scaling accounts only for training cost, whereas at scale inference cost matters as well.

Key Contribution

A series of competitive open-weight transformer models trained only on publicly available data and optimized for inference budget rather than training compute alone. LLaMA showed that training beyond the compute-optimal point (in the Chinchilla sense) produces models that are cheaper to serve while remaining competitive.
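The serving-cost argument can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes two common approximations: about 6 FLOPs per parameter per training token, and about 2 FLOPs per parameter per generated token (ignoring attention over the context). These estimates and the function names are illustrative assumptions rather than figures from this summary; the example uses LLaMA-13B's nominal size and the 1.0T-token budget reported in the paper.

```python
# Rough compute accounting under two widely used approximations (assumptions):
#   training  ~ 6 FLOPs per parameter per training token
#   inference ~ 2 FLOPs per parameter per generated token (attention cost ignored)


def training_flops(n_params, n_train_tokens):
    """Approximate total training compute."""
    return 6.0 * n_params * n_train_tokens


def inference_flops(n_params, n_served_tokens):
    """Approximate total compute spent generating n_served_tokens."""
    return 2.0 * n_params * n_served_tokens


def breakeven_served_tokens(n_params, n_train_tokens):
    """Served tokens at which cumulative inference compute equals training compute."""
    return training_flops(n_params, n_train_tokens) / (2.0 * n_params)


def per_token_cost_ratio(large_params, small_params):
    """How many times more inference compute per token the larger model needs."""
    return inference_flops(large_params, 1) / inference_flops(small_params, 1)


if __name__ == "__main__":
    n, d = 13e9, 1.0e12  # nominal LLaMA-13B size and training tokens
    print(f"break-even served tokens: {breakeven_served_tokens(n, d):.2e}")
    print(f"175B vs 13B per-token cost: {per_token_cost_ratio(175e9, 13e9):.1f}x")
```

Under these assumptions the break-even point is three times the training-token count regardless of model size, so a model that generates far more text than it was trained on spends most of its lifetime compute on inference. That is the regime in which a smaller model trained for longer pays off.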

Method

LLaMA uses a standard autoregressive transformer with three architectural refinements: RMSNorm pre-normalization, SwiGLU activations, and rotary positional embeddings (RoPE, introduced in RoFormer). The training data is a mixture of publicly available sources totaling roughly 1.4T tokens, sampled in these proportions: CommonCrawl (67%), C4 (15%), GitHub (4.5%), Wikipedia (4.5%), Books (4.5%), ArXiv (2.5%), and StackExchange (2%). Models range from 7B to 65B parameters; the two largest (33B and 65B) are trained on 1.4T tokens and the two smallest (7B and 13B) on 1.0T tokens.
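The three refinements can be illustrated on plain Python lists. This is a minimal sketch of the standard formulations, not the authors' implementation: the helper names, the eps value, the RoPE base of 10000, the adjacent-pair rotation layout, and the bias-free projections are all assumptions made for illustration.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-element gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]


def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block: W_down @ (SiLU(W_gate @ x) * (W_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def rope(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i+1]), i even, by angle position * base**(-i/d)."""
    d = len(x)  # must be even
    out = list(x)
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


if __name__ == "__main__":
    q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
    # The query-key score depends only on the position offset (here 2).
    print(dot(rope(q, 3), rope(k, 1)), dot(rope(q, 5), rope(k, 3)))
```

In a pre-normalization layout, `rms_norm` would be applied to the input of each attention and feed-forward sub-layer, and `rope` to the query and key vectors before their dot product. Because each pair is rotated by an angle proportional to its position, the resulting attention score depends on the relative offset between positions rather than on their absolute values.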

Main Results

LLaMA-13B outperforms GPT-3 (175B) on most benchmarks despite being 10x smaller.
LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.
Performance of the 7B model is still improving after 1T tokens, well beyond the Chinchilla-optimal token budget for its size.
All models were trained entirely on publicly available data, which supports reproducibility.
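To put "beyond Chinchilla-optimal" in numbers, the sketch below compares each model's token budget with a frequently quoted reading of the Chinchilla results, roughly 20 training tokens per parameter. That ratio is a rule of thumb assumed here, not a figure from this summary, and the model sizes are nominal rather than exact parameter counts.

```python
# Rule-of-thumb ratio often quoted for compute-optimal training (assumption).
CHINCHILLA_TOKENS_PER_PARAM = 20

# Nominal parameter counts and training-token budgets:
# 1.0T tokens for 7B and 13B, 1.4T tokens for 33B and 65B.
LLAMA_RUNS = {
    "7B": (7e9, 1.0e12),
    "13B": (13e9, 1.0e12),
    "33B": (33e9, 1.4e12),
    "65B": (65e9, 1.4e12),
}


def overtraining_factor(n_params, n_tokens, ratio=CHINCHILLA_TOKENS_PER_PARAM):
    """Training tokens divided by the rule-of-thumb budget ratio * n_params."""
    return n_tokens / (ratio * n_params)


def summarize(runs=LLAMA_RUNS):
    """Return tokens per parameter and overtraining factor for each model."""
    rows = {}
    for name, (n_params, n_tokens) in runs.items():
        rows[name] = {
            "tokens_per_param": n_tokens / n_params,
            "overtraining_factor": overtraining_factor(n_params, n_tokens),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize().items():
        print(
            f"{name}: {row['tokens_per_param']:.0f} tokens/param, "
            f"{row['overtraining_factor']:.2f}x the rule-of-thumb budget"
        )
```

Under the assumed ratio, the 7B model sees about seven times the rule-of-thumb token budget, while the 65B model sits close to it. The smaller models are the ones pushed furthest past the compute-optimal point.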

Limitations

The initial release was restricted to research use, limiting commercial deployment. Models exhibit biases and toxicity typical of web-trained LLMs. No instruction tuning or alignment was included in the base release.

Impact

LLaMA catalyzed the open-source LLM ecosystem. Its weights, once leaked, became the foundation for Alpaca, Vicuna, and hundreds of community fine-tunes, including variants based on low-rank adaptation (LoRA). It showed that competitive LLMs could be built on public data, democratizing access to foundation models and leading to the Llama 2 and Llama 3 series.

Sources

Touvron et al. (2023). LLaMA: Open and Efficient Foundation Language Models.
