Definition

Word embeddings are dense, low-dimensional vector representations of words learned from large text corpora. Unlike sparse one-hot encodings, embeddings capture semantic and syntactic relationships in continuous vector space, typically ranging from 50 to 300 dimensions.
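The contrast between sparse one-hot vectors and dense embeddings can be sketched as follows (toy sizes; in practice the embedding matrix is learned from a corpus, not random):

```python
import numpy as np

vocab_size = 10_000
embed_dim = 100  # typical embedding sizes range from 50 to 300

# One-hot encoding: a sparse 10,000-dimensional indicator vector.
word_id = 42
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Dense embedding: a matrix maps each word id to a low-dimensional row.
# (Random here purely for illustration; real embeddings are learned.)
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))
dense_vector = embedding_matrix[word_id]  # lookup is just row indexing

print(one_hot.shape, dense_vector.shape)  # (10000,) (100,)
```

The lookup-by-row view is why embedding layers in neural networks are equivalent to multiplying a one-hot vector by the embedding matrix.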

Key Intuition

Words that appear in similar contexts should have similar representations. This is the distributional hypothesis: “you shall know a word by the company it keeps” (Firth, 1957). Embeddings learned this way encode meaningful structure, famously enabling vector arithmetic like king − man + woman ≈ queen, where the result is resolved by a nearest-neighbor search in the vector space.
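The analogy arithmetic above amounts to vector addition followed by a cosine-similarity nearest-neighbor search, excluding the query words. A minimal sketch with made-up 4-dimensional vectors (real embeddings are learned and much higher-dimensional):

```python
import numpy as np

# Toy vectors chosen only to illustrate the mechanism.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "apple": np.array([0.0, 0.0, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land closest to queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```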

History/Origin

Bengio et al. (2003) introduced neural language models that learned word representations as a byproduct. The field accelerated with word2vec (Mikolov et al., 2013), which proposed two efficient architectures: Continuous Bag-of-Words (CBOW), predicting a word from its context, and Skip-gram, predicting context words from a word. GloVe (Pennington et al., 2014) combined global co-occurrence statistics with local context window methods. FastText (Bojanowski et al., 2017) extended word2vec with subword (character n-gram) information.

Relationship to Other Concepts

Word embeddings were the dominant input representation in NLP before contextual embeddings from pretrained models like BERT and GPT. They are a foundational instance of transfer learning, where representations learned on one corpus transfer to downstream tasks. Modern transformer-based models produce context-dependent embeddings that largely subsume static word vectors.

Notable Results

word2vec trained on the Google News corpus (about 100B words) produced vectors exhibiting systematic analogical reasoning. Pre-trained GloVe and word2vec embeddings became the standard initialization for most NLP models from 2013 to 2018, consistently improving performance over random initialization.
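Using pretrained vectors as initialization typically means copying rows from the pretrained table into a model's embedding matrix, with a random fallback for out-of-vocabulary words. A hedged sketch with a hypothetical toy pretrained table (3-dimensional vectors for illustration):

```python
import numpy as np

# Hypothetical pretrained lookup table; real tables (GloVe, word2vec)
# are loaded from released files and have 50-300 dimensions.
pretrained = {"cat": np.array([0.1, 0.2, 0.3]),
              "dog": np.array([0.2, 0.1, 0.4])}
vocab = ["cat", "dog", "axolotl"]  # "axolotl" is out-of-vocabulary

rng = np.random.default_rng(0)
dim = 3
# Copy pretrained rows where available; small random init otherwise.
matrix = np.stack([pretrained.get(w, rng.normal(scale=0.01, size=dim))
                   for w in vocab])
print(matrix.shape)  # (3, 3)
```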

Open Questions

  • How to best debias embeddings that absorb societal stereotypes from training corpora.
  • Whether static embeddings retain practical value in the era of large language models, particularly for resource-constrained settings.
  • Optimal methods for composing word-level embeddings into sentence and document representations without transformers.
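The simplest transformer-free composition baseline for the last question is mean pooling: averaging a sentence's word vectors. A minimal sketch with toy 3-dimensional vectors (more sophisticated schemes weight words, e.g. by inverse frequency):

```python
import numpy as np

# Toy word vectors for illustration only.
vectors = {"dogs": np.array([0.5, 0.1, 0.0]),
           "bark": np.array([0.1, 0.6, 0.2])}

def sentence_embedding(tokens, vectors):
    # Mean-pool the vectors of the known words in the sentence.
    rows = [vectors[t] for t in tokens if t in vectors]
    return np.mean(rows, axis=0)

print(sentence_embedding(["dogs", "bark"], vectors))  # [0.3  0.35 0.1]
```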

Sources

  • Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space.