Mikolov et al. (2013) introduce two architectures, Continuous Bag-of-Words (CBOW) and Skip-gram, for learning continuous vector representations of words from large unlabeled corpora. Trained on 1.6 billion words in under a day, these models produce word embeddings that capture both syntactic and semantic regularities, famously enabling linear algebraic analogies such as vector(“King”) - vector(“Man”) + vector(“Woman”) ≈ vector(“Queen”).
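The analogy arithmetic amounts to a nearest-neighbor search around an offset vector. A toy sketch with hypothetical hand-picked 3-dimensional vectors (real word2vec embeddings are learned and typically 100-1000 dimensional; the query words themselves are excluded from the search, as in the paper's evaluation):

```python
import numpy as np

# Hypothetical 3-d embeddings, chosen so that the gender and royalty
# offsets line up the way learned word2vec vectors famously do.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.2, 0.3]),
}

def analogy(a, b, c):
    """Return the word whose vector is closest (cosine similarity)
    to emb[a] - emb[b] + emb[c], excluding the three query words."""
    target = emb[a] - emb[b] + emb[c]
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("king", "man", "woman"))  # queen
```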
Problem
Prior neural language models (feedforward NNLM, RNN-based LM) learned useful word representations but were computationally expensive, limiting training to corpora of a few hundred million words with modest embedding dimensionality (50-100).
Key Contribution
Two simplified architectures that remove the hidden layer bottleneck, enabling training on billion-word corpora at dramatically lower computational cost while producing higher-quality embeddings.
Method
CBOW predicts a target word from its surrounding context words by averaging their projection vectors. Skip-gram inverts this, predicting context words given a center word. Both use hierarchical softmax with Huffman-coded vocabularies to speed up training. Training complexity scales as O(E × T × Q), where E is the number of training epochs, T is the number of words in the corpus, and Q is the per-word model cost; Q is much smaller than in the NNLM or RNNLM because both architectures eliminate the expensive non-linear hidden layer.
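The two architectures differ only in which side of the context window is predicted. A minimal sketch of how training pairs would be extracted from one tokenized sentence (using a hypothetical fixed window; the paper actually samples the effective window size per word):

```python
def training_pairs(tokens, window=2):
    """From one sentence, build (context-list, target) examples for
    CBOW and (center, context) examples for Skip-gram.
    `window` is the max distance considered on each side."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, i - window),
                              min(len(tokens), i + window + 1))
               if j != i]
        cbow.append((ctx, center))                  # many contexts -> one target
        skipgram.extend((center, c) for c in ctx)   # one center -> each context
    return cbow, skipgram

cbow, sg = training_pairs(["the", "cat", "sat", "on", "mat"], window=1)
print(cbow[1])  # (['the', 'sat'], 'cat')
```

CBOW averages the projection vectors of each context list before predicting the target, which is why a single example costs roughly one projection lookup per context word rather than a full hidden-layer pass.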
Main Results
On the paper's Semantic-Syntactic Word Relationship test set, Skip-gram with 300-dimensional vectors trained on 1.6 billion words achieved 53.3% semantic and 55.8% syntactic accuracy, substantially outperforming prior NNLM and RNNLM baselines. Accuracy improved with both training data size and vector dimensionality, with the paper noting that both must be scaled together for continued gains.
Limitations
The models treat each word as an atomic unit, ignoring subword morphology. The analogy evaluation is limited in scope and does not capture polysemy. Performance depends heavily on hyperparameter tuning (window size, dimensionality, training data).
Impact
Word2Vec became the default word representation method in NLP for several years, enabling rapid progress in downstream tasks. It established the paradigm of pretraining representations on unlabeled text, directly influencing subsequent work on contextual embeddings (ELMo), GPT, and BERT. The Skip-gram with negative sampling variant became especially widespread. The linear analogy property spurred research into the geometry of word embeddings.
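The negative-sampling variant mentioned above (introduced in the follow-up paper, Mikolov et al. 2013b) replaces hierarchical softmax with a handful of sampled "negative" words per positive pair. A minimal NumPy sketch of one SGD step under that objective; the matrix names, learning rate, and toy dimensions are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, neg_ids, lr=0.05):
    """One SGD step on the skip-gram negative-sampling objective:
    maximize log sigmoid(u_ctx . v) + sum_n log sigmoid(-u_n . v),
    where v = W_in[center] and u_* are rows of W_out."""
    v = W_in[center].copy()              # snapshot before in-place updates
    # Positive pair: pull the context vector and v together.
    g = sigmoid(W_out[context] @ v) - 1.0
    grad_v = g * W_out[context].copy()
    W_out[context] -= lr * g * v
    # Negative samples: push each sampled vector away from v.
    for n in neg_ids:
        g_n = sigmoid(W_out[n] @ v)
        grad_v += g_n * W_out[n].copy()
        W_out[n] -= lr * g_n * v
    W_in[center] -= lr * grad_v

# Tiny demo: repeated steps raise the model's probability for the true pair.
V, d = 10, 8
W_in = 0.1 * rng.standard_normal((V, d))
W_out = 0.1 * rng.standard_normal((V, d))
before = sigmoid(W_out[1] @ W_in[0])
for _ in range(50):
    sgns_step(W_in, W_out, center=0, context=1, neg_ids=[2, 3])
after = sigmoid(W_out[1] @ W_in[0])
print(round(before, 3), "->", round(after, 3))
```

Each step costs O(d × (1 + #negatives)) instead of O(d × vocabulary), which is a large part of why this variant scaled so well in practice.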