
Enriching Word Vectors with Subword Information

Intermediate
Piotr Bojanowski, Edouard Grave, Armand Joulin et al. · 7/15/2016
arXiv

Key Summary

  • This paper teaches computers to understand words by also looking at the smaller pieces inside words, like 'un-', 'play', and '-ing'.
  • Instead of giving each word its own lonely vector, it builds a word’s vector by adding up vectors of its character n-grams (tiny letter chunks).
  • Because it uses subword pieces, it can make good vectors even for words it never saw before (out-of-vocabulary words).
  • It keeps the fast skip-gram training idea but upgrades it to learn from character n-grams, so training stays efficient.
  • The method works especially well for languages with many word endings and compounds, like German, Czech, and Russian.
  • On tests of word similarity and analogies, it usually beats classic word2vec models, especially on grammar-related (syntactic) questions.
  • It still helps when you have only a little data, which is great for small or special-topic datasets.
  • Longer letter chunks (like 3–6 characters) matter a lot, because they capture roots, prefixes, suffixes, and even compounds.
  • When used to start a language model, these vectors lower perplexity (make predictions more certain), especially in morphologically rich languages.
  • Overall, it’s a simple, fast, and practical way to make smarter word vectors using the letters inside words.

Why This Research Matters

This approach makes AI better at handling the words people actually use, including rare, new, and morphologically complex ones. That means smarter spell-checkers, search engines, and chatbots that don’t get confused by unusual forms or new slang. It helps low-resource languages and special fields (like medicine or law) where data is scarce, because it learns a lot from little. It also speeds up building practical systems since it uses plain text and simple training. In voice assistants and translation, better handling of word forms makes results clearer and more accurate. Overall, it’s a big step toward language tools that feel more natural and robust in everyday life.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how you can often guess what a new word means if you notice familiar pieces inside it, like 'micro-' in 'microscope' or '-ness' in 'kindness'? Those pieces are powerful clues.

🥬 The Concept: Word vectors are a way to turn words into numbers so computers can measure how similar words are. How it works:

  1. Collect a big pile of text and look at how words appear near each other.
  2. Give each word a point (a vector) in space so similar words are close together.
  3. Use those vectors to compare words or solve puzzles like analogies.

Why it matters: Without good word vectors, computers can’t easily understand meaning, find similar words, or do many language tasks well.

🍞 Anchor: If 'cat' and 'kitten' end up near each other in vector space, a program can learn they’re related.
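As a concrete (toy) illustration of “close together in space”, here is how similarity between two word vectors is usually measured, with cosine similarity. This is a minimal sketch: the vectors below are made up, and real models use around 300 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means very similar directions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional vectors purely for illustration.
vectors = {
    "cat":    np.array([0.9, 0.1, 0.0]),
    "kitten": np.array([0.8, 0.2, 0.1]),
    "car":    np.array([0.1, 0.9, 0.3]),
}

print(cosine_similarity(vectors["cat"], vectors["kitten"]))  # high: 'cat' and 'kitten' are close
print(cosine_similarity(vectors["cat"], vectors["car"]))     # lower: 'cat' and 'car' are farther apart
```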

🍞 Hook: Imagine you’re learning Spanish verbs. One verb like 'hablar' (to speak) has many forms (hablo, hablas, habló…). They look different on the surface but are really related.

🥬 The Concept: Morphologically rich languages are languages where words change their forms a lot to show things like tense, case, or number. How it works:

  1. Base words get extra bits (prefixes/suffixes or inside changes).
  2. These bits follow rules that carry meaning (like past tense or plural).
  3. The same root appears in many surface forms.

Why it matters: If a model treats every form as totally new, many rare forms won’t be learned well.

🍞 Anchor: In German, 'Tischtennis' (table tennis) combines two words; a model that notices 'Tennis' inside it will understand it better.

🍞 Hook: When you see a long new word, you naturally scan it for known chunks—like reading a big Lego build by spotting the bricks.

🥬 The Concept: Character n-grams are short, overlapping sequences of letters inside a word (like 'pla', 'lay', 'ayi' in 'playing'). How it works:

  1. Wrap the word with start/end markers (like < and >) to capture edges.
  2. Break it into all letter chunks of lengths 3 to 6.
  3. Learn a small vector for each chunk.

Why it matters: These chunks let the model share knowledge across related words and handle rare or unseen words.

🍞 Anchor: The word 'where' turns into <wh, whe, her, ere, re> and <where>, and those chunks build its vector.
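To make the chunking concrete, here is a minimal Python sketch of the extraction just described. It is our own helper, not the official fastText code; handling of very short words and multi-byte characters may differ there.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of lengths min_n..max_n, plus the full word, with < > boundary markers."""
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            gram = token[i:i + n]
            if gram != token:          # keep the full <word> sequence only once (added below)
                grams.append(gram)
    grams.append(token)                # the special sequence for the whole word
    return grams

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```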

🍞 Hook: If you can guess a missing word by looking at the words around it, you already understand the key training trick here.

🥬 The Concept: The skip-gram model is a way to learn word vectors by predicting surrounding words (context) from a given word. How it works:

  1. Pick a word in a sentence.
  2. Look a few words to the left and right (its context).
  3. Adjust vectors so the word becomes good at predicting its neighbors.

Why it matters: This simple task creates meaningful vectors fast, from plain text.

🍞 Anchor: Seeing “The cat sat on the ____,” the model nudges vectors so 'mat' gets more likely.
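A small sketch (our own simplified code, with no subsampling or vocabulary filtering) of how (center, context) training pairs come out of a sentence, with the window size re-sampled per position as the paper describes:

```python
import random

def skipgram_pairs(tokens, max_window=5):
    """Collect (center, context) pairs; the window is sampled uniformly in [1, max_window] per position."""
    pairs = []
    for t, center in enumerate(tokens):
        window = random.randint(1, max_window)
        for c in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if c != t:
                pairs.append((center, tokens[c]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split(), max_window=2))
```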

🍞 Hook: Practicing only the toughest problems can make you improve faster than drilling everything.

🥬 The Concept: Negative sampling is a training shortcut that makes learning fast by focusing on a few “wrong” words each time. How it works:

  1. For a real pair (word, context), also sample a few random words as fake contexts.
  2. Teach the model to score the real pair high and the fake ones low.
  3. Repeat for many pairs so the model learns patterns quickly.

Why it matters: Without it, training would be too slow to handle large vocabularies.

🍞 Anchor: When 'dog' appears near 'barked', the model also sees random words like 'banana' as negatives to learn the difference.
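A hedged sketch of negative sampling: pre-build a frequency-weighted table, then draw a few “fake” contexts from it per real pair. The 0.75 exponent is the common word2vec heuristic; the exact weighting inside fastText is an implementation detail.

```python
import random
from collections import Counter

def build_negative_table(corpus_tokens, power=0.75, table_size=100_000):
    """Frequent words get more slots, so they are drawn more often as negatives."""
    counts = Counter(corpus_tokens)
    total = sum(c ** power for c in counts.values())
    table = []
    for word, count in counts.items():
        table.extend([word] * max(1, round(table_size * (count ** power) / total)))
    return table

def sample_negatives(table, positive_context, k=5):
    """Draw k 'fake' context words, skipping the real context word."""
    negatives = []
    while len(negatives) < k:
        candidate = random.choice(table)
        if candidate != positive_context:
            negatives.append(candidate)
    return negatives
```

For the real pair ('dog', 'barked'), `sample_negatives(table, 'barked')` plays the role of the random 'banana'-style distractors in the Anchor above.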

The world before: Classic word2vec (skip-gram/CBOW) gave each word its own vector. This was great for common words in English but struggled with rare words and languages where words change form a lot. If a word never showed up at training time, it got no vector at all.

The problem: Many languages (German, Russian, Czech, Turkish, Finnish, etc.) produce lots of word forms. Many forms are rare, so the model sees them too few times to learn good vectors. And when a brand-new word appears, the model can’t handle it because there’s no vector.

Failed attempts: Some methods required special tools to split words into morphemes (tiny meaning units). Others used character-level neural networks, which can be accurate but more complex and slower, or they needed special annotated data. These approaches can work, but they add cost, complexity, or language-specific tools.

The gap: We needed something simple, fast, language-agnostic, and good with rare/new words—ideally trained on plain text with no extra tools.

Real stakes: Better word vectors help with search suggestions, autocorrect, chatbots, voice assistants, translation, and reading messy text like social media. If a system can handle rare or new words gracefully, it feels smarter and more helpful in everyday life.

02Core Idea

🍞 Hook: Building a word’s meaning from its letter pieces is like making a smoothie from fruits—you can mix familiar ingredients to create a new flavor.

🥬 The Concept: The key insight is to represent each word as the sum of vectors for its character n-grams (little letter chunks), and train these inside a skip-gram model. How it works:

  1. Break each word into overlapping letter chunks (lengths 3–6), plus the full word with special boundary markers.
  2. Give each chunk its own small vector.
  3. Represent a word by adding up its chunk vectors.
  4. Train with skip-gram so these chunk vectors get tuned by real contexts.

Why it matters: Now, even unseen words can get a good vector by summing the vectors of their chunks.

🍞 Anchor: 'Tischtennis' includes pieces like 'Tisch' and 'tennis'; their chunk vectors help the full word vector land near sports words.
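A sketch of the composition itself, under our own names: `ngram_vectors` stands in for the learned chunk table, and `char_ngrams()` is the helper sketched in the Background section.

```python
import numpy as np

def compose(word, ngram_vectors, dim=300):
    """Sum the vectors of every character n-gram of `word` that the model has learned.
    Chunks the model has never stored simply contribute nothing."""
    vec = np.zeros(dim)
    for gram in char_ngrams(word):
        if gram in ngram_vectors:
            vec += ngram_vectors[gram]
    return vec

# Even if 'Tischtennis' itself was never seen during training, chunks like 'Tisch'
# and 'tenni' learned from other words still give it a sensible vector.
```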

Three analogies to deepen it:

  • Lego bricks: Words are Lego models. The n-grams are bricks. Build any new model by reusing the same bricks in new combinations.
  • Family resemblances: Suffixes like '-ed' or '-ing' are like family traits; sharing them helps the model see relations among verbs.
  • Barcodes: Each set of n-grams is like a barcode for a word; similar barcodes mean similar words.

Before vs after:

  • Before: One vector per word, no sharing. Rare forms and unseen words either get poor vectors or none at all.
  • After: Vectors share parts across words. Rare and unseen words borrow strength from their pieces. Compound words and inflections are handled more naturally.

🍞 Hook: Imagine sorting library books by noticing shared roots like 'bio' or 'micro' even in brand-new titles.

🥬 The Concept: Why this works—intuition without equations. How it works:

  1. Many words share roots, prefixes, and suffixes that carry meaning.
  2. By giving those parts vectors, the model reuses meaning across many words.
  3. Adding the parts together blends their meanings, much like mixing colors.

Why it matters: This sharing is crucial for languages with many forms and for rare words, boosting accuracy without extra tools.

🍞 Anchor: 'Happiness', 'kindness', and 'sadness' all end with '-ness'. The model learns a consistent '-ness' meaning and applies it everywhere.

Building blocks of the idea:

  • 🍞 Hook: You know how you can understand a long word by spotting the start and end pieces? 🥬 The Concept: Boundary symbols mark beginnings/ends (<word>), so the model tells prefixes/suffixes apart from middles. How it works: Add < and > before/after a word, then make n-grams. '<un' is a prefix; 'ed>' is a suffix. Why it matters: Prefixes and suffixes often carry grammar and meaning. 🍞 Anchor: In 'unlucky', '<un' signals negation and 'ky>' helps spot the word ending.

  • 🍞 Hook: Sharing is caring—especially in memory. 🥬 The Concept: Hashing maps lots of possible n-grams into a fixed number of buckets to keep memory small. How it works: Use a fast hash function to send each n-gram to one of about two million slots; store one vector per slot. Why it matters: There are tons of possible letter chunks; hashing makes it feasible to train on big data. 🍞 Anchor: Even if two rare chunks collide into the same bucket, it usually doesn’t hurt much, and training stays fast.

  • 🍞 Hook: Practice with helpful examples beats trying to consider everything at once. 🥬 The Concept: Skip-gram with negative sampling tunes the chunk vectors using real contexts and a few random fakes per step. How it works: For each center word, predict nearby words, push scores up for real pairs, down for sampled wrong ones. Why it matters: It’s fast and proven to build meaningful vectors from plain text. 🍞 Anchor: For 'doctor' near 'hospital', we push that pair up and push down 'doctor' with random words like 'banana'.

The bottom line: A simple, additive mix of subword pieces, trained with a classic fast recipe, unlocks strong, flexible word vectors—especially where it used to be hardest.

03Methodology

High-level overview: Input text → Make word and character n-gram inventories → Train with skip-gram + negative sampling using word = sum of its n-gram vectors → Output word and n-gram vectors that handle OOV words.

Step A: Prepare words and their n-grams

  • What happens: For each word, add boundary markers (< and >), then extract all character n-grams of lengths 3 to 6, and also include the full word with markers.
  • Why this step exists: It captures roots, prefixes, suffixes, and even compounds; without it, the model can’t share meaning across related forms.
  • Example: 'where' → <where>, and 3-grams: <wh, whe, her, ere, re>.

🍞 Hook: Think of squeezing a long parade of letters into a manageable parking lot. 🥬 The Concept: Hash n-grams to a fixed set of buckets. How it works:

  1. Apply a fast hash (like FNV-1a) to each n-gram.
  2. Map it into one of about two million indices.
  3. Store one vector per index; learn all of them during training.

Why it matters: Keeps memory and speed under control without listing every possible chunk.

🍞 Anchor: Two rare chunks may share a slot, but the model still trains well in practice.
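A minimal FNV-1a sketch of this bucketing (the hash family mentioned above; byte handling in the official code may differ slightly):

```python
def fnv1a_32(ngram):
    """32-bit FNV-1a hash of the UTF-8 bytes of an n-gram."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def bucket(ngram, num_buckets=2_000_000):
    """Map an n-gram to one of ~2 million vector slots; occasional collisions are accepted."""
    return fnv1a_32(ngram) % num_buckets

print(bucket("<wh"), bucket("whe"))   # two (very likely different) slot indices
```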

Step B: Define the word vector as a sum of its chunk vectors

  • What happens: A word’s representation u_w is the sum of vectors z_g over its n-grams g in that word.
  • Why this step exists: It lets related words share parts; without it, every word would be isolated, making rare words weak.
  • Example: u(where) = z(<wh) + z(whe) + z(her) + z(ere) + z(re>) + z(<where>).
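Continuing the running sketch (with smaller sizes than the paper’s 300 dimensions and ~2 million buckets so it runs comfortably; `Z` and `u` are our own names, and `Z` is randomly initialized here until training tunes it):

```python
import numpy as np

DIM, NUM_BUCKETS = 100, 200_000
Z = np.random.normal(scale=0.01, size=(NUM_BUCKETS, DIM))   # one vector z(g) per hash bucket

def u(word):
    """u(word) = sum of z(g) over the word's n-grams g, as in Step B."""
    return sum(Z[bucket(g)] for g in char_ngrams(word))

vec_where = u("where")   # z(<wh) + z(whe) + z(her) + z(ere) + z(re>) + z(<where>)
```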

Step C: Train with skip-gram and negative sampling

  • What happens:
    1. Slide a window (size randomly between 1 and 5) over text.
    2. For each center word w_t, compute its vector by summing its n-gram vectors.
    3. Pick real context words nearby as positives.
    4. Sample a few random words as negatives (about 5 per positive), weighted by word frequency.
    5. Increase the score for real pairs; decrease for negative pairs.
    6. Update both the n-gram vectors (for the center word) and the context word vectors.
  • Why this step exists: It’s a fast, scalable way to push similar words together and dissimilar words apart. Without it, learning would be slow and less effective.
  • Example: In “The quick brown fox jumps over the lazy dog”, if 'fox' is center, 'brown' and 'jumps' might be positives, while random words like 'planet' are negatives.
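A compact sketch of one such update, building on `Z`, `bucket()`, and `char_ngrams()` from the earlier snippets. `context_matrix` and `train_pair` are hypothetical names, and this is a simplification of the real training loop, not the official fastText code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center_word, context_id, negative_ids, context_matrix, lr=0.05):
    """One skip-gram-with-negative-sampling step: push the real pair's score up, the negatives down,
    then send the same gradient to every n-gram vector of the center word."""
    grams = [bucket(g) for g in char_ngrams(center_word)]
    h = Z[grams].sum(axis=0)                      # the center word's vector: sum of its n-gram vectors

    grad_h = np.zeros_like(h)
    for word_id, label in [(context_id, 1.0)] + [(n, 0.0) for n in negative_ids]:
        score = sigmoid(h @ context_matrix[word_id])
        step = lr * (label - score)               # positive pairs pushed up, negatives pushed down
        grad_h += step * context_matrix[word_id]
        context_matrix[word_id] += step * h       # update the context (output) vector
    for idx in grams:
        Z[idx] += grad_h                          # shared chunks get updated by many different words
```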

🍞 Hook: Training is like jogging—start fast, slow down gently. 🥬 The Concept: Learning rate decay and parallel updates. How it works:

  1. Start with a higher step size (e.g., 0.05) and linearly reduce it as training progresses.
  2. Use multiple CPU threads (Hogwild) to update shared parameters without locks.

Why it matters: Keeps training quick and steady on large corpora.

🍞 Anchor: Many joggers (threads) running on the same track (parameters) still make good overall progress.
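A one-line sketch of the linear decay in step 1 (Hogwild itself is just many threads calling the update function concurrently, without locks):

```python
def learning_rate(step, total_steps, lr0=0.05):
    """Linear decay: start at lr0 (0.05 in the paper) and shrink toward 0 by the end of training."""
    return lr0 * (1.0 - step / total_steps)
```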

Step D: Regular choices that keep it robust

  • Word subsampling: Drop some very frequent words (like 'the') to focus on informative signals; without this, common words dominate.
  • Vector size: Use 300 dimensions—big enough to hold nuance, small enough for speed.
  • OOV handling: For a word never seen in training, build its vector on the fly by summing its n-gram vectors; without this, OOV words would be invisible.
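In the running sketch, OOV handling is nothing extra: the same `u()` composition is simply applied to a word the training data never contained.

```python
vec_oov = u("snorfle")   # composed on the fly from chunks such as '<sn', 'orf', and 'le>'
# A classic one-vector-per-word model would have nothing at all to return here.
```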

Putting it together (a mini run-through):

  • Input: A Wikipedia sentence in German: “Tischtennis ist beliebt.”
  • N-grams: Extract 3–6-grams from 'Tischtennis' with < and > markers; many capture 'Tisch' and 'tennis'.
  • Sum: Make the word vector by adding all those chunk vectors.
  • Train: When 'Tischtennis' appears near sports words, push those pairs’ scores up and random negatives down, updating the chunk vectors.
  • Result: The 'Tischtennis' vector ends up near other sports words because its chunks learned from meaningful contexts.

The secret sauce:

  • Summation of subwords: A super-simple composition function that is fast and works astonishingly well.
  • Boundary-aware chunks: Distinguish prefixes/suffixes from middles with < and >, making grammar bits pop out.
  • Hashing trick: Fixed memory, large coverage, and good-enough collisions for speed and scale.
  • Classic training, modern twist: Skip-gram + negative sampling, but the center word is a smart sum of its parts.

Performance and efficiency notes:

  • Training speed is roughly in the same ballpark as classic skip-gram (about 1.5× slower), still very fast on CPUs.
  • The gains are biggest for rare words and morphologically rich languages.
  • Longer n-grams (3–6) prove especially helpful, capturing roots, affixes, and compounds.

04Experiments & Results

🍞 Hook: If you invent a new word like 'snorfle', can the model still guess what it relates to? That’s the kind of test that shows real understanding.

🥬 The Concept: The tests measure three things—word similarity, analogies, and language modeling. How it works:

  1. Word similarity: Compare model’s similarity scores to human judgments (using Spearman correlation).
  2. Analogies: Solve A:B :: C:? puzzles (semantic like 'Paris:France :: Tokyo:Japan' and syntactic like 'run:running :: swim:swimming').
  3. Language modeling: Predict the next word well (low perplexity means more certainty).

Why it matters: These cover meaning, grammar, and predictive power—core skills for NLP.

🍞 Anchor: If the model says 'king' is closer to 'queen' than to 'apple', it’s on the right track.
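For the analogy test, here is a hedged sketch of the standard vector-arithmetic recipe, reusing the `u()` helper from the Methodology sketches (`solve_analogy` and `vocabulary` are our own names):

```python
import numpy as np

def solve_analogy(a, b, c, vocabulary):
    """A:B :: C:?  ->  return the candidate whose vector is closest to u(b) - u(a) + u(c)."""
    target = u(b) - u(a) + u(c)
    target /= np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word in vocabulary:
        if word in (a, b, c):                           # the query words themselves are excluded
            continue
        v = u(word)
        score = float(v @ target / np.linalg.norm(v))   # cosine similarity
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# With trained vectors, solve_analogy("run", "running", "swim", vocabulary) should return "swimming".
```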

The competition:

  • Baselines: Classic word2vec skip-gram (sg) and CBOW.
  • Morphology-aware methods: Models using morpheme segmentation or transformation rules.
  • Character-aware language model (CNN/LSTM-based) and a log-bilinear model for language modeling.

Scoreboard with context:

  • Word similarity across 9 languages (Arabic, Czech, German, English, Spanish, French, Italian, Romanian, Russian): The subword model (sisg) beats classic baselines on most datasets. Where words are common (like English WS353), it ties or is close; where words are rare or morphology is heavy (German/Russian), it clearly wins. Think of it as moving from a solid B to A-/A on average, with A+ in tough languages.
  • Rare words (English RW): The subword model does better than baselines, because it can build vectors from pieces even when whole words are rare—like recognizing a new Lego model from familiar bricks.
  • Analogies: Big gains on syntactic questions (grammar/inflections) in Czech, German, English, Italian. Semantic analogies are similar or slightly worse in some cases, showing the model’s strength leans toward forms and patterns, not world facts. Imagine acing the “grammar” section and getting a strong but not always top “knowledge” score.
  • Language modeling (Czech, German, Spanish, French, Russian): Initializing an LSTM with these subword-enriched vectors reduces test perplexity versus plain skip-gram vectors—especially in Czech and Russian (about 8–13% better), a clear step from a B+ to an A.

Surprising findings:

  • OOV superpower: When evaluation sets include words unseen in training, classic baselines use empty placeholders and stumble. The subword model builds non-empty, meaningful vectors from chunks and stays strong.
  • Longer chunks help: Using 3–6 character n-grams captures not only suffixes/prefixes but also compound pieces in languages like German. Using shorter chunks (like only 2-grams) isn’t enough.
  • Small-data boost: With only 1–5% of Wikipedia, the subword model can still match or beat baselines trained on all data. That’s a big deal for niche domains.
  • Saturation: As data grows huge, the gap narrows on some tasks; the model seems to reach a good plateau early.

Concrete snippets:

  • German similarity (GUR350): With only 5% data, the subword model scores around mid-60s correlation, beating CBOW on full data—like finishing early and still scoring higher.
  • English RW (rare words): Subword beats CBOW even when trained on just 1% of data, thanks to chunk-based generalization.
  • Language modeling: On Russian, initializing with subword vectors drops perplexity notably (around 13% vs sg), showing better certainty in predictions.

Takeaways:

  • Best in class for morphology and rare words.
  • Consistent OOV handling is a practical win.
  • Longer n-grams matter; they’re not just decoration.
  • Semantic world knowledge may need other signals, but syntax/morphology shine here.

05Discussion & Limitations

Limitations:

  • Semantic breadth: The model shines at forms (prefixes/suffixes, compounding) but doesn’t add extra world knowledge by itself, so semantic analogies sometimes don’t improve.
  • Order loss: A bag of n-grams ignores precise character order beyond chunk size; subtle differences can be missed.
  • N-gram lengths: Choosing 3–6 was sensible but not always optimal; different languages or tasks might prefer other ranges.
  • Hash collisions: Mapping many n-grams into limited buckets is efficient but can mix a few unrelated chunks.
  • Saturation: With very large data, gains can plateau; other techniques might be needed to keep improving.

Required resources:

  • Text data: Works with plain text; more and more diverse data generally helps.
  • Compute: Fast on CPUs, benefits from multi-threading; memory for vectors and hash buckets (~millions) is needed.
  • Tuning: Basic hyperparameters (window size, negatives, learning rate decay) need light tuning, though defaults often work.

When not to use:

  • If you need deep world knowledge or precise word senses (e.g., 'bank' of river vs money), you might pair this with larger contextual models.
  • For scripts or languages where characters don’t map neatly into alphabetic chunks (some logographic systems), you may need different subword units.
  • If device memory is extremely constrained and even hashed n-gram tables are too large, you might require smaller models.

Open questions:

  • How to automatically pick optimal n-gram ranges per language/task?
  • Can we combine this with learned subword segmentations (like BPE) or morpheme analyzers to get the best of both worlds?
  • Can we add light positional info or smarter composition than a simple sum without losing speed?
  • How to better integrate semantic world knowledge while keeping the simplicity and speed?
  • What’s the best way to transfer this to highly non-alphabetic languages?

Bottom line: This method is a sweet spot of simplicity, speed, and surprising power for morphology and rare words, but it’s not a magic wand for deep semantics. It plays well as a strong, efficient foundation and pairs nicely with higher-level models.

06Conclusion & Future Work

Three-sentence summary:

  • This paper enriches word vectors by building them from character n-grams and training with a skip-gram objective, so meaning is shared across related word forms.
  • It’s fast, simple, and handles unseen (OOV) words by composing their vectors from known letter chunks, which shines in morphologically rich languages and with small datasets.
  • On similarity, analogy, and language modeling benchmarks across multiple languages, it usually outperforms classic baselines and competes well with more complex morphology-aware methods.

Main achievement:

  • A practical, subword-based extension of skip-gram that brings powerful gains for rare words and morphology without requiring language-specific tools.

Future directions:

  • Auto-tune n-gram ranges per language; explore mixing with BPE or morpheme analyzers; add light positional or context-aware composition to strengthen semantics.
  • Extend to scripts with different writing systems; study smarter hashing or adaptive buckets; pair with modern contextual models for richer understanding.

Why remember this:

  • It showed that a very simple idea—add up vectors of letter chunks—can solve big real problems (OOV words, morphology) at scale.
  • This insight powered practical tools (like fastText) that many people use today for quick, strong baselines and production systems.
  • It’s a reminder that clever simplicity often wins: share parts, learn fast, and generalize to the words you haven’t seen yet.

Practical Applications

  • Improve search and autocomplete by understanding rare and misspelled queries using subword chunks.
  • Boost spell-check and autocorrect to suggest smarter fixes based on word pieces and roots.
  • Enhance chatbots and virtual assistants to handle new product names or slang without retraining from scratch.
  • Strengthen sentiment and topic analysis on social media where creative spellings and OOV words are common.
  • Initialize language models in morphologically rich languages for better predictions with less data.
  • Support machine translation for rare words and compounds by composing reasonable vectors on the fly.
  • Build domain-specific embeddings (medical, legal) that still work well when training data is limited.
  • Aid named entity recognition by linking unseen names via shared subword parts (e.g., prefixes/suffixes).
  • Improve OCR and ASR post-processing by better guessing word forms from noisy outputs.
  • Accelerate rapid prototyping (fastText-style) for teams needing quick, strong baselines on new corpora.
#subword embeddings · #character n-grams · #skip-gram · #negative sampling · #morphologically rich languages · #out-of-vocabulary (OOV) · #hashing trick · #word similarity · #word analogies · #language modeling · #cosine similarity · #fastText · #compound words · #prefixes and suffixes · #Spearman correlation