Embedding Spaces & Distributed Representations
Key Points
- Embedding spaces map discrete things like words or products to dense vectors so that similar items are close together.
- Distances and angles in these spaces (often cosine similarity) reflect semantic or functional similarity.
- An embedding matrix is just a big table of numbers; looking up an item selects a row, which is its vector.
- We train embeddings by predicting contexts, co-occurrences, or labels so that useful geometry emerges.
- Cosine similarity, dot product, and Euclidean distance are the core operations used for search and clustering.
- Training at scale relies on tricks like negative sampling, mini-batching, and regularization.
- Choosing the dimension d is a trade-off between expressiveness, speed, and overfitting.
- Normalization and careful evaluation (nearest neighbors, clustering quality, downstream tasks) are essential.
Prerequisites
- Linear Algebra Fundamentals — Embeddings rely on vectors, dot products, norms, and matrices.
- Basic Probability — Negative sampling, softmax, and probabilistic objectives use distributions.
- Gradient Descent and Optimization — Training embeddings is an optimization problem over parameters.
- Arrays and Memory Layout — Embedding tables are large arrays; cache/memory access patterns matter.
- Vector Similarities and Distances — Cosine, dot product, and Euclidean distance define neighborhoods in embedding space.
- C++ Basics and Standard Library — Implementing lookup, training steps, and search efficiently requires C++ proficiency.
- Dimensionality Reduction — Compression techniques like PCA or random projections are common with embeddings.
Detailed Explanation
01 Overview
Embedding spaces and distributed representations turn symbolic objects (words, users, products, molecules) into dense numeric vectors in (\mathbb{R}^d). Instead of treating items as unrelated IDs, embeddings arrange them so that geometric relationships—distances and angles—capture meaningful notions of similarity. For example, in a good word embedding space, vectors for "cat" and "dog" are closer to each other than to "banana". These vectors are learned from data by optimizing objectives that reward useful proximity (e.g., predicting context words, clicks, or labels).
The central tool is an embedding matrix: one row per item, with d coordinates each. Looking up an item returns its vector, and simple linear algebra (dot products, norms, cosine similarity) powers retrieval, clustering, and recommendation. Training methods include word2vec (skip-gram with negative sampling), GloVe (matrix factorization of co-occurrence), and contrastive or triplet losses in deep models. Dimensionality reduction (e.g., PCA or random projections) helps compress or visualize spaces.
Why this matters: vectors make learning and search fast. Once items live in a shared space, you can do k-nearest-neighbor search for recommendation, compute centroid prototypes for classification, or measure diversity. Embeddings are now a universal interface between raw symbols and machine learning models.
02 Intuition & Analogies
Imagine organizing a library without reading the books. You place books on shelves so that related topics end up near each other. Over time, you learn to co-locate physics with math, and cooking near nutrition. An embedding space is like a huge, precise library where every item has an address in a multi-dimensional room. The closer two items are, the more they tend to be used together, co-occur, or share properties.
Another analogy: coordinates on a world map. A city’s latitude and longitude are just two numbers that tell you where it sits; nearby cities likely share climate and culture. Embeddings give each item many coordinates (perhaps 100–1000). Two items with similar coordinates are likely related. The “direction” between two points might encode a concept: moving in the “plural” direction could turn "car" into "cars", or in a “royalty” direction might relate "man" to "king" and "woman" to "queen".
Think of a recipe as a list of ingredients and amounts. Two recipes that use similar ingredients in similar proportions taste alike. If you represent each recipe as a vector of ingredient quantities, comparing recipes becomes comparing vectors. Likewise, if two words appear in similar contexts (the distributional hypothesis: “You shall know a word by the company it keeps”), their vectors should resemble each other.
Finally, picture magnets on a whiteboard. During training, we nudge magnets so that items that should be together pull closer, and unrelated items push apart. After many nudges guided by data, a stable arrangement emerges where geometry mirrors meaning.
03 Formal Definition
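A minimal formal setup, consistent with the notation used elsewhere on this page:

```latex
Let $V$ be a finite set of items (a vocabulary) with $|V| = n$.
An embedding is a map $\phi : V \to \mathbb{R}^d$, typically stored as a matrix
$E \in \mathbb{R}^{n \times d}$ whose $w$-th row is the vector of item $w$:
$$ \phi(w) = e_w = E_{w,:} . $$
A similarity (or distance) equips the space with geometry, e.g.
$$ s(u, v) = u^\top v, \qquad
   \cos(u, v) = \frac{u^\top v}{\lVert u \rVert_2 \, \lVert v \rVert_2}, \qquad
   d(u, v) = \lVert u - v \rVert_2 . $$
The parameters $E$ are learned by minimizing a loss $L(E)$ (e.g. skip-gram with
negative sampling, a contrastive loss, or a downstream task loss) chosen so that
$s(e_u, e_v)$ is large for related pairs $(u, v)$ and small for unrelated ones.
```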
04 When to Use
Use embeddings whenever you have discrete identifiers but want to generalize or compute similarity:
- Language and text: represent words, subwords, documents for search, topic modeling, or as inputs to neural networks.
- Recommendations: map users and items to a joint space where inner products approximate click or purchase probabilities.
- Knowledge graphs: embed entities and relations to enable link prediction.
- Bioinformatics and chemistry: embed proteins, genes, or molecules to capture structural or functional similarity.
- Vision and audio: embed images or clips so that similar content clusters together; e.g., face recognition via triplet loss.
Embeddings shine when labels are sparse but co-occurrence or interaction data is plentiful. They enable fast similarity search and “cold-start” generalization to unseen combinations. Choose cosine similarity when only direction matters (scale-invariant comparisons), dot product when magnitude carries signal (as in implicit-feedback recommenders), and Euclidean distance when your loss or model assumes it.
Do not overuse embeddings for tiny vocabularies (one-hot may suffice) or when interpretability is paramount (dimensions lack direct meaning). If exactness is critical, be careful with approximate nearest neighbor methods; for high-dimensional spaces, their speed-ups come with recall trade-offs.
⚠️ Common Mistakes
- Skipping normalization when using cosine similarity: cosine is scale-invariant, but if you compare unnormalized vectors with a dot product and call it “cosine,” rankings will be biased by vector lengths. Normalize vectors to unit norm when you intend cosine.
- Confusing metrics: Euclidean distance and cosine similarity induce different neighborhoods, especially when norms vary. Decide based on your loss and evaluation task, and keep it consistent between training and retrieval.
- Dimension d too small or too large: too small underfits (can’t separate concepts), too large overfits and costs memory/compute. Start with a heuristic (e.g., 50–300 for small vocabularies, 256–1024 for large) and tune.
- Poor negative sampling: uniform negatives in skewed vocabularies yield slow or unstable training. Use frequency-smoothed distributions (e.g., (p(w) \propto f(w)^{3/4})).
- Ignoring OOV (out-of-vocabulary) handling: freezing at training vocabulary can break downstream use. Add UNK embeddings, subword models, or hashing tricks.
- Training–inference mismatch: training with dot product but retrieving with cosine (or vice versa) can degrade performance. Match the similarity used in loss to the one used at query time or include explicit normalization layers.
- Not monitoring drift and bias: embedding spaces can encode dataset biases or drift over time. Periodically recalibrate, regularize, and evaluate fairness and stability.
- Numerical instability: sigmoid on large-magnitude dot products can overflow. Use stable implementations and consider clipping or temperature scaling.
Key Formulas
Embedding Lookup
(e_w = E^\top x_w = E_{w,:})
Explanation: The embedding of item w is the w-th row of the embedding matrix E. Multiplying a one-hot vector x_w by E selects that row.
Dot Product
(u^\top v = \sum_{i=1}^{d} u_i v_i)
Explanation: Measures alignment between two vectors by summing elementwise products. Larger values indicate higher alignment.
Cosine Similarity
(\cos(u, v) = \frac{u^\top v}{\lVert u \rVert_2 \lVert v \rVert_2})
Explanation: Angle-based similarity that ignores vector magnitudes. Often used for text and retrieval where direction matters more than scale.
Euclidean Distance
(d(u, v) = \lVert u - v \rVert_2 = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2})
Explanation: Straight-line distance between vectors. Useful when the learning objective assumes Euclidean geometry.
Softmax
(p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}})
Explanation: Turns scores into a probability distribution over classes or contexts. Used in language models and classification.
Cross-Entropy Loss
(L = -\sum_{i} y_i \log p_i)
Explanation: Measures the gap between the true one-hot label y and predicted probabilities p. For a single true class, it reduces to -log of that class's probability.
Skip-gram Negative Sampling
(L = -\log \sigma(u_c^\top v_t) - \sum_{k=1}^{K} \log \sigma(-u_{n_k}^\top v_t))
Explanation: Encourages the target vector v_t and true context u_c to align while pushing away K sampled negatives u_{n_k}. This yields efficient training for word embeddings.
L2 Regularization
(\Omega(E) = \lambda \lVert E \rVert_F^2)
Explanation: Penalizes large weights to prevent overfitting and improve generalization. The Frobenius norm is the L2 norm over all matrix entries.
Triplet Loss
(L = \max(0,\; d(a, p) - d(a, n) + m))
Explanation: For anchor a, positive p, and negative n, enforces that a is closer to p than to n by at least margin m under distance d.
PCA Projection
(C = \frac{1}{n} X^\top X, \quad Z = X W_m)
Explanation: Compute covariance C of centered data X, take the top-m eigenvectors as columns of W_m, and project X to the lower dimension Z = X W_m. Preserves maximal variance linearly.
Johnson–Lindenstrauss Bound
(m = O(\varepsilon^{-2} \log n))
Explanation: Number of random projection dimensions m required to preserve pairwise distances within a factor of (1 \pm \varepsilon) for n points with high probability.
Nearest Neighbor Retrieval
(i^{*} = \arg\max_{i} s(q, e_i))
Explanation: Given a query vector q, return the item i with the highest similarity under s. For k-NN, return the top-k indices.
k-Means Objective
(\min_{\{C_k\}, \{\mu_k\}} \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert_2^2)
Explanation: Partitions vectors into K clusters by minimizing within-cluster squared distances to centroids \mu_k. Useful to summarize or segment embedding spaces.
Complexity Analysis
Lookup is O(d) per item (one row read), and the table itself takes O(|V| d) memory, which dominates at scale. Brute-force nearest-neighbor search over the vocabulary costs O(|V| d) per query; approximate nearest-neighbor indexes trade recall for sublinear query time. One skip-gram negative-sampling update touches the target row and K+1 output rows, i.e., O((K + 2) d) work per training pair.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute L2 norm of a vector
double l2norm(const vector<double>& v){
    double s = 0.0; for(double x : v) s += x*x; return sqrt(max(0.0, s));
}

// Normalize a vector to unit L2 norm (in-place)
void normalize(vector<double>& v){
    double n = l2norm(v); if(n == 0) return; for(double &x : v) x /= n;
}

// Dot product
double dot(const vector<double>& a, const vector<double>& b){
    double s = 0.0; size_t d = a.size();
    for(size_t i=0;i<d;++i) s += a[i]*b[i];
    return s;
}

// Return top-k nearest neighbors by cosine (assumes all rows normalized)
vector<pair<int,double>> topk_cosine(const vector<vector<double>>& E, const vector<double>& q, int k){
    vector<pair<double,int>> scores; scores.reserve(E.size());
    for(size_t i=0;i<E.size();++i){
        scores.emplace_back(dot(E[i], q), (int)i);
    }
    // nth_element to find the top-k, then sort just that prefix in order
    nth_element(scores.begin(), scores.begin()+k, scores.end(), [](auto& a, auto& b){return a.first > b.first;});
    sort(scores.begin(), scores.begin()+k, [](auto& a, auto& b){return a.first > b.first;});
    vector<pair<int,double>> ans; ans.reserve(k);
    for(int i=0;i<k;++i) ans.emplace_back(scores[i].second, scores[i].first);
    return ans;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Toy vocabulary
    vector<string> vocab = {"cat","dog","banana","apple","lion","car"};
    // Toy 3D embeddings (random-ish but we'll normalize)
    vector<vector<double>> E = {
        {0.9,  0.1,  0.0 }, // cat
        {0.88, 0.12, 0.05}, // dog
        {0.0,  0.9,  0.1 }, // banana
        {0.05, 0.88, 0.12}, // apple
        {0.95, 0.05, 0.0 }, // lion
        {0.1,  0.05, 0.95}  // car
    };

    // Normalize all rows for cosine
    for(auto& v : E) normalize(v);

    // Query: use the vector for "cat"
    vector<double> q = E[0]; // already normalized

    int k = 3;
    auto nn = topk_cosine(E, q, k);

    cout << "Top-" << k << " nearest to '" << vocab[0] << "':\n";
    for(auto [idx, score] : nn){
        cout << fixed << setprecision(4) << vocab[idx] << "\t" << score << "\n";
    }
    return 0;
}
We build a tiny embedding table, L2-normalize all rows, and compute cosine similarities to a query vector using dot products (since normalization makes cosine equal to dot). We then find the top-k nearest neighbors with partial sorting. This mirrors how semantic search or recommendation retrieves similar items.
#include <bits/stdc++.h>
using namespace std;

// Numerically stable sigmoid
inline double sigmoid(double x){
    if(x >= 0){
        double z = exp(-x); return 1.0 / (1.0 + z);
    } else {
        double z = exp(x); return z / (1.0 + z);
    }
}

// Perform one SGNS update for (target t, positive context c) with K negatives
void sgns_update(vector<vector<double>>& W_in, vector<vector<double>>& W_out,
                 int t, int c, const vector<int>& negatives, double lr){
    int d = (int)W_in[0].size();
    vector<double> &v = W_in[t];   // input embedding for target
    vector<double> grad_v(d, 0.0); // accumulate gradient for v

    // Positive pair: push sigma(u_c^T v) toward 1
    vector<double> &u_pos = W_out[c];
    double pos_score = inner_product(v.begin(), v.end(), u_pos.begin(), 0.0);
    double coeff_pos = sigmoid(pos_score) - 1.0; // dL/d(u_c^T v)

    // Accumulate grad for v with the pre-update u_pos, then update u_pos
    for(int i=0;i<d;++i){
        grad_v[i] += coeff_pos * u_pos[i];
        u_pos[i]  -= lr * coeff_pos * v[i];
    }

    // Negatives: push sigma(u_n^T v) toward 0
    for(int n_id : negatives){
        vector<double> &u_neg = W_out[n_id];
        double neg_score = inner_product(v.begin(), v.end(), u_neg.begin(), 0.0);
        double coeff_neg = sigmoid(neg_score); // dL/d(u_n^T v)
        for(int i=0;i<d;++i){
            grad_v[i] += coeff_neg * u_neg[i];
            u_neg[i]  -= lr * coeff_neg * v[i];
        }
    }

    // Finally update v (target embedding)
    for(int i=0;i<d;++i){ v[i] -= lr * grad_v[i]; }
}

// Sample K negatives from a given cumulative distribution function (CDF)
vector<int> sample_negatives(const vector<double>& cdf, int K, mt19937& rng){
    uniform_real_distribution<double> U(0.0, 1.0);
    vector<int> out; out.reserve(K);
    for(int k=0;k<K;++k){
        double r = U(rng);
        int idx = (int)(lower_bound(cdf.begin(), cdf.end(), r) - cdf.begin());
        if(idx == (int)cdf.size()) idx = (int)cdf.size()-1;
        out.push_back(idx);
    }
    return out;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Small vocab and dim
    int V = 6, d = 8; double lr = 0.05; int K = 4;

    // Initialize embeddings with small random values
    mt19937 rng(42);
    normal_distribution<double> N(0.0, 0.1);
    vector<vector<double>> W_in(V, vector<double>(d)), W_out(V, vector<double>(d));
    for(int i=0;i<V;++i){
        for(int j=0;j<d;++j){ W_in[i][j] = N(rng); W_out[i][j] = N(rng); }
    }

    // Example frequencies (smoothed) for negative sampling
    vector<double> freq = {10, 12, 3, 4, 9, 2};
    double sum = 0.0; for(double f : freq) sum += pow(f, 0.75);
    vector<double> prob(V), cdf(V);
    for(int i=0;i<V;++i){ prob[i] = pow(freq[i], 0.75) / sum; }
    partial_sum(prob.begin(), prob.end(), cdf.begin());

    // One observed (target,context) pair, e.g., ("cat","dog") as ids (0,1)
    int t = 0, c = 1;
    vector<int> negatives = sample_negatives(cdf, K, rng);

    // Before update: compute pos score
    double before = inner_product(W_in[t].begin(), W_in[t].end(), W_out[c].begin(), 0.0);

    sgns_update(W_in, W_out, t, c, negatives, lr);

    double after = inner_product(W_in[t].begin(), W_in[t].end(), W_out[c].begin(), 0.0);

    cout << fixed << setprecision(6);
    cout << "Positive score before: " << before << "\n";
    cout << "Positive score after : " << after << "\n";
    cout << "Sampled negatives: ";
    for(int id : negatives) cout << id << ' ';
    cout << "\n";
    return 0;
}
Implements a single SGD update for skip-gram with negative sampling. The positive pair (t, c) is pulled together, while K negatives are pushed away. We use a numerically stable sigmoid and sample negatives from a frequency-smoothed distribution. The printed dot product typically increases after the update, indicating stronger alignment between target and true context.
#include <bits/stdc++.h>
using namespace std;

// Project N vectors in R^D to R^m using a dense Gaussian random matrix R (D x m)
vector<vector<double>> random_projection(const vector<vector<double>>& X, int m, unsigned seed=123){
    int N = (int)X.size();
    if(N == 0) return {};
    int D = (int)X[0].size();
    mt19937 rng(seed);
    normal_distribution<double> N01(0.0, 1.0);
    // R scaled by 1/sqrt(m)
    vector<vector<double>> R(D, vector<double>(m));
    double scale = 1.0 / sqrt((double)m);
    for(int i=0;i<D;++i) for(int j=0;j<m;++j) R[i][j] = N01(rng) * scale;

    vector<vector<double>> Z(N, vector<double>(m, 0.0));
    for(int n=0;n<N;++n){
        for(int j=0;j<m;++j){
            double s = 0.0;
            for(int i=0;i<D;++i) s += X[n][i] * R[i][j];
            Z[n][j] = s;
        }
    }
    return Z;
}

// Utility: cosine similarity
double cosine(const vector<double>& a, const vector<double>& b){
    auto dot = inner_product(a.begin(), a.end(), b.begin(), 0.0);
    auto na = sqrt(inner_product(a.begin(), a.end(), a.begin(), 0.0));
    auto nb = sqrt(inner_product(b.begin(), b.end(), b.begin(), 0.0));
    if(na == 0 || nb == 0) return 0.0;
    return dot / (na*nb);
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Build N synthetic D-dimensional vectors
    int N = 200, D = 128, m = 32;
    mt19937 rng(7);
    normal_distribution<double> N01(0.0, 1.0);
    vector<vector<double>> X(N, vector<double>(D));
    for(int n=0;n<N;++n) for(int i=0;i<D;++i) X[n][i] = N01(rng);

    // Compute a few pairwise cosines before projection
    vector<pair<int,int>> pairs = {{0,1},{0,2},{10,11},{50,120}};
    vector<double> before;
    for(auto [i,j]: pairs) before.push_back(cosine(X[i], X[j]));

    // Project down to m dimensions
    auto Z = random_projection(X, m, 123);

    // Compute the same cosines after projection
    vector<double> after;
    for(auto [i,j]: pairs) after.push_back(cosine(Z[i], Z[j]));

    cout << fixed << setprecision(5);
    for(size_t k=0;k<pairs.size();++k){
        cout << "Pair (" << pairs[k].first << "," << pairs[k].second << ")\tbefore="
             << before[k] << "\tafter=" << after[k] << "\n";
    }

    return 0;
}
Generates synthetic high-dimensional vectors, then compresses them using a dense Gaussian random projection scaled by 1/sqrt(m). The Johnson–Lindenstrauss lemma guarantees approximate distance (and thus cosine) preservation with high probability for sufficiently large m. The printed cosine pairs illustrate that similarities are largely preserved after compression.