Representation Learning Theory
Key Points
- Representation learning aims to automatically discover features that make downstream tasks easy, often without human-provided labels.
- Disentanglement seeks representations where independent latent factors each occupy their own coordinate, enabling interpretability and compositional generalization.
- Identifiability asks when the true latent factors can be recovered; linear ICA is identifiable up to permutation and scaling under non-Gaussianity, while nonlinear cases need inductive biases.
- Contrastive learning trains encoders to pull positives together and push negatives apart, with InfoNCE providing a mutual-information lower bound.
- Self-supervised learning creates supervision from pretext tasks (e.g., predicting rotations or masked tokens) to learn broadly useful features.
- Good representations transfer: a single learned embedding can support many tasks via small adapters or linear probes.
- Theory shows practical trade-offs: InfoNCE's similarity matrix costs \(O(B^2)\) per batch while its MI bound grows only as \(\log(K+1)\) in the number of negatives, negatives affect bias/variance, and augmentations encode the intended invariances.
- Deep networks often develop abstractions layer by layer, compressing nuisances while preserving task-relevant information.
Prerequisites
- Linear Algebra (vectors, matrices, eigen-decomposition) — Representations are linear/nonlinear transformations of vectors; ICA and whitening rely on covariance and eigen-decomposition.
- Probability and Random Variables — Independence, expectations, and distributions underpin MI, ICA assumptions, and contrastive sampling.
- Information Theory Basics — Mutual information, data processing, and compression-accuracy trade-offs explain what representations retain or discard.
- Optimization and Gradient Descent — Training encoders minimizes losses like InfoNCE via gradients.
- Matrix Calculus (lightweight) — Deriving gradients of similarity scores with respect to linear maps helps implement and reason about training.
- Numerical Stability (softmax, log-sum-exp) — Avoids overflow/underflow in contrastive objectives.
- Neural Networks and Embeddings — Encoders map inputs to vectors where contrastive losses operate.
- Statistical Estimation and Generalization — Bound interpretations (InfoNCE) and transfer evaluation rely on understanding bias/variance and overfitting.
Detailed Explanation
01 Overview
Representation learning is the study of how to transform raw data (images, audio, text, time series) into features that make learning and reasoning easier. Instead of hand-crafting features, we learn them directly from data by optimizing objectives that encourage invariances, sufficiency, and disentanglement. Classic theory begins with linear independent component analysis (ICA), which recovers statistically independent sources from their mixtures under mild assumptions. Modern deep learning extends this to nonlinear encoders trained with self-supervision: pretext tasks and contrastive learning teach models which variations to ignore and which to keep. Mutual information (MI) and its tractable bounds (such as InfoNCE) formalize how much dependence a representation keeps about its input or a paired view. Identifiability results tell us when latent factors can, in principle, be uniquely recovered. In practice, inductive biases—architectures, data augmentations, and objectives—steer learning toward useful solutions. This theoretical lens explains why contrastive objectives work, when disentangled factors are recoverable, and how transfer learning is possible: a good representation retains the right information to support many tasks with minimal additional training.
02 Intuition & Analogies
Imagine organizing a messy garage. If you toss everything randomly into boxes, you’ll never find what you need. But if you sort by purpose—tools with tools, sports gear with sports gear—you can quickly solve many future tasks. Representations are like your sorting system: a good one groups together variations that shouldn’t matter (e.g., lighting changes in images) and separates differences that do (e.g., object identity). Disentanglement is like giving each shelf a single purpose—one shelf per factor—so you can flexibly recombine them (a red, shiny, small car corresponds to independent shelves for color, gloss, and size). Contrastive learning is the “spot the difference” game: take two slightly different views of the same item (positives) and many unrelated items (negatives); then learn a code that brings the two views together and pushes the unrelated ones apart. Self-supervised pretext tasks are chores that indirectly keep your garage tidy: labeling nothing by hand, you still get order by asking the system to predict rotations, fill in missing pieces, or match augmented views. Identifiability asks whether someone else, using your sorting rules, would rebuild essentially the same shelves—are the underlying categories uniquely recoverable? Linear ICA says yes (up to relabeling and rescaling) if the items (sources) are non-Gaussian. In nonlinear settings, you need extra rules (inductive biases)—like insisting all power tools go on a metal rack (architectural constraints) or that near-duplicate photos should land together (augmentations)—to get back consistent shelves.
03 Formal Definition
Let observations \(x \in \mathbb{R}^{D}\) be generated from latent factors \(s = (s_1,\dots,s_d)\) through a mixing function \(f\), so that \(x = f(s)\). A representation is a map \(\phi : \mathbb{R}^{D} \to \mathbb{R}^{d}\), and representation learning seeks \(\phi\) such that \(z = \phi(x)\) retains task-relevant information about \(x\) while discarding nuisances. In linear ICA, \(f(s) = A s\) for an invertible matrix \(A\) with the \(s_i\) mutually independent and non-Gaussian; identifiability then means \(s\) is recoverable from \(x\) up to permutation and scaling. Contrastive methods instead assume paired views \((x, x^{+})\) drawn from a joint distribution and train \(\phi\) to make \(I(\phi(x); \phi(x^{+}))\) large, in practice by minimizing the InfoNCE loss given under Key Formulas below.
04 When to Use
Use representation learning when labels are scarce, tasks are numerous or evolving, and you expect shared structure across them. Contrastive learning is effective when you can define meaningful augmentations or paired signals (e.g., two crops of the same image, audio-video sync, text-audio pairs). Choose ICA or other linear methods when mixtures are approximately linear and sources are independent and non-Gaussian (e.g., simple signal separation). Apply self-supervised pretext tasks when domain priors are clear: masking for language and vision, next-step prediction for time series, rotation prediction for images. For transfer learning, pretrain broad encoders on large unlabeled corpora and evaluate with linear probes or few-shot fine-tuning. Prefer contrastive objectives when you can afford large batches or memory banks (to provide many negatives) and when invariances are well captured by augmentations; opt for non-contrastive or predictive methods when negatives are hard to define. In scientific discovery or structured data, seek inductive biases (architectures, causal assumptions) to make nonlinear latent factors more identifiable.
⚠️ Common Mistakes
- Assuming disentanglement emerges “for free.” Without explicit inductive biases (architectures, sparsity, augmentations, group structure), nonlinear factors are generally unidentifiable.
- Misusing augmentations. If augmentations remove task-relevant information, the representation collapses or underperforms on downstream tasks.
- Confusing MI estimates with guarantees. InfoNCE provides a bound, not an exact MI value; comparing bounds across setups can be misleading due to estimator bias and temperature effects.
- Too few negatives in contrastive learning. Small effective negative sets increase variance and can overfit to shortcuts; batch size or memory banks matter.
- Ignoring symmetry in ICA. Sources are only identifiable up to permutation and scaling; evaluating with naive MSE can seem poor despite correct recovery modulo these symmetries.
- Overfitting pretext tasks. A model can exploit shortcuts (e.g., boundary artifacts) to solve the pretext objective without learning transferable features; use strong augmentations and diverse tasks.
- Neglecting normalization and stability. Contrastive losses need numerically stable softmax; ICA needs whitening and careful nonlinearity to avoid divergence.
- Forgetting evaluation protocols. Use linear probes, frozen-encoder transfer, and few-shot adaptation to fairly assess representation quality; a minimal linear-probe sketch follows this list.
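A minimal sketch of the linear-probe protocol under illustrative assumptions: synthetic 2D "frozen features" stand in for a real encoder's output, the labels are ±1, and least squares replaces the logistic regression typically used in practice, purely to keep the example self-contained.

#include <bits/stdc++.h>
using namespace std;

// Linear probe on frozen features: fit only a linear classifier (with bias)
// via least squares on +/-1 labels, then report training accuracy.
int main(){
    mt19937 rng(1);
    normal_distribution<double> n0(-1.0, 0.5), n1(1.0, 0.5);
    size_t N = 500; // samples per class (illustrative)
    vector<array<double,3>> X; vector<double> y; // features [x0, x1, 1], labels +/-1
    for(size_t i=0;i<N;++i){ X.push_back({n0(rng), n0(rng), 1.0}); y.push_back(-1.0); }
    for(size_t i=0;i<N;++i){ X.push_back({n1(rng), n1(rng), 1.0}); y.push_back(+1.0); }

    // Normal equations (X^T X) w = X^T y, solved by 3x3 Gaussian elimination.
    double A[3][4] = {{0}};
    for(size_t i=0;i<X.size();++i){
        for(int r=0;r<3;++r){
            for(int c=0;c<3;++c) A[r][c] += X[i][r]*X[i][c];
            A[r][3] += X[i][r]*y[i];
        }
    }
    for(int p=0;p<3;++p){ // forward elimination with partial pivoting
        int best = p;
        for(int r=p+1;r<3;++r) if(fabs(A[r][p]) > fabs(A[best][p])) best = r;
        swap(A[p], A[best]);
        for(int r=p+1;r<3;++r){
            double f = A[r][p]/A[p][p];
            for(int c=p;c<4;++c) A[r][c] -= f*A[p][c];
        }
    }
    double w[3];
    for(int p=2;p>=0;--p){ // back substitution
        double s = A[p][3];
        for(int c=p+1;c<3;++c) s -= A[p][c]*w[c];
        w[p] = s/A[p][p];
    }

    // Accuracy of the probe: the encoder (here, identity) is never updated.
    size_t correct = 0;
    for(size_t i=0;i<X.size();++i){
        double score = w[0]*X[i][0] + w[1]*X[i][1] + w[2]*X[i][2];
        if((score > 0 ? 1.0 : -1.0) == y[i]) ++correct;
    }
    cout << "Linear probe accuracy: " << (double)correct/X.size() << "\n";
    return 0;
}

The key point of the protocol is that only the probe's three parameters are trained; if accuracy is high with the encoder frozen, the representation itself, not the classifier, is doing the work.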
Key Formulas
Mutual Information
\( I(X;Y) = \mathbb{E}_{p(x,y)}\!\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right] \)
Explanation: Mutual information measures how much knowing X reduces uncertainty about Y (and vice versa). It is zero if and only if X and Y are independent.
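As a quick numeric check of this definition, here is a minimal sketch that evaluates \(I(X;Y)\) for a small discrete joint distribution; the 2×2 table is an arbitrary illustrative choice.

#include <bits/stdc++.h>
using namespace std;

// Compute I(X;Y) in nats for a small discrete joint distribution.
int main(){
    // Illustrative 2x2 joint p(x,y); rows index X, columns index Y.
    double p[2][2] = {{0.4, 0.1},
                      {0.1, 0.4}};
    double px[2] = {0,0}, py[2] = {0,0};
    for(int i=0;i<2;++i)
        for(int j=0;j<2;++j){ px[i] += p[i][j]; py[j] += p[i][j]; }
    double mi = 0.0;
    for(int i=0;i<2;++i)
        for(int j=0;j<2;++j)
            if(p[i][j] > 0) mi += p[i][j]*log(p[i][j]/(px[i]*py[j]));
    cout << "I(X;Y) = " << mi << " nats\n"; // zero iff X and Y are independent
    return 0;
}

With this table the program prints roughly 0.19 nats; replacing it with a product distribution (e.g., all entries 0.25) drives the output to zero, matching the independence criterion in the explanation above.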
InfoNCE Loss
\( \mathcal{L}_{\text{InfoNCE}} = -\,\mathbb{E}\left[\log \frac{\exp\big(\mathrm{sim}(z, z^{+})/\tau\big)}{\sum_{k=0}^{K} \exp\big(\mathrm{sim}(z, z_k)/\tau\big)}\right] \), where \(z_0 = z^{+}\) and \(z_1,\dots,z_K\) are negatives.
Explanation: This contrastive objective encourages the similarity of a positive pair to exceed those of K negatives. Lower loss means better separation of positives from negatives.
InfoNCE Bound
\( I(X;Y) \;\ge\; \log(K+1) - \mathcal{L}_{\text{InfoNCE}} \)
Explanation: With K negatives, InfoNCE provides a lower bound on mutual information between X and Y when positives are drawn from the joint and negatives from the product of marginals.
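A useful sanity check on this bound: it can never certify more than \(\log(K+1)\) nats, no matter how small the loss gets. For instance, with \(K = 255\) negatives (batch size 256),
\[
I(X;Y) \;\ge\; \log 256 - \mathcal{L}_{\text{InfoNCE}} \approx 5.55 - \mathcal{L}_{\text{InfoNCE}},
\]
so even a loss of zero certifies at most about 5.55 nats; certifying large mutual information this way requires exponentially many negatives.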
NT-Xent (Per-Anchor)
\( \ell_i = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_{i^+})/\tau\big)}{\sum_{j=1,\ j\neq i}^{2N} \exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)} \)
Explanation: For each anchor i among 2N views, the positive is its paired view \(i^+\), and all other 2N-2 views are negatives. Cosine similarity is often used for \(\mathrm{sim}(\cdot,\cdot)\).
\(\beta\)-VAE Objective
\( \mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big) \)
Explanation: Increasing \(\beta\) tightens the bottleneck, encouraging disentanglement by penalizing information flow from x to z. Too large a \(\beta\) harms reconstruction.
ICA Model
\( x = A s \), with independent sources \(s = (s_1,\dots,s_d)\) and invertible mixing matrix \(A\)
Explanation: Observations are linear mixtures of independent sources. Under non-Gaussianity (except at most one Gaussian), sources are identifiable up to permutation and scaling.
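The permutation-and-scaling ambiguity in this model can be read off directly: for any permutation matrix \(P\) and invertible diagonal matrix \(D\),
\[
x = A s = \big(A P^{-1} D^{-1}\big)\,\big(D P s\big),
\]
and \(D P s\) still has independent components, so the relabeled, rescaled sources explain the data exactly as well. Non-Gaussianity rules out further rotational ambiguity but cannot remove these symmetries.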
Whitening
\( \tilde{x} = \Sigma^{-1/2}(x - \mu), \qquad \Sigma = \mathbb{E}\big[(x-\mu)(x-\mu)^{\top}\big] \)
Explanation: Whitening centers data and removes correlations so covariance becomes the identity. It stabilizes ICA and simplifies subsequent orthogonal unmixing.
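One line verifies the identity-covariance claim, using the symmetry of \(\Sigma^{-1/2}\):
\[
\mathrm{Cov}(\tilde{x}) = \Sigma^{-1/2}\,\mathbb{E}\big[(x-\mu)(x-\mu)^{\top}\big]\,\Sigma^{-1/2} = \Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = I.
\]
After whitening, the remaining unknown mixing is orthogonal, which is why FastICA can restrict its search to unit-norm directions and use simple decorrelation between components.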
FastICA Fixed-Point
\( w^{+} = \mathbb{E}\big[\tilde{x}\, g(w^{\top}\tilde{x})\big] - \mathbb{E}\big[g'(w^{\top}\tilde{x})\big]\, w, \qquad w \leftarrow w^{+}/\lVert w^{+} \rVert \)
Explanation: This update maximizes non-Gaussianity of the projection using a nonlinearity g (e.g., tanh). Normalization enforces unit norm; decorrelation yields multiple components.
Triplet Loss
\( \mathcal{L}_{\text{triplet}} = \max\big(0,\; d(a,p) - d(a,n) + m\big) \)
Explanation: Encourages the anchor to be closer to the positive than to the negative by margin m. It is a non-probabilistic alternative to InfoNCE.
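A minimal sketch of this loss using squared Euclidean distance for \(d\); the embeddings and the margin \(m = 0.5\) below are arbitrary illustrative values.

#include <bits/stdc++.h>
using namespace std;

// Squared Euclidean distance between two embedding vectors.
double sqdist(const vector<double>& u, const vector<double>& v){
    double s = 0.0;
    for(size_t i=0;i<u.size();++i){ double d = u[i]-v[i]; s += d*d; }
    return s;
}

// Triplet loss: hinge on the gap between anchor-positive and anchor-negative distances.
double triplet_loss(const vector<double>& a, const vector<double>& p,
                    const vector<double>& n, double margin){
    return max(0.0, sqdist(a,p) - sqdist(a,n) + margin);
}

int main(){
    vector<double> anchor{1.0, 0.0}, positive{0.9, 0.1}, negative{-1.0, 0.2};
    // d(a,p) = 0.02 and d(a,n) = 4.04, so the hinge is inactive and the loss is 0.
    cout << "Triplet loss: " << triplet_loss(anchor, positive, negative, 0.5) << "\n";
    return 0;
}

Unlike InfoNCE, the loss goes exactly to zero once the margin is satisfied, so a hard-violating negative (one closer to the anchor than the positive plus margin) is needed to produce any gradient.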
Temperature-Scaled Softmax
\( p_k = \frac{\exp(s_k/\tau)}{\sum_j \exp(s_j/\tau)} \)
Explanation: Controls the sharpness of the probability distribution over similarities. Lower temperature emphasizes the largest scores and can increase gradient variance.
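A short sketch showing how \(\tau\) controls sharpness; the scores are arbitrary illustrative values, and the same max-subtraction trick used in the Code Examples section keeps the exponentials stable.

#include <bits/stdc++.h>
using namespace std;

// Numerically stable temperature-scaled softmax.
vector<double> softmax_tau(const vector<double>& s, double tau){
    double m = *max_element(s.begin(), s.end()) / tau; // subtract max for stability
    vector<double> p(s.size());
    double Z = 0.0;
    for(size_t i=0;i<s.size();++i){ p[i] = exp(s[i]/tau - m); Z += p[i]; }
    for(double& v : p) v /= Z;
    return p;
}

int main(){
    vector<double> scores{2.0, 1.0, 0.5};
    for(double tau : {1.0, 0.1}){
        auto p = softmax_tau(scores, tau);
        cout << "tau = " << tau << ":";
        for(double v : p) cout << " " << fixed << setprecision(4) << v;
        cout << "\n"; // lower tau concentrates nearly all mass on the top score
    }
    return 0;
}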
Complexity Analysis
For a batch of B embeddings of dimension d, InfoNCE forms all B² pairwise similarities in O(B²·d) time and O(B²) memory, plus an O(B) log-sum-exp per anchor; the gradient accumulation in the training example below is likewise O(B²·d) per epoch. This quadratic dependence on batch size is why large batches and memory banks are expensive yet valuable sources of negatives. For ICA on n samples in d dimensions, centering and covariance estimation cost O(n·d²), the eigen-decomposition for whitening costs O(d³), and each FastICA fixed-point iteration costs O(n·d) per component.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute dot product between two vectors
double dot(const vector<double>& a, const vector<double>& b){
    double s = 0.0;
    size_t d = a.size();
    for(size_t i=0;i<d;++i) s += a[i]*b[i];
    return s;
}

// Numerically stable log-sum-exp for a vector of scores divided by tau
double logsumexp(const vector<double>& scores, double tau){
    double m = -1e300;
    for(double s : scores) m = max(m, s/tau);
    double sum = 0.0;
    for(double s : scores) sum += exp((s/tau) - m);
    return m + log(max(sum, 1e-300));
}

// InfoNCE loss: anchors = Z1, candidates = Z2; positive for i is Z2[i].
// Similarity is dot product. Temperature tau > 0.
double infoNCE(const vector<vector<double>>& Z1,
               const vector<vector<double>>& Z2,
               double tau){
    size_t B = Z1.size();
    double loss = 0.0;
    for(size_t i=0;i<B;++i){
        vector<double> scores(B);
        for(size_t k=0;k<B;++k){
            scores[k] = dot(Z1[i], Z2[k]); // similarity
        }
        double lse = logsumexp(scores, tau);
        double numerator = scores[i]/tau; // log exp(sim/tau) = sim/tau
        loss += -(numerator - lse);
    }
    return loss / static_cast<double>(B);
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    size_t B = 4, d = 3;
    double tau = 0.2;
    // Create two views: Z2 is Z1 with small noise so positives are aligned
    vector<vector<double>> Z1(B, vector<double>(d));
    vector<vector<double>> Z2(B, vector<double>(d));
    std::mt19937 rng(42);
    std::normal_distribution<double> norm(0.0, 1.0);
    std::normal_distribution<double> small(0.0, 0.1);

    for(size_t i=0;i<B;++i)
        for(size_t j=0;j<d;++j)
            Z1[i][j] = norm(rng);

    for(size_t i=0;i<B;++i)
        for(size_t j=0;j<d;++j)
            Z2[i][j] = Z1[i][j] + small(rng); // positive pairs are close

    double L = infoNCE(Z1, Z2, tau);
    cout << fixed << setprecision(6);
    cout << "InfoNCE loss: " << L << "\n";

    return 0;
}
This program computes the InfoNCE loss for a small batch with two views, assuming the i-th vector in Z1 matches the i-th in Z2. We use dot-product similarity and a temperature-scaled softmax computed via log-sum-exp for numerical stability. The positives are designed to be similar (Z2 is Z1 plus small noise), so the loss should be relatively low. This mirrors the core step in contrastive learning that pulls positives together and pushes against many negatives.
#include <bits/stdc++.h>
using namespace std;

struct Mat2 {
    double a11, a12, a21, a22; // row-major 2x2
};

// y = W * x for 2D vectors
array<double,2> matvec(const Mat2& W, const array<double,2>& x){
    return {W.a11*x[0] + W.a12*x[1], W.a21*x[0] + W.a22*x[1]};
}

// dot for 2D
inline double dot2(const array<double,2>& u, const array<double,2>& v){
    return u[0]*v[0] + u[1]*v[1];
}

// Update W by gradient descent on InfoNCE where sim(i,k) = (W x_i) · y_k.
// Gradient wrt W is sum_k (p_{ik} - 1[k=i]) * (y_k x_i^T) / tau.
int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic data: base points x_i around a circle; positives y_i are noisy rotated versions
    size_t B = 64;        // batch size
    size_t epochs = 200;
    double tau = 0.2;
    double lr = 0.05;
    std::mt19937 rng(123);
    std::uniform_real_distribution<double> uni(0.0, 2.0*M_PI);
    std::normal_distribution<double> noise(0.0, 0.05);

    vector<array<double,2>> X(B), Y(B);
    for(size_t i=0;i<B;++i){
        double t = uni(rng);
        double r = 1.0 + 0.1*noise(rng);
        X[i] = {r*cos(t), r*sin(t)};
        // Augmentation: small rotation + noise
        double dt = 0.1*noise(rng);
        double c = cos(dt), s = sin(dt);
        array<double,2> x = X[i];
        Y[i] = {c*x[0] - s*x[1] + noise(rng), s*x[0] + c*x[1] + noise(rng)};
    }

    // Initialize linear projector W randomly
    std::normal_distribution<double> initn(0.0, 0.5);
    Mat2 W{initn(rng), initn(rng), initn(rng), initn(rng)};

    auto infoNCE_loss = [&](const Mat2& W)->double{
        double L = 0.0;
        for(size_t i=0;i<B;++i){
            array<double,2> z = matvec(W, X[i]);
            vector<double> scores(B);
            double m = -1e300;
            for(size_t k=0;k<B;++k){
                scores[k] = dot2(z, Y[k]);
                m = max(m, scores[k]/tau);
            }
            double denom = 0.0;
            for(size_t k=0;k<B;++k) denom += exp((scores[k]/tau) - m);
            double lse = m + log(max(denom, 1e-300));
            L += -(scores[i]/tau - lse);
        }
        return L / (double)B;
    };

    for(size_t ep=0; ep<epochs; ++ep){
        // Accumulate gradient dL/dW
        double g11=0.0, g12=0.0, g21=0.0, g22=0.0;
        double loss = 0.0;
        for(size_t i=0;i<B;++i){
            array<double,2> z = matvec(W, X[i]);
            vector<double> scores(B);
            double m = -1e300;
            for(size_t k=0;k<B;++k){
                scores[k] = dot2(z, Y[k]);
                m = max(m, scores[k]/tau);
            }
            double denom = 0.0;
            for(size_t k=0;k<B;++k) denom += exp((scores[k]/tau) - m);
            vector<double> probs(B);
            for(size_t k=0;k<B;++k) probs[k] = exp((scores[k]/tau) - m) / max(denom, 1e-300);

            // Per-sample loss contribution
            double lse = m + log(max(denom, 1e-300));
            loss += -(scores[i]/tau - lse);

            // Gradient: sum_k (p_{ik} - 1[k=i]) * (y_k x_i^T) / tau
            for(size_t k=0;k<B;++k){
                double coeff = (probs[k] - (k==i ? 1.0 : 0.0)) / tau;
                // y_k x_i^T adds to W's gradient
                g11 += coeff * (Y[k][0] * X[i][0]);
                g12 += coeff * (Y[k][0] * X[i][1]);
                g21 += coeff * (Y[k][1] * X[i][0]);
                g22 += coeff * (Y[k][1] * X[i][1]);
            }
        }
        loss /= (double)B;
        // Gradient step (average over batch)
        W.a11 -= lr * (g11 / (double)B);
        W.a12 -= lr * (g12 / (double)B);
        W.a21 -= lr * (g21 / (double)B);
        W.a22 -= lr * (g22 / (double)B);

        if((ep+1) % 50 == 0){
            cout << "Epoch " << (ep+1) << ": loss = " << fixed << setprecision(6) << loss << "\n";
        }
    }

    // Final loss
    cout << "Final InfoNCE loss: " << fixed << setprecision(6) << infoNCE_loss(W) << "\n";
    return 0;
}
We generate 2D points on a noisy circle and create positive pairs by applying a small random rotation and noise. We learn a 2×2 linear projection W so that (W x_i) is close (by dot product) to its positive y_i and far from other y_k via InfoNCE. The gradient of the dot similarity with respect to W is y_k x_i^T, making the implementation simple. The printed loss should decrease over epochs, illustrating how contrastive objectives shape a representation even with a tiny linear model.
#include <bits/stdc++.h>
using namespace std;

struct Vec2 { double x, y; };
struct Mat2 { double a, b, c, d; /* [a b; c d] */ };

Vec2 operator+(const Vec2& u, const Vec2& v){ return {u.x+v.x, u.y+v.y}; }
Vec2 operator-(const Vec2& u, const Vec2& v){ return {u.x-v.x, u.y-v.y}; }
Vec2 operator*(double s, const Vec2& u){ return {s*u.x, s*u.y}; }
Vec2 operator*(const Mat2& M, const Vec2& v){ return {M.a*v.x + M.b*v.y, M.c*v.x + M.d*v.y}; }
Mat2 operator*(const Mat2& A, const Mat2& B){ return {A.a*B.a + A.b*B.c, A.a*B.b + A.b*B.d, A.c*B.a + A.d*B.c, A.c*B.b + A.d*B.d}; }
Mat2 transpose(const Mat2& M){ return {M.a, M.c, M.b, M.d}; }

double dot(const Vec2& u, const Vec2& v){ return u.x*v.x + u.y*v.y; }
Vec2 normalize(const Vec2& v){ double n = sqrt(max(1e-12, v.x*v.x + v.y*v.y)); return {v.x/n, v.y/n}; }

// Compute mean of vectors
Vec2 meanv(const vector<Vec2>& X){ Vec2 m{0,0}; for(auto& x: X){ m.x+=x.x; m.y+=x.y; } m.x/=X.size(); m.y/=X.size(); return m; }

// Covariance of centered data
Mat2 cov(const vector<Vec2>& X){
    double sxx=0, sxy=0, syy=0; size_t n=X.size();
    for(auto& x: X){ sxx += x.x*x.x; sxy += x.x*x.y; syy += x.y*x.y; }
    sxx/=n; sxy/=n; syy/=n;
    return {sxx, sxy, sxy, syy};
}

// Eigen-decomposition of symmetric 2x2 [[a,b],[b,c]]
void eig2_sym(double a, double b, double c, double &l1, double &l2, Mat2 &V){
    double tr = a + c;
    double det = a*c - b*b;
    double temp = sqrt(max(0.0, tr*tr/4.0 - det));
    l1 = tr/2.0 + temp; l2 = tr/2.0 - temp;
    // Eigenvector for l1: (b, l1 - a) if b != 0, else (1,0)
    Vec2 v1 = (fabs(b) > 1e-12) ? Vec2{b, l1 - a} : Vec2{1.0, 0.0};
    Vec2 v2 = (fabs(b) > 1e-12) ? Vec2{b, l2 - a} : Vec2{0.0, 1.0};
    v1 = normalize(v1); v2 = normalize(v2);
    // Make v2 orthogonal to v1 in case of numerical issues
    double proj = v1.x*v2.x + v1.y*v2.y;
    v2 = normalize(Vec2{v2.x - proj*v1.x, v2.y - proj*v1.y});
    V = {v1.x, v2.x, v1.y, v2.y}; // columns are eigenvectors
}

// Whitening matrix: W = V * diag(1/sqrt(l)) * V^T
Mat2 whitening_from_cov(const Mat2& C){
    double l1, l2; Mat2 V;
    eig2_sym(C.a, C.b, C.d, l1, l2, V);
    double s1 = 1.0 / sqrt(max(l1, 1e-12));
    double s2 = 1.0 / sqrt(max(l2, 1e-12));
    Mat2 D{ s1, 0.0, 0.0, s2 };
    Mat2 VT = transpose(V);
    return V * D * VT;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate independent non-Gaussian sources (Laplace and Uniform)
    size_t n = 5000;
    std::mt19937 rng(7);
    std::exponential_distribution<double> expo(1.0); // Laplace via difference of exponentials
    std::uniform_real_distribution<double> uni(-1.0, 1.0);
    vector<Vec2> S(n);
    for(size_t i=0;i<n;++i){
        double lap = (expo(rng) - expo(rng));
        double uni0 = uni(rng);
        S[i] = {lap, uni0};
    }

    // Mixing matrix A (invertible)
    Mat2 A{1.0, 0.7, -0.4, 1.2};
    vector<Vec2> X(n);
    for(size_t i=0;i<n;++i){ X[i] = A * S[i]; }

    // Center
    Vec2 mX = meanv(X);
    for(auto& x: X){ x = x - mX; }

    // Whiten
    Mat2 C = cov(X);
    Mat2 Wwhite = whitening_from_cov(C);
    vector<Vec2> Xw(n);
    for(size_t i=0;i<n;++i){ Xw[i] = Wwhite * X[i]; }

    // FastICA with g(u)=tanh(u), g'(u)=1 - tanh^2(u)
    auto fastica_component = [&](const vector<Vec2>& Xw, const Vec2& w_init, const Vec2* w_orth) {
        Vec2 w = normalize(w_init);
        for(int it=0; it<200; ++it){
            Vec2 Egx{0,0}; double Egp = 0.0; // E[x g(w^T x)], E[g'(w^T x)]
            for(const auto& x : Xw){
                double u = w.x*x.x + w.y*x.y;
                double g = tanh(u);
                double gp = 1.0 - g*g;
                Egx.x += x.x * g; Egx.y += x.y * g;
                Egp += gp;
            }
            double invn = 1.0 / (double)Xw.size();
            Egx.x *= invn; Egx.y *= invn; Egp *= invn;
            Vec2 w_new{Egx.x - Egp*w.x, Egx.y - Egp*w.y};
            // Decorrelate from previous component if provided
            if(w_orth){
                double proj = w_new.x*w_orth->x + w_new.y*w_orth->y;
                w_new.x -= proj * w_orth->x;
                w_new.y -= proj * w_orth->y;
            }
            w = normalize(w_new);
        }
        return w;
    };

    // Estimate two components
    Vec2 w1 = fastica_component(Xw, Vec2{1.0, 0.0}, nullptr);
    Vec2 w2 = fastica_component(Xw, Vec2{0.0, 1.0}, &w1);

    // Recover sources: s_hat = [w1; w2] * xw
    vector<Vec2> Sh(n);
    for(size_t i=0;i<n;++i){
        double s1 = w1.x*Xw[i].x + w1.y*Xw[i].y;
        double s2 = w2.x*Xw[i].x + w2.y*Xw[i].y;
        Sh[i] = {s1, s2};
    }

    // Evaluate by absolute correlation (up to permutation & scaling)
    auto corr = [&](const vector<double>& a, const vector<double>& b){
        size_t N = a.size();
        double ma=0, mb=0; for(size_t i=0;i<N;++i){ ma+=a[i]; mb+=b[i]; } ma/=N; mb/=N;
        double va=0, vb=0, cab=0;
        for(size_t i=0;i<N;++i){ double da=a[i]-ma, db=b[i]-mb; va+=da*da; vb+=db*db; cab+=da*db; }
        return cab / sqrt(max(1e-12, va*vb));
    };

    vector<double> s1(n), s2(n), h1(n), h2(n);
    for(size_t i=0;i<n;++i){ s1[i]=S[i].x; s2[i]=S[i].y; h1[i]=Sh[i].x; h2[i]=Sh[i].y; }

    double c11 = fabs(corr(s1,h1)), c12 = fabs(corr(s1,h2));
    double c21 = fabs(corr(s2,h1)), c22 = fabs(corr(s2,h2));

    cout << fixed << setprecision(4);
    cout << "|corr(s1, h1)| = " << c11 << ", |corr(s1, h2)| = " << c12 << "\n";
    cout << "|corr(s2, h1)| = " << c21 << ", |corr(s2, h2)| = " << c22 << "\n";
    cout << "(Expect each source to align strongly with one recovered component.)\n";

    return 0;
}
This demo generates two independent sources (Laplace and Uniform), mixes them linearly, and then applies whitening followed by 2D FastICA. The fixed-point update with g(u)=tanh(u) maximizes non-Gaussianity, and orthogonalization enforces independent components. The printed absolute correlations show that each true source aligns with one recovered component (up to permutation and scaling), illustrating linear identifiability in ICA.