
Representation Learning Theory

Key Points

  • Representation learning aims to automatically discover features that make downstream tasks easy, often without human-provided labels.
  • Disentanglement seeks representations where independent latent factors each occupy their own coordinate, enabling interpretability and compositional generalization.
  • Identifiability asks when the true latent factors can be recovered; linear ICA is identifiable up to permutation and scaling under non-Gaussianity, while nonlinear cases need inductive biases.
  • Contrastive learning trains encoders to pull positives together and push negatives apart, with InfoNCE providing a mutual-information lower bound.
  • Self-supervised learning creates supervision from pretext tasks (e.g., predicting rotations or masked tokens) to learn broadly useful features.
  • Good representations transfer: a single learned embedding can support many tasks via small adapters or linear probes.
  • Theory shows practical trade-offs: InfoNCE computation scales as O(B^2) with batch size B, the number of negatives affects bias/variance, and augmentations encode the intended invariances.
  • Deep networks often develop abstractions layer-by-layer, compressing nuisances while preserving task-relevant information.

Prerequisites

  • Linear Algebra (vectors, matrices, eigen-decomposition): Representations are linear/nonlinear transformations of vectors; ICA and whitening rely on covariance and eigen-decomposition.
  • Probability and Random Variables: Independence, expectations, and distributions underpin MI, ICA assumptions, and contrastive sampling.
  • Information Theory Basics: Mutual information, data processing, and compression-accuracy trade-offs explain what representations retain or discard.
  • Optimization and Gradient Descent: Training encoders minimizes losses like InfoNCE via gradients.
  • Matrix Calculus (lightweight): Deriving gradients of similarity scores with respect to linear maps helps implement and reason about training.
  • Numerical Stability (softmax, log-sum-exp): Avoids overflow/underflow in contrastive objectives.
  • Neural Networks and Embeddings: Encoders map inputs to vectors where contrastive losses operate.
  • Statistical Estimation and Generalization: Interpreting bounds (InfoNCE) and evaluating transfer rely on understanding bias/variance and overfitting.

Detailed Explanation


01. Overview

Representation learning is the study of how to transform raw data (images, audio, text, time series) into features that make learning and reasoning easier. Instead of hand-crafting features, we learn them directly from data by optimizing objectives that encourage invariances, sufficiency, and disentanglement. Classic theory begins with linear independent component analysis (ICA), which recovers statistically independent sources from their mixtures under mild assumptions. Modern deep learning extends this to nonlinear encoders trained with self-supervision: pretext tasks and contrastive learning teach models which variations to ignore and which to keep. Mutual information (MI) and its tractable bounds (such as InfoNCE) formalize how much dependence a representation keeps about its input or a paired view. Identifiability results tell us when latent factors can, in principle, be uniquely recovered. In practice, inductive biases—architectures, data augmentations, and objectives—steer learning toward useful solutions. This theoretical lens explains why contrastive objectives work, when disentangled factors are recoverable, and how transfer learning is possible: a good representation retains the right information to support many tasks with minimal additional training.

02. Intuition & Analogies

Imagine organizing a messy garage. If you toss everything randomly into boxes, you’ll never find what you need. But if you sort by purpose—tools with tools, sports gear with sports gear—you can quickly solve many future tasks. Representations are like your sorting system: a good one groups together variations that shouldn’t matter (e.g., lighting changes in images) and separates differences that do (e.g., object identity). Disentanglement is like giving each shelf a single purpose—one shelf per factor—so you can flexibly recombine them (a red, shiny, small car corresponds to independent shelves for color, gloss, and size). Contrastive learning is the “spot the difference” game: take two slightly different views of the same item (positives) and many unrelated items (negatives); then learn a code that brings the two views together and pushes the unrelated ones apart. Self-supervised pretext tasks are chores that indirectly keep your garage tidy: labeling nothing by hand, you still get order by asking the system to predict rotations, fill in missing pieces, or match augmented views. Identifiability asks whether someone else, using your sorting rules, would rebuild essentially the same shelves—are the underlying categories uniquely recoverable? Linear ICA says yes (up to relabeling and rescaling) if the items (sources) are non-Gaussian. In nonlinear settings, you need extra rules (inductive biases)—like insisting all power tools go on a metal rack (architectural constraints) or that near-duplicate photos should land together (augmentations)—to get back consistent shelves.

03. Formal Definition

Let X denote observed data and \(Z = f(X)\) a learned representation. A representation is useful if it preserves task-relevant information while discarding nuisances. Mutual information \(I(X;Z)\) quantifies statistical dependence; sufficiency for a task Y means \(I(Y;X \mid Z) = 0\), i.e., Z contains all information in X relevant to Y. Disentanglement informally requires coordinates of Z to align with independent generative factors \(s_1, s_2, \dots\), often measured via independence and predictability metrics. Identifiability of a latent variable model (e.g., ICA) asks whether parameters are uniquely determined (up to symmetries) by the data distribution. In linear ICA, \(X = A S\) with independent, non-Gaussian components S; under mild conditions, sources are recoverable up to permutation and scaling. Contrastive learning defines an encoder \(f_\theta\) and similarity \(s(z,z')\), optimizing a loss like InfoNCE that favors high similarity for positive pairs and low for negatives. Self-supervised objectives derive signals from the data itself—augmentations, masking, or context—to guide \(f_\theta\). Transfer learning evaluates whether Z generalizes: linear probes or small fine-tuning on new tasks quantify how much useful structure Z retained.

04. When to Use

Use representation learning when labels are scarce, tasks are numerous or evolving, and you expect shared structure across them. Contrastive learning is effective when you can define meaningful augmentations or paired signals (e.g., two crops of the same image, audio-video sync, text-audio pairs). Choose ICA or other linear methods when mixtures are approximately linear and sources are independent and non-Gaussian (e.g., simple signal separation). Apply self-supervised pretext tasks when domain priors are clear: masking for language and vision, next-step prediction for time series, rotation prediction for images. For transfer learning, pretrain broad encoders on large unlabeled corpora and evaluate with linear probes or few-shot fine-tuning. Prefer contrastive objectives when you can afford large batches or memory banks (to provide many negatives) and when invariances are well captured by augmentations; opt for non-contrastive or predictive methods when negatives are hard to define. In scientific discovery or structured data, seek inductive biases (architectures, causal assumptions) to make nonlinear latent factors more identifiable.

⚠️ Common Mistakes

  • Assuming disentanglement emerges “for free.” Without explicit inductive biases (architectures, sparsity, augmentations, group structure), nonlinear factors are generally unidentifiable.
  • Misusing augmentations. If augmentations remove task-relevant information, the representation collapses or underperforms on downstream tasks.
  • Confusing MI estimates with guarantees. InfoNCE provides a bound, not an exact MI value; comparing bounds across setups can be misleading due to estimator bias and temperature effects.
  • Too few negatives in contrastive learning. Small effective negative sets increase variance and can overfit to shortcuts; batch size or memory banks matter.
  • Ignoring symmetry in ICA. Sources are only identifiable up to permutation and scaling; evaluating with naive MSE can seem poor despite correct recovery modulo these symmetries.
  • Overfitting pretext tasks. A model can exploit shortcuts (e.g., boundary artifacts) to solve the pretext objective without learning transferable features; use strong augmentations and diverse tasks.
  • Neglecting normalization and stability. Contrastive losses need numerically stable softmax; ICA needs whitening and careful nonlinearity to avoid divergence.
  • Forgetting evaluation protocols. Use linear probes, frozen-encoder transfer, and few-shot adaptation to fairly assess representation quality.

Key Formulas

Mutual Information

\[ I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \]

Explanation: Mutual information measures how much knowing X reduces uncertainty about Y (and vice versa). It is zero if and only if X and Y are independent.

InfoNCE Loss

\[ \mathcal{L}_{\mathrm{InfoNCE}} = -\,\mathbb{E}\left[ \log \frac{\exp\!\big(s(z, z^{+})/\tau\big)}{\exp\!\big(s(z, z^{+})/\tau\big) + \sum_{k=1}^{K} \exp\!\big(s(z, z_{k}^{-})/\tau\big)} \right] \]

Explanation: This contrastive objective encourages the similarity of a positive pair to exceed those of K negatives. Lower loss means better separation of positives from negatives.

InfoNCE Bound

\[ I(X;Y) \;\ge\; \log(K+1) - \mathcal{L}_{\mathrm{InfoNCE}} \]

Explanation: With K negatives, InfoNCE provides a lower bound on mutual information between X and Y when positives are drawn from the joint and negatives from the product of marginals.

NT-Xent (Per-Anchor)

\[ \ell_i = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_{i^{+}})/\tau\big)}{\sum_{j=1,\, j \neq i}^{2N} \exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)} \]

Explanation: For each anchor i among 2N views, the positive is its paired view i^+, and all other 2N-2 views are negatives. Cosine similarity is often used for \(\mathrm{sim}\).

\(\beta\)-VAE Objective

\[ \mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \]

Explanation: Increasing \(\beta\) tightens the bottleneck, encouraging disentanglement by penalizing information flow from x to z. Too large \(\beta\) harms reconstruction.

ICA Model

\[ X = A S \]

Explanation: Observations are linear mixtures of independent sources. Under non-Gaussianity (except at most one Gaussian), sources are identifiable up to permutation and scaling.

Whitening

\[ \tilde{x} = \Sigma^{-1/2} (x - \mu), \qquad \Sigma^{-1/2} = V \Lambda^{-1/2} V^{\top} \]

Explanation: Whitening centers data and removes correlations so covariance becomes the identity. It stabilizes ICA and simplifies subsequent orthogonal unmixing.

FastICA Fixed-Point

\[ w^{+} = \mathbb{E}\big[x\, g(w^{\top} x)\big] - \mathbb{E}\big[g'(w^{\top} x)\big]\, w, \qquad w \leftarrow \frac{w^{+}}{\|w^{+}\|} \]

Explanation: This update maximizes non-Gaussianity of the projection using a nonlinearity g (e.g., tanh). Normalization enforces unit norm; decorrelation yields multiple components.

Triplet Loss

\[ \mathcal{L} = \max\big(0,\; d(a, p) - d(a, n) + m\big) \]

Explanation: Encourages the anchor to be closer to the positive than to the negative by margin m. It is a non-probabilistic alternative to InfoNCE.

Temperature-Scaled Softmax

\[ p_k = \frac{\exp(s_k/\tau)}{\sum_{j} \exp(s_j/\tau)} \]

Explanation: Controls the sharpness of the probability distribution over similarities. Lower temperature emphasizes the largest scores and can increase gradient variance.

Complexity Analysis

Contrastive learning with InfoNCE computes, for each anchor, similarities to all negatives in the batch (or memory bank). With batch size B and embedding dimension d, computing the similarity matrix costs O(B^2 d) time and O(B^2) memory if fully materialized; a streaming softmax reduces memory to O(B). The per-step backward pass has the same asymptotic cost since gradients propagate through all pairwise scores. If an encoder has L layers and per-sample cost O(C), the full step is O(B·C + B^2 d). This quadratic term often dominates, motivating large-batch systems and approximate negative sampling. Linear ICA on n samples of dimension d after centering and whitening has costs dominated by covariance estimation O(n d^2) and eigen-decomposition O(d^3) (d is small in many settings). FastICA’s fixed-point iterations require O(T·n d) per component (T iterations), with decorrelation steps O(d^2) per update. Memory is O(n d) to store data plus O(d^2) for covariance and basis vectors. For the simplified C++ demos: (1) computing InfoNCE on two B×d matrices uses O(B^2 d) time and O(B d) memory if the similarity matrix is not stored; (2) training a small linear projection with InfoNCE uses O(E·B^2 d) time for E epochs; (3) 2D FastICA runs in O(n) per iteration with negligible constants. In all cases, numerical stability (log-sum-exp), careful normalization (for cosine similarities or whitening), and appropriate batch sizes are crucial to avoid overflow, slow convergence, or collapse.

Code Examples

Compute InfoNCE loss for a batch of embeddings (two views, dot-similarity)
#include <bits/stdc++.h>
using namespace std;

// Compute dot product between two vectors
double dot(const vector<double>& a, const vector<double>& b){
    double s = 0.0;
    size_t d = a.size();
    for(size_t i=0;i<d;++i) s += a[i]*b[i];
    return s;
}

// Numerically stable log-sum-exp for a vector of scores divided by tau
double logsumexp(const vector<double>& scores, double tau){
    double m = -1e300;
    for(double s : scores) m = max(m, s/tau);
    double sum = 0.0;
    for(double s : scores) sum += exp((s/tau) - m);
    return m + log(max(sum, 1e-300));
}

// InfoNCE loss: anchors = Z1, candidates = Z2; positive for i is Z2[i].
// Similarity is dot product. Temperature tau > 0.
double infoNCE(const vector<vector<double>>& Z1,
               const vector<vector<double>>& Z2,
               double tau){
    size_t B = Z1.size();
    double loss = 0.0;
    for(size_t i=0;i<B;++i){
        vector<double> scores(B);
        for(size_t k=0;k<B;++k){
            scores[k] = dot(Z1[i], Z2[k]); // similarity
        }
        double lse = logsumexp(scores, tau);
        double numerator = scores[i]/tau; // log exp(sim/tau) = sim/tau
        loss += -(numerator - lse);
    }
    return loss / static_cast<double>(B);
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    size_t B = 4, d = 3;
    double tau = 0.2;
    // Create two views: Z2 is Z1 with small noise so positives are aligned
    vector<vector<double>> Z1(B, vector<double>(d));
    vector<vector<double>> Z2(B, vector<double>(d));
    std::mt19937 rng(42);
    std::normal_distribution<double> norm(0.0, 1.0);
    std::normal_distribution<double> small(0.0, 0.1);

    for(size_t i=0;i<B;++i){
        for(size_t j=0;j<d;++j){
            Z1[i][j] = norm(rng);
        }
    }
    for(size_t i=0;i<B;++i){
        for(size_t j=0;j<d;++j){
            Z2[i][j] = Z1[i][j] + small(rng); // positive pairs are close
        }
    }

    double L = infoNCE(Z1, Z2, tau);
    cout << fixed << setprecision(6);
    cout << "InfoNCE loss: " << L << "\n";

    return 0;
}

This program computes the InfoNCE loss for a small batch with two views, assuming the i-th vector in Z1 matches the i-th in Z2. We use dot-product similarity and a temperature-scaled softmax computed via log-sum-exp for numerical stability. The positives are designed to be similar (Z2 is Z1 plus small noise), so the loss should be relatively low. This mirrors the core step in contrastive learning that pulls positives together and pushes against many negatives.

Time: O(B^2 d) for a B×d batch (pairwise similarities per anchor). Space: O(B d) to store inputs; O(B) temporary storage per anchor.

Train a tiny linear projection with InfoNCE on synthetic pairs (simplified SimCLR)
#include <bits/stdc++.h>
using namespace std;

struct Mat2 {
    double a11, a12, a21, a22; // row-major 2x2
};

// y = W * x for 2D vectors
array<double,2> matvec(const Mat2& W, const array<double,2>& x){
    return {W.a11*x[0] + W.a12*x[1], W.a21*x[0] + W.a22*x[1]};
}

// dot for 2D
inline double dot2(const array<double,2>& u, const array<double,2>& v){
    return u[0]*v[0] + u[1]*v[1];
}

// Update W by gradient descent on InfoNCE where sim(i,k) = (W x_i) · y_k
// Gradient wrt W is sum_k (p_{ik} - 1[k=i]) * (y_k x_i^T)
int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic data: base points x_i around a circle; positives y_i are noisy rotated versions
    size_t B = 64;       // batch size
    size_t epochs = 200;
    double tau = 0.2;
    double lr = 0.05;
    std::mt19937 rng(123);
    std::uniform_real_distribution<double> uni(0.0, 2.0*M_PI);
    std::normal_distribution<double> noise(0.0, 0.05);

    vector<array<double,2>> X(B), Y(B);
    for(size_t i=0;i<B;++i){
        double t = uni(rng);
        double r = 1.0 + 0.1*noise(rng);
        X[i] = {r*cos(t), r*sin(t)};
        // Augmentation: small rotation + noise
        double dt = 0.1*noise(rng);
        double c = cos(dt), s = sin(dt);
        array<double,2> x = X[i];
        Y[i] = {c*x[0] - s*x[1] + noise(rng), s*x[0] + c*x[1] + noise(rng)};
    }

    // Initialize linear projector W randomly
    std::normal_distribution<double> initn(0.0, 0.5);
    Mat2 W{initn(rng), initn(rng), initn(rng), initn(rng)};

    auto infoNCE_loss = [&](const Mat2& W)->double{
        double L = 0.0;
        for(size_t i=0;i<B;++i){
            array<double,2> z = matvec(W, X[i]);
            vector<double> scores(B);
            double m = -1e300;
            for(size_t k=0;k<B;++k){
                scores[k] = dot2(z, Y[k]);
                m = max(m, scores[k]/tau);
            }
            double denom = 0.0;
            for(size_t k=0;k<B;++k) denom += exp((scores[k]/tau) - m);
            double lse = m + log(max(denom, 1e-300));
            L += -(scores[i]/tau - lse);
        }
        return L / (double)B;
    };

    for(size_t ep=0; ep<epochs; ++ep){
        // Accumulate gradient dL/dW
        double g11=0.0, g12=0.0, g21=0.0, g22=0.0;
        double loss = 0.0;
        for(size_t i=0;i<B;++i){
            array<double,2> z = matvec(W, X[i]);
            vector<double> scores(B);
            double m = -1e300;
            for(size_t k=0;k<B;++k){
                scores[k] = dot2(z, Y[k]);
                m = max(m, scores[k]/tau);
            }
            double denom = 0.0;
            for(size_t k=0;k<B;++k) denom += exp((scores[k]/tau) - m);
            vector<double> probs(B);
            for(size_t k=0;k<B;++k) probs[k] = exp((scores[k]/tau) - m) / max(denom, 1e-300);

            // Per-sample loss contribution
            double lse = m + log(max(denom, 1e-300));
            loss += -(scores[i]/tau - lse);

            // Gradient: sum_k (p_{ik} - 1[k=i]) * (y_k x_i^T) / tau
            for(size_t k=0;k<B;++k){
                double coeff = (probs[k] - (k==i ? 1.0 : 0.0)) / tau;
                // y_k x_i^T adds to W's gradient
                g11 += coeff * (Y[k][0] * X[i][0]);
                g12 += coeff * (Y[k][0] * X[i][1]);
                g21 += coeff * (Y[k][1] * X[i][0]);
                g22 += coeff * (Y[k][1] * X[i][1]);
            }
        }
        loss /= (double)B;
        // Gradient step (average over batch)
        W.a11 -= lr * (g11 / (double)B);
        W.a12 -= lr * (g12 / (double)B);
        W.a21 -= lr * (g21 / (double)B);
        W.a22 -= lr * (g22 / (double)B);

        if((ep+1)%50==0){
            cout << "Epoch " << (ep+1) << ": loss = " << fixed << setprecision(6) << loss << "\n";
        }
    }

    // Final loss
    cout << "Final InfoNCE loss: " << fixed << setprecision(6) << infoNCE_loss(W) << "\n";
    return 0;
}

We generate 2D points on a noisy circle and create positive pairs by applying a small random rotation and noise. We learn a 2×2 linear projection W so that (W x_i) is close (by dot product) to its positive y_i and far from other y_k via InfoNCE. The gradient of the dot similarity with respect to W is y_k x_i^T, making the implementation simple. The printed loss should decrease over epochs, illustrating how contrastive objectives shape a representation even with a tiny linear model.

Time: O(E · B^2 d) with d=2 and E epochs (pairwise scores per anchor per epoch). Space: O(B d) for data plus O(B) for softmax probabilities.

2D FastICA on synthetic mixtures (recover independent sources)
#include <bits/stdc++.h>
using namespace std;

struct Vec2 { double x, y; };
struct Mat2 { double a,b,c,d; /* [a b; c d] */ };

Vec2 operator+(const Vec2& u, const Vec2& v){ return {u.x+v.x, u.y+v.y}; }
Vec2 operator-(const Vec2& u, const Vec2& v){ return {u.x-v.x, u.y-v.y}; }
Vec2 operator*(double s, const Vec2& u){ return {s*u.x, s*u.y}; }
Vec2 operator*(const Mat2& M, const Vec2& v){ return {M.a*v.x + M.b*v.y, M.c*v.x + M.d*v.y}; }
Mat2 operator*(const Mat2& A, const Mat2& B){ return {A.a*B.a + A.b*B.c, A.a*B.b + A.b*B.d, A.c*B.a + A.d*B.c, A.c*B.b + A.d*B.d}; }
Mat2 transpose(const Mat2& M){ return {M.a, M.c, M.b, M.d}; }

double dot(const Vec2& u, const Vec2& v){ return u.x*v.x + u.y*v.y; }
Vec2 normalize(const Vec2& v){ double n = sqrt(max(1e-12, v.x*v.x + v.y*v.y)); return {v.x/n, v.y/n}; }

// Compute mean of vectors
Vec2 meanv(const vector<Vec2>& X){ Vec2 m{0,0}; for(auto& x: X){ m.x+=x.x; m.y+=x.y; } m.x/=X.size(); m.y/=X.size(); return m; }

// Covariance of centered data
Mat2 cov(const vector<Vec2>& X){
    double sxx=0, sxy=0, syy=0; size_t n=X.size();
    for(auto& x: X){ sxx += x.x*x.x; sxy += x.x*x.y; syy += x.y*x.y; }
    sxx/=n; sxy/=n; syy/=n;
    return {sxx, sxy, sxy, syy};
}

// Eigen-decomposition of symmetric 2x2 [[a,b],[b,c]]
void eig2_sym(double a, double b, double c, double &l1, double &l2, Mat2 &V){
    double tr = a + c;
    double det = a*c - b*b;
    double temp = sqrt(max(0.0, tr*tr/4.0 - det));
    l1 = tr/2.0 + temp; l2 = tr/2.0 - temp;
    // Eigenvector for l1: (b, l1 - a) if b != 0, else (1,0)
    Vec2 v1 = (fabs(b) > 1e-12) ? Vec2{b, l1 - a} : Vec2{1.0, 0.0};
    Vec2 v2 = (fabs(b) > 1e-12) ? Vec2{b, l2 - a} : Vec2{0.0, 1.0};
    v1 = normalize(v1); v2 = normalize(v2);
    // Orthonormalize: make v2 orthogonal to v1 in case of numerical issues
    double proj = v1.x*v2.x + v1.y*v2.y;
    v2 = normalize(Vec2{v2.x - proj*v1.x, v2.y - proj*v1.y});
    V = {v1.x, v2.x, v1.y, v2.y}; // columns are eigenvectors
}

// Whitening matrix: W = V * diag(1/sqrt(l)) * V^T
Mat2 whitening_from_cov(const Mat2& C){
    double l1,l2; Mat2 V;
    eig2_sym(C.a, C.b, C.d, l1, l2, V);
    double s1 = 1.0 / sqrt(max(l1, 1e-12));
    double s2 = 1.0 / sqrt(max(l2, 1e-12));
    Mat2 D{ s1, 0.0, 0.0, s2 };
    Mat2 VT = transpose(V);
    return V * D * VT;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate independent non-Gaussian sources (Laplace and Uniform)
    size_t n = 5000;
    std::mt19937 rng(7);
    std::exponential_distribution<double> expo(1.0); // Laplace via difference of exponentials
    std::uniform_real_distribution<double> uni(-1.0, 1.0);
    vector<Vec2> S(n);
    for(size_t i=0;i<n;++i){
        double lap = (expo(rng) - expo(rng));
        double uni0 = uni(rng);
        S[i] = {lap, uni0};
    }

    // Mixing matrix A (invertible)
    Mat2 A{1.0, 0.7, -0.4, 1.2};
    vector<Vec2> X(n);
    for(size_t i=0;i<n;++i){ X[i] = A * S[i]; }

    // Center
    Vec2 mX = meanv(X);
    for(auto& x: X){ x = x - mX; }

    // Whiten
    Mat2 C = cov(X);
    Mat2 Wwhite = whitening_from_cov(C);
    vector<Vec2> Xw(n);
    for(size_t i=0;i<n;++i){ Xw[i] = Wwhite * X[i]; }

    // FastICA with g(u)=tanh(u), g'(u)=1 - tanh^2(u)
    auto fastica_component = [&](const vector<Vec2>& Xw, const Vec2& w_init, const Vec2* w_orth) {
        Vec2 w = normalize(w_init);
        for(int it=0; it<200; ++it){
            Vec2 Egx{0,0}; double Egp = 0.0; // E[x g(w^T x)], E[g'(w^T x)]
            for(const auto& x : Xw){
                double u = w.x*x.x + w.y*x.y;
                double g = tanh(u);
                double gp = 1.0 - g*g;
                Egx.x += x.x * g; Egx.y += x.y * g;
                Egp += gp;
            }
            double invn = 1.0 / (double)Xw.size();
            Egx.x *= invn; Egx.y *= invn; Egp *= invn;
            Vec2 w_new{Egx.x - Egp*w.x, Egx.y - Egp*w.y};
            // Decorrelate from previous component if provided
            if(w_orth){
                double proj = w_new.x*w_orth->x + w_new.y*w_orth->y;
                w_new.x -= proj * w_orth->x;
                w_new.y -= proj * w_orth->y;
            }
            w = normalize(w_new);
        }
        return w;
    };

    // Estimate two components
    Vec2 w1 = fastica_component(Xw, Vec2{1.0, 0.0}, nullptr);
    Vec2 w2 = fastica_component(Xw, Vec2{0.0, 1.0}, &w1);

    // Recover sources: s_hat = [w1; w2] * xw
    vector<Vec2> Sh(n);
    for(size_t i=0;i<n;++i){
        double s1 = w1.x*Xw[i].x + w1.y*Xw[i].y;
        double s2 = w2.x*Xw[i].x + w2.y*Xw[i].y;
        Sh[i] = {s1, s2};
    }

    // Evaluate by absolute correlation (up to permutation & scaling)
    auto corr = [&](const vector<double>& a, const vector<double>& b){
        size_t N = a.size();
        double ma=0, mb=0; for(size_t i=0;i<N;++i){ ma+=a[i]; mb+=b[i]; } ma/=N; mb/=N;
        double va=0, vb=0, cab=0;
        for(size_t i=0;i<N;++i){ double da=a[i]-ma, db=b[i]-mb; va+=da*da; vb+=db*db; cab+=da*db; }
        return cab / sqrt(max(1e-12, va*vb));
    };

    vector<double> s1(n), s2(n), h1(n), h2(n);
    for(size_t i=0;i<n;++i){ s1[i]=S[i].x; s2[i]=S[i].y; h1[i]=Sh[i].x; h2[i]=Sh[i].y; }

    double c11 = fabs(corr(s1,h1)), c12 = fabs(corr(s1,h2));
    double c21 = fabs(corr(s2,h1)), c22 = fabs(corr(s2,h2));

    cout << fixed << setprecision(4);
    cout << "|corr(s1, h1)| = " << c11 << ", |corr(s1, h2)| = " << c12 << "\n";
    cout << "|corr(s2, h1)| = " << c21 << ", |corr(s2, h2)| = " << c22 << "\n";
    cout << "(Expect each source to align strongly with one recovered component.)\n";

    return 0;
}

This demo generates two independent sources (Laplace and Uniform), mixes them linearly, and then applies whitening followed by 2D FastICA. The fixed-point update with g(u)=tanh(u) maximizes non-Gaussianity, and orthogonalization enforces independent components. The printed absolute correlations show that each true source aligns with one recovered component (up to permutation and scaling), illustrating linear identifiability in ICA.

Time: O(n) per iteration (2D), dominated by computing expectations; plus O(1) for the 2×2 eigendecomposition. Space: O(n) to store samples; O(1) for parameters and small matrices.