Representation Learning Theory
Key Points
- Representation learning aims to automatically discover features that make downstream tasks easy, often without human-provided labels.
- Disentanglement seeks representations where independent latent factors each occupy their own coordinate, enabling interpretability and compositional generalization.
- Identifiability asks when the true latent factors can be recovered; linear ICA is identifiable up to permutation and scaling under non-Gaussianity, while nonlinear cases need inductive biases.
- Contrastive learning trains encoders to pull positives together and push negatives apart, with InfoNCE providing a mutual-information lower bound.
- Self-supervised learning creates supervision from pretext tasks (e.g., predicting rotations or masked tokens) to learn broadly useful features.
- Good representations transfer: a single learned embedding can support many tasks via small adapters or linear probes.
- Theory shows practical trade-offs: InfoNCE's similarity matrix costs \(O(B^2)\) per batch while its MI bound grows only as \(\log(K+1)\) in the number of negatives, negatives affect bias/variance, and augmentations encode the intended invariances.
- Deep networks often develop abstractions layer by layer, compressing nuisances while preserving task-relevant information.
Prerequisites
- Linear Algebra (vectors, matrices, eigen-decomposition) — Representations are linear/nonlinear transformations of vectors; ICA and whitening rely on covariance and eigen-decomposition.
- Probability and Random Variables — Independence, expectations, and distributions underpin MI, ICA assumptions, and contrastive sampling.
- Information Theory Basics — Mutual information, data processing, and compression-accuracy trade-offs explain what representations retain or discard.
- Optimization and Gradient Descent — Training encoders minimizes losses like InfoNCE via gradients.
- Matrix Calculus (lightweight) — Deriving gradients of similarity scores with respect to linear maps helps implement and reason about training.
- Numerical Stability (softmax, log-sum-exp) — Avoids overflow/underflow in contrastive objectives.
- Neural Networks and Embeddings — Encoders map inputs to vectors where contrastive losses operate.
- Statistical Estimation and Generalization — Bound interpretations (InfoNCE) and transfer evaluation rely on understanding bias/variance and overfitting.
Detailed Explanation
01 Overview
Representation learning is the study of how to transform raw data (images, audio, text, time series) into features that make learning and reasoning easier. Instead of hand-crafting features, we learn them directly from data by optimizing objectives that encourage invariances, sufficiency, and disentanglement. Classic theory begins with linear independent component analysis (ICA), which recovers statistically independent sources from their mixtures under mild assumptions. Modern deep learning extends this to nonlinear encoders trained with self-supervision: pretext tasks and contrastive learning teach models which variations to ignore and which to keep. Mutual information (MI) and its tractable bounds (such as InfoNCE) formalize how much dependence a representation keeps about its input or a paired view. Identifiability results tell us when latent factors can, in principle, be uniquely recovered. In practice, inductive biases—architectures, data augmentations, and objectives—steer learning toward useful solutions. This theoretical lens explains why contrastive objectives work, when disentangled factors are recoverable, and how transfer learning is possible: a good representation retains the right information to support many tasks with minimal additional training.
02 Intuition & Analogies
Imagine organizing a messy garage. If you toss everything randomly into boxes, you’ll never find what you need. But if you sort by purpose—tools with tools, sports gear with sports gear—you can quickly solve many future tasks. Representations are like your sorting system: a good one groups together variations that shouldn’t matter (e.g., lighting changes in images) and separates differences that do (e.g., object identity). Disentanglement is like giving each shelf a single purpose—one shelf per factor—so you can flexibly recombine them (a red, shiny, small car corresponds to independent shelves for color, gloss, and size). Contrastive learning is the “spot the difference” game: take two slightly different views of the same item (positives) and many unrelated items (negatives); then learn a code that brings the two views together and pushes the unrelated ones apart. Self-supervised pretext tasks are chores that indirectly keep your garage tidy: labeling nothing by hand, you still get order by asking the system to predict rotations, fill in missing pieces, or match augmented views. Identifiability asks whether someone else, using your sorting rules, would rebuild essentially the same shelves—are the underlying categories uniquely recoverable? Linear ICA says yes (up to relabeling and rescaling) if the items (sources) are non-Gaussian. In nonlinear settings, you need extra rules (inductive biases)—like insisting all power tools go on a metal rack (architectural constraints) or that near-duplicate photos should land together (augmentations)—to get back consistent shelves.
03 Formal Definition
Let observations \(x \in \mathbb{R}^{D}\) be generated from latent factors \(s = (s_1,\dots,s_d)\) through a mixing function \(f\), so that \(x = f(s)\). A representation is a map \(\phi : \mathbb{R}^{D} \to \mathbb{R}^{d}\), and representation learning seeks \(\phi\) such that \(z = \phi(x)\) retains task-relevant information about \(x\) while discarding nuisances. In linear ICA, \(f(s) = A s\) for an invertible matrix \(A\) with the \(s_i\) mutually independent and non-Gaussian; identifiability then means \(s\) is recoverable from \(x\) up to permutation and scaling. Contrastive methods instead assume paired views \((x, x^{+})\) drawn from a joint distribution and train \(\phi\) to make \(I(\phi(x); \phi(x^{+}))\) large, in practice by minimizing the InfoNCE loss given under Key Formulas below.
04 When to Use
Use representation learning when labels are scarce, tasks are numerous or evolving, and you expect shared structure across them. Contrastive learning is effective when you can define meaningful augmentations or paired signals (e.g., two crops of the same image, audio-video sync, text-audio pairs). Choose ICA or other linear methods when mixtures are approximately linear and sources are independent and non-Gaussian (e.g., simple signal separation). Apply self-supervised pretext tasks when domain priors are clear: masking for language and vision, next-step prediction for time series, rotation prediction for images. For transfer learning, pretrain broad encoders on large unlabeled corpora and evaluate with linear probes or few-shot fine-tuning. Prefer contrastive objectives when you can afford large batches or memory banks (to provide many negatives) and when invariances are well captured by augmentations; opt for non-contrastive or predictive methods when negatives are hard to define. In scientific discovery or structured data, seek inductive biases (architectures, causal assumptions) to make nonlinear latent factors more identifiable.
⚠️ Common Mistakes
- Assuming disentanglement emerges “for free.” Without explicit inductive biases (architectures, sparsity, augmentations, group structure), nonlinear factors are generally unidentifiable.
- Misusing augmentations. If augmentations remove task-relevant information, the representation collapses or underperforms on downstream tasks.
- Confusing MI estimates with guarantees. InfoNCE provides a bound, not an exact MI value; comparing bounds across setups can be misleading due to estimator bias and temperature effects.
- Too few negatives in contrastive learning. Small effective negative sets increase variance and can overfit to shortcuts; batch size or memory banks matter.
- Ignoring symmetry in ICA. Sources are only identifiable up to permutation and scaling; evaluating with naive MSE can seem poor despite correct recovery modulo these symmetries.
- Overfitting pretext tasks. A model can exploit shortcuts (e.g., boundary artifacts) to solve the pretext objective without learning transferable features; use strong augmentations and diverse tasks.
- Neglecting normalization and stability. Contrastive losses need numerically stable softmax; ICA needs whitening and careful nonlinearity to avoid divergence.
- Forgetting evaluation protocols. Use linear probes, frozen-encoder transfer, and few-shot adaptation to fairly assess representation quality; a minimal linear-probe sketch follows this list.
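A minimal sketch of the linear-probe protocol under illustrative assumptions: synthetic 2D "frozen features" stand in for a real encoder's output, the labels are ±1, and least squares replaces the logistic regression typically used in practice, purely to keep the example self-contained.

#include <bits/stdc++.h>
using namespace std;

// Linear probe on frozen features: fit only a linear classifier (with bias)
// via least squares on +/-1 labels, then report training accuracy.
int main(){
    mt19937 rng(1);
    normal_distribution<double> n0(-1.0, 0.5), n1(1.0, 0.5);
    size_t N = 500; // samples per class (illustrative)
    vector<array<double,3>> X; vector<double> y; // features [x0, x1, 1], labels +/-1
    for(size_t i=0;i<N;++i){ X.push_back({n0(rng), n0(rng), 1.0}); y.push_back(-1.0); }
    for(size_t i=0;i<N;++i){ X.push_back({n1(rng), n1(rng), 1.0}); y.push_back(+1.0); }

    // Normal equations (X^T X) w = X^T y, solved by 3x3 Gaussian elimination.
    double A[3][4] = {{0}};
    for(size_t i=0;i<X.size();++i){
        for(int r=0;r<3;++r){
            for(int c=0;c<3;++c) A[r][c] += X[i][r]*X[i][c];
            A[r][3] += X[i][r]*y[i];
        }
    }
    for(int p=0;p<3;++p){ // forward elimination with partial pivoting
        int best = p;
        for(int r=p+1;r<3;++r) if(fabs(A[r][p]) > fabs(A[best][p])) best = r;
        swap(A[p], A[best]);
        for(int r=p+1;r<3;++r){
            double f = A[r][p]/A[p][p];
            for(int c=p;c<4;++c) A[r][c] -= f*A[p][c];
        }
    }
    double w[3];
    for(int p=2;p>=0;--p){ // back substitution
        double s = A[p][3];
        for(int c=p+1;c<3;++c) s -= A[p][c]*w[c];
        w[p] = s/A[p][p];
    }

    // Accuracy of the probe: the encoder (here, identity) is never updated.
    size_t correct = 0;
    for(size_t i=0;i<X.size();++i){
        double score = w[0]*X[i][0] + w[1]*X[i][1] + w[2]*X[i][2];
        if((score > 0 ? 1.0 : -1.0) == y[i]) ++correct;
    }
    cout << "Linear probe accuracy: " << (double)correct/X.size() << "\n";
    return 0;
}

The key point of the protocol is that only the probe's three parameters are trained; if accuracy is high with the encoder frozen, the representation itself, not the classifier, is doing the work.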
Key Formulas
Mutual Information
\( I(X;Y) = \mathbb{E}_{p(x,y)}\!\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right] \)
Explanation: Mutual information measures how much knowing X reduces uncertainty about Y (and vice versa). It is zero if and only if X and Y are independent.
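As a quick numeric check of this definition, here is a minimal sketch that evaluates \(I(X;Y)\) for a small discrete joint distribution; the 2×2 table is an arbitrary illustrative choice.

#include <bits/stdc++.h>
using namespace std;

// Compute I(X;Y) in nats for a small discrete joint distribution.
int main(){
    // Illustrative 2x2 joint p(x,y); rows index X, columns index Y.
    double p[2][2] = {{0.4, 0.1},
                      {0.1, 0.4}};
    double px[2] = {0,0}, py[2] = {0,0};
    for(int i=0;i<2;++i)
        for(int j=0;j<2;++j){ px[i] += p[i][j]; py[j] += p[i][j]; }
    double mi = 0.0;
    for(int i=0;i<2;++i)
        for(int j=0;j<2;++j)
            if(p[i][j] > 0) mi += p[i][j]*log(p[i][j]/(px[i]*py[j]));
    cout << "I(X;Y) = " << mi << " nats\n"; // zero iff X and Y are independent
    return 0;
}

With this table the program prints roughly 0.19 nats; replacing it with a product distribution (e.g., all entries 0.25) drives the output to zero, matching the independence criterion in the explanation above.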
InfoNCE Loss
\( \mathcal{L}_{\text{InfoNCE}} = -\,\mathbb{E}\left[\log \frac{\exp\big(\mathrm{sim}(z, z^{+})/\tau\big)}{\sum_{k=0}^{K} \exp\big(\mathrm{sim}(z, z_k)/\tau\big)}\right] \), where \(z_0 = z^{+}\) and \(z_1,\dots,z_K\) are negatives.
Explanation: This contrastive objective encourages the similarity of a positive pair to exceed those of K negatives. Lower loss means better separation of positives from negatives.
InfoNCE Bound
\( I(X;Y) \;\ge\; \log(K+1) - \mathcal{L}_{\text{InfoNCE}} \)
Explanation: With K negatives, InfoNCE provides a lower bound on mutual information between X and Y when positives are drawn from the joint and negatives from the product of marginals.
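A useful sanity check on this bound: it can never certify more than \(\log(K+1)\) nats, no matter how small the loss gets. For instance, with \(K = 255\) negatives (batch size 256),
\[
I(X;Y) \;\ge\; \log 256 - \mathcal{L}_{\text{InfoNCE}} \approx 5.55 - \mathcal{L}_{\text{InfoNCE}},
\]
so even a loss of zero certifies at most about 5.55 nats; certifying large mutual information this way requires exponentially many negatives.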
NT-Xent (Per-Anchor)
\( \ell_i = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_{i^+})/\tau\big)}{\sum_{j=1,\ j\neq i}^{2N} \exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)} \)
Explanation: For each anchor i among 2N views, the positive is its paired view \(i^+\), and all other 2N-2 views are negatives. Cosine similarity is often used for \(\mathrm{sim}(\cdot,\cdot)\).
\(\beta\)-VAE Objective
\( \mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big) \)
Explanation: Increasing \(\beta\) tightens the bottleneck, encouraging disentanglement by penalizing information flow from x to z. Too large a \(\beta\) harms reconstruction.
ICA Model
\( x = A s \), with independent sources \(s = (s_1,\dots,s_d)\) and invertible mixing matrix \(A\)
Explanation: Observations are linear mixtures of independent sources. Under non-Gaussianity (except at most one Gaussian), sources are identifiable up to permutation and scaling.
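The permutation-and-scaling ambiguity in this model can be read off directly: for any permutation matrix \(P\) and invertible diagonal matrix \(D\),
\[
x = A s = \big(A P^{-1} D^{-1}\big)\,\big(D P s\big),
\]
and \(D P s\) still has independent components, so the relabeled, rescaled sources explain the data exactly as well. Non-Gaussianity rules out further rotational ambiguity but cannot remove these symmetries.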
Whitening
\( \tilde{x} = \Sigma^{-1/2}(x - \mu), \qquad \Sigma = \mathbb{E}\big[(x-\mu)(x-\mu)^{\top}\big] \)
Explanation: Whitening centers data and removes correlations so covariance becomes the identity. It stabilizes ICA and simplifies subsequent orthogonal unmixing.
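One line verifies the identity-covariance claim, using the symmetry of \(\Sigma^{-1/2}\):
\[
\mathrm{Cov}(\tilde{x}) = \Sigma^{-1/2}\,\mathbb{E}\big[(x-\mu)(x-\mu)^{\top}\big]\,\Sigma^{-1/2} = \Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = I.
\]
After whitening, the remaining unknown mixing is orthogonal, which is why FastICA can restrict its search to unit-norm directions and use simple decorrelation between components.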
FastICA Fixed-Point
\( w^{+} = \mathbb{E}\big[\tilde{x}\, g(w^{\top}\tilde{x})\big] - \mathbb{E}\big[g'(w^{\top}\tilde{x})\big]\, w, \qquad w \leftarrow w^{+}/\lVert w^{+} \rVert \)
Explanation: This update maximizes non-Gaussianity of the projection using a nonlinearity g (e.g., tanh). Normalization enforces unit norm; decorrelation yields multiple components.
Triplet Loss
\( \mathcal{L}_{\text{triplet}} = \max\big(0,\; d(a,p) - d(a,n) + m\big) \)
Explanation: Encourages the anchor to be closer to the positive than to the negative by margin m. It is a non-probabilistic alternative to InfoNCE.
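A minimal sketch of this loss using squared Euclidean distance for \(d\); the embeddings and the margin \(m = 0.5\) below are arbitrary illustrative values.

#include <bits/stdc++.h>
using namespace std;

// Squared Euclidean distance between two embedding vectors.
double sqdist(const vector<double>& u, const vector<double>& v){
    double s = 0.0;
    for(size_t i=0;i<u.size();++i){ double d = u[i]-v[i]; s += d*d; }
    return s;
}

// Triplet loss: hinge on the gap between anchor-positive and anchor-negative distances.
double triplet_loss(const vector<double>& a, const vector<double>& p,
                    const vector<double>& n, double margin){
    return max(0.0, sqdist(a,p) - sqdist(a,n) + margin);
}

int main(){
    vector<double> anchor{1.0, 0.0}, positive{0.9, 0.1}, negative{-1.0, 0.2};
    // d(a,p) = 0.02 and d(a,n) = 4.04, so the hinge is inactive and the loss is 0.
    cout << "Triplet loss: " << triplet_loss(anchor, positive, negative, 0.5) << "\n";
    return 0;
}

Unlike InfoNCE, the loss goes exactly to zero once the margin is satisfied, so a hard-violating negative (one closer to the anchor than the positive plus margin) is needed to produce any gradient.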
Temperature-Scaled Softmax
\( p_k = \frac{\exp(s_k/\tau)}{\sum_j \exp(s_j/\tau)} \)
Explanation: Controls the sharpness of the probability distribution over similarities. Lower temperature emphasizes the largest scores and can increase gradient variance.
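A short sketch showing how \(\tau\) controls sharpness; the scores are arbitrary illustrative values, and the same max-subtraction trick used in the Code Examples section keeps the exponentials stable.

#include <bits/stdc++.h>
using namespace std;

// Numerically stable temperature-scaled softmax.
vector<double> softmax_tau(const vector<double>& s, double tau){
    double m = *max_element(s.begin(), s.end()) / tau; // subtract max for stability
    vector<double> p(s.size());
    double Z = 0.0;
    for(size_t i=0;i<s.size();++i){ p[i] = exp(s[i]/tau - m); Z += p[i]; }
    for(double& v : p) v /= Z;
    return p;
}

int main(){
    vector<double> scores{2.0, 1.0, 0.5};
    for(double tau : {1.0, 0.1}){
        auto p = softmax_tau(scores, tau);
        cout << "tau = " << tau << ":";
        for(double v : p) cout << " " << fixed << setprecision(4) << v;
        cout << "\n"; // lower tau concentrates nearly all mass on the top score
    }
    return 0;
}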
Complexity Analysis
For a batch of B embeddings of dimension d, InfoNCE forms all B² pairwise similarities in O(B²·d) time and O(B²) memory, plus an O(B) log-sum-exp per anchor; the gradient accumulation in the training example below is likewise O(B²·d) per epoch. This quadratic dependence on batch size is why large batches and memory banks are expensive yet valuable sources of negatives. For ICA on n samples in d dimensions, centering and covariance estimation cost O(n·d²), the eigen-decomposition for whitening costs O(d³), and each FastICA fixed-point iteration costs O(n·d) per component.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Compute dot product between two vectors
double dot(const vector<double>& a, const vector<double>& b){
    double s = 0.0;
    size_t d = a.size();
    for(size_t i=0;i<d;++i) s += a[i]*b[i];
    return s;
}

// Numerically stable log-sum-exp for a vector of scores divided by tau
double logsumexp(const vector<double>& scores, double tau){
    double m = -1e300;
    for(double s : scores) m = max(m, s/tau);
    double sum = 0.0;
    for(double s : scores) sum += exp((s/tau) - m);
    return m + log(max(sum, 1e-300));
}

// InfoNCE loss: anchors = Z1, candidates = Z2; positive for i is Z2[i].
// Similarity is dot product. Temperature tau > 0.
double infoNCE(const vector<vector<double>>& Z1,
               const vector<vector<double>>& Z2,
               double tau){
    size_t B = Z1.size();
    double loss = 0.0;
    for(size_t i=0;i<B;++i){
        vector<double> scores(B);
        for(size_t k=0;k<B;++k){
            scores[k] = dot(Z1[i], Z2[k]); // similarity
        }
        double lse = logsumexp(scores, tau);
        double numerator = scores[i]/tau; // log exp(sim/tau) = sim/tau
        loss += -(numerator - lse);
    }
    return loss / static_cast<double>(B);
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    size_t B = 4, d = 3;
    double tau = 0.2;
    // Create two views: Z2 is Z1 with small noise so positives are aligned
    vector<vector<double>> Z1(B, vector<double>(d));
    vector<vector<double>> Z2(B, vector<double>(d));
    std::mt19937 rng(42);
    std::normal_distribution<double> norm(0.0, 1.0);
    std::normal_distribution<double> small(0.0, 0.1);

    for(size_t i=0;i<B;++i)
        for(size_t j=0;j<d;++j)
            Z1[i][j] = norm(rng);

    for(size_t i=0;i<B;++i)
        for(size_t j=0;j<d;++j)
            Z2[i][j] = Z1[i][j] + small(rng); // positive pairs are close

    double L = infoNCE(Z1, Z2, tau);
    cout << fixed << setprecision(6);
    cout << "InfoNCE loss: " << L << "\n";

    return 0;
}
This program computes the InfoNCE loss for a small batch with two views, assuming the i-th vector in Z1 matches the i-th in Z2. We use dot-product similarity and a temperature-scaled softmax computed via log-sum-exp for numerical stability. The positives are designed to be similar (Z2 is Z1 plus small noise), so the loss should be relatively low. This mirrors the core step in contrastive learning that pulls positives together and pushes against many negatives.
#include <bits/stdc++.h>
using namespace std;

struct Mat2 {
    double a11, a12, a21, a22; // row-major 2x2
};

// y = W * x for 2D vectors
array<double,2> matvec(const Mat2& W, const array<double,2>& x){
    return {W.a11*x[0] + W.a12*x[1], W.a21*x[0] + W.a22*x[1]};
}

// dot for 2D
inline double dot2(const array<double,2>& u, const array<double,2>& v){
    return u[0]*v[0] + u[1]*v[1];
}

// Update W by gradient descent on InfoNCE where sim(i,k) = (W x_i) · y_k.
// Gradient wrt W is sum_k (p_{ik} - 1[k=i]) * (y_k x_i^T) / tau.
int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Synthetic data: base points x_i around a circle; positives y_i are noisy rotated versions
    size_t B = 64;        // batch size
    size_t epochs = 200;
    double tau = 0.2;
    double lr = 0.05;
    std::mt19937 rng(123);
    std::uniform_real_distribution<double> uni(0.0, 2.0*M_PI);
    std::normal_distribution<double> noise(0.0, 0.05);

    vector<array<double,2>> X(B), Y(B);
    for(size_t i=0;i<B;++i){
        double t = uni(rng);
        double r = 1.0 + 0.1*noise(rng);
        X[i] = {r*cos(t), r*sin(t)};
        // Augmentation: small rotation + noise
        double dt = 0.1*noise(rng);
        double c = cos(dt), s = sin(dt);
        array<double,2> x = X[i];
        Y[i] = {c*x[0] - s*x[1] + noise(rng), s*x[0] + c*x[1] + noise(rng)};
    }

    // Initialize linear projector W randomly
    std::normal_distribution<double> initn(0.0, 0.5);
    Mat2 W{initn(rng), initn(rng), initn(rng), initn(rng)};

    auto infoNCE_loss = [&](const Mat2& W)->double{
        double L = 0.0;
        for(size_t i=0;i<B;++i){
            array<double,2> z = matvec(W, X[i]);
            vector<double> scores(B);
            double m = -1e300;
            for(size_t k=0;k<B;++k){
                scores[k] = dot2(z, Y[k]);
                m = max(m, scores[k]/tau);
            }
            double denom = 0.0;
            for(size_t k=0;k<B;++k) denom += exp((scores[k]/tau) - m);
            double lse = m + log(max(denom, 1e-300));
            L += -(scores[i]/tau - lse);
        }
        return L / (double)B;
    };

    for(size_t ep=0; ep<epochs; ++ep){
        // Accumulate gradient dL/dW
        double g11=0.0, g12=0.0, g21=0.0, g22=0.0;
        double loss = 0.0;
        for(size_t i=0;i<B;++i){
            array<double,2> z = matvec(W, X[i]);
            vector<double> scores(B);
            double m = -1e300;
            for(size_t k=0;k<B;++k){
                scores[k] = dot2(z, Y[k]);
                m = max(m, scores[k]/tau);
            }
            double denom = 0.0;
            for(size_t k=0;k<B;++k) denom += exp((scores[k]/tau) - m);
            vector<double> probs(B);
            for(size_t k=0;k<B;++k) probs[k] = exp((scores[k]/tau) - m) / max(denom, 1e-300);

            // Per-sample loss contribution
            double lse = m + log(max(denom, 1e-300));
            loss += -(scores[i]/tau - lse);

            // Gradient: sum_k (p_{ik} - 1[k=i]) * (y_k x_i^T) / tau
            for(size_t k=0;k<B;++k){
                double coeff = (probs[k] - (k==i ? 1.0 : 0.0)) / tau;
                // y_k x_i^T adds to W's gradient
                g11 += coeff * (Y[k][0] * X[i][0]);
                g12 += coeff * (Y[k][0] * X[i][1]);
                g21 += coeff * (Y[k][1] * X[i][0]);
                g22 += coeff * (Y[k][1] * X[i][1]);
            }
        }
        loss /= (double)B;
        // Gradient step (average over batch)
        W.a11 -= lr * (g11 / (double)B);
        W.a12 -= lr * (g12 / (double)B);
        W.a21 -= lr * (g21 / (double)B);
        W.a22 -= lr * (g22 / (double)B);

        if((ep+1) % 50 == 0){
            cout << "Epoch " << (ep+1) << ": loss = " << fixed << setprecision(6) << loss << "\n";
        }
    }

    // Final loss
    cout << "Final InfoNCE loss: " << fixed << setprecision(6) << infoNCE_loss(W) << "\n";
    return 0;
}
We generate 2D points on a noisy circle and create positive pairs by applying a small random rotation and noise. We learn a 2×2 linear projection W so that (W x_i) is close (by dot product) to its positive y_i and far from other y_k via InfoNCE. The gradient of the dot similarity with respect to W is y_k x_i^T, making the implementation simple. The printed loss should decrease over epochs, illustrating how contrastive objectives shape a representation even with a tiny linear model.
#include <bits/stdc++.h>
using namespace std;

struct Vec2 { double x, y; };
struct Mat2 { double a, b, c, d; /* [a b; c d] */ };

Vec2 operator+(const Vec2& u, const Vec2& v){ return {u.x+v.x, u.y+v.y}; }
Vec2 operator-(const Vec2& u, const Vec2& v){ return {u.x-v.x, u.y-v.y}; }
Vec2 operator*(double s, const Vec2& u){ return {s*u.x, s*u.y}; }
Vec2 operator*(const Mat2& M, const Vec2& v){ return {M.a*v.x + M.b*v.y, M.c*v.x + M.d*v.y}; }
Mat2 operator*(const Mat2& A, const Mat2& B){ return {A.a*B.a + A.b*B.c, A.a*B.b + A.b*B.d, A.c*B.a + A.d*B.c, A.c*B.b + A.d*B.d}; }
Mat2 transpose(const Mat2& M){ return {M.a, M.c, M.b, M.d}; }

double dot(const Vec2& u, const Vec2& v){ return u.x*v.x + u.y*v.y; }
Vec2 normalize(const Vec2& v){ double n = sqrt(max(1e-12, v.x*v.x + v.y*v.y)); return {v.x/n, v.y/n}; }

// Compute mean of vectors
Vec2 meanv(const vector<Vec2>& X){ Vec2 m{0,0}; for(auto& x: X){ m.x+=x.x; m.y+=x.y; } m.x/=X.size(); m.y/=X.size(); return m; }

// Covariance of centered data
Mat2 cov(const vector<Vec2>& X){
    double sxx=0, sxy=0, syy=0; size_t n=X.size();
    for(auto& x: X){ sxx += x.x*x.x; sxy += x.x*x.y; syy += x.y*x.y; }
    sxx/=n; sxy/=n; syy/=n;
    return {sxx, sxy, sxy, syy};
}

// Eigen-decomposition of symmetric 2x2 [[a,b],[b,c]]
void eig2_sym(double a, double b, double c, double &l1, double &l2, Mat2 &V){
    double tr = a + c;
    double det = a*c - b*b;
    double temp = sqrt(max(0.0, tr*tr/4.0 - det));
    l1 = tr/2.0 + temp; l2 = tr/2.0 - temp;
    // Eigenvector for l1: (b, l1 - a) if b != 0, else (1,0)
    Vec2 v1 = (fabs(b) > 1e-12) ? Vec2{b, l1 - a} : Vec2{1.0, 0.0};
    Vec2 v2 = (fabs(b) > 1e-12) ? Vec2{b, l2 - a} : Vec2{0.0, 1.0};
    v1 = normalize(v1); v2 = normalize(v2);
    // Make v2 orthogonal to v1 in case of numerical issues
    double proj = v1.x*v2.x + v1.y*v2.y;
    v2 = normalize(Vec2{v2.x - proj*v1.x, v2.y - proj*v1.y});
    V = {v1.x, v2.x, v1.y, v2.y}; // columns are eigenvectors
}

// Whitening matrix: W = V * diag(1/sqrt(l)) * V^T
Mat2 whitening_from_cov(const Mat2& C){
    double l1, l2; Mat2 V;
    eig2_sym(C.a, C.b, C.d, l1, l2, V);
    double s1 = 1.0 / sqrt(max(l1, 1e-12));
    double s2 = 1.0 / sqrt(max(l2, 1e-12));
    Mat2 D{ s1, 0.0, 0.0, s2 };
    Mat2 VT = transpose(V);
    return V * D * VT;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Generate independent non-Gaussian sources (Laplace and Uniform)
    size_t n = 5000;
    std::mt19937 rng(7);
    std::exponential_distribution<double> expo(1.0); // Laplace via difference of exponentials
    std::uniform_real_distribution<double> uni(-1.0, 1.0);
    vector<Vec2> S(n);
    for(size_t i=0;i<n;++i){
        double lap = (expo(rng) - expo(rng));
        double uni0 = uni(rng);
        S[i] = {lap, uni0};
    }

    // Mixing matrix A (invertible)
    Mat2 A{1.0, 0.7, -0.4, 1.2};
    vector<Vec2> X(n);
    for(size_t i=0;i<n;++i){ X[i] = A * S[i]; }

    // Center
    Vec2 mX = meanv(X);
    for(auto& x: X){ x = x - mX; }

    // Whiten
    Mat2 C = cov(X);
    Mat2 Wwhite = whitening_from_cov(C);
    vector<Vec2> Xw(n);
    for(size_t i=0;i<n;++i){ Xw[i] = Wwhite * X[i]; }

    // FastICA with g(u)=tanh(u), g'(u)=1 - tanh^2(u)
    auto fastica_component = [&](const vector<Vec2>& Xw, const Vec2& w_init, const Vec2* w_orth) {
        Vec2 w = normalize(w_init);
        for(int it=0; it<200; ++it){
            Vec2 Egx{0,0}; double Egp = 0.0; // E[x g(w^T x)], E[g'(w^T x)]
            for(const auto& x : Xw){
                double u = w.x*x.x + w.y*x.y;
                double g = tanh(u);
                double gp = 1.0 - g*g;
                Egx.x += x.x * g; Egx.y += x.y * g;
                Egp += gp;
            }
            double invn = 1.0 / (double)Xw.size();
            Egx.x *= invn; Egx.y *= invn; Egp *= invn;
            Vec2 w_new{Egx.x - Egp*w.x, Egx.y - Egp*w.y};
            // Decorrelate from previous component if provided
            if(w_orth){
                double proj = w_new.x*w_orth->x + w_new.y*w_orth->y;
                w_new.x -= proj * w_orth->x;
                w_new.y -= proj * w_orth->y;
            }
            w = normalize(w_new);
        }
        return w;
    };

    // Estimate two components
    Vec2 w1 = fastica_component(Xw, Vec2{1.0, 0.0}, nullptr);
    Vec2 w2 = fastica_component(Xw, Vec2{0.0, 1.0}, &w1);

    // Recover sources: s_hat = [w1; w2] * xw
    vector<Vec2> Sh(n);
    for(size_t i=0;i<n;++i){
        double s1 = w1.x*Xw[i].x + w1.y*Xw[i].y;
        double s2 = w2.x*Xw[i].x + w2.y*Xw[i].y;
        Sh[i] = {s1, s2};
    }

    // Evaluate by absolute correlation (up to permutation & scaling)
    auto corr = [&](const vector<double>& a, const vector<double>& b){
        size_t N = a.size();
        double ma=0, mb=0; for(size_t i=0;i<N;++i){ ma+=a[i]; mb+=b[i]; } ma/=N; mb/=N;
        double va=0, vb=0, cab=0;
        for(size_t i=0;i<N;++i){ double da=a[i]-ma, db=b[i]-mb; va+=da*da; vb+=db*db; cab+=da*db; }
        return cab / sqrt(max(1e-12, va*vb));
    };

    vector<double> s1(n), s2(n), h1(n), h2(n);
    for(size_t i=0;i<n;++i){ s1[i]=S[i].x; s2[i]=S[i].y; h1[i]=Sh[i].x; h2[i]=Sh[i].y; }

    double c11 = fabs(corr(s1,h1)), c12 = fabs(corr(s1,h2));
    double c21 = fabs(corr(s2,h1)), c22 = fabs(corr(s2,h2));

    cout << fixed << setprecision(4);
    cout << "|corr(s1, h1)| = " << c11 << ", |corr(s1, h2)| = " << c12 << "\n";
    cout << "|corr(s2, h1)| = " << c21 << ", |corr(s2, h2)| = " << c22 << "\n";
    cout << "(Expect each source to align strongly with one recovered component.)\n";

    return 0;
}
This demo generates two independent sources (Laplace and Uniform), mixes them linearly, and then applies whitening followed by 2D FastICA. The fixed-point update with g(u)=tanh(u) maximizes non-Gaussianity, and orthogonalization enforces independent components. The printed absolute correlations show that each true source aligns with one recovered component (up to permutation and scaling), illustrating linear identifiability in ICA.