How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Advanced

Spectral Analysis of Neural Networks

Key Points

  • Spectral analysis studies the distribution of eigenvalues and singular values of neural network weight matrices during training.
  • The empirical spectral distribution (ESD) often shows a random-matrix “bulk” plus a few outlier eigenvalues that capture learned structure.
  • The largest singular value (spectral norm) controls gradient amplification and stable learning-rate choices.
  • For non-symmetric weight matrices, singular values are usually more informative than eigenvalues; use W^T W to get a symmetric, positive semidefinite matrix.
  • Random Matrix Theory (RMT) provides useful baselines such as the Marchenko–Pastur law to compare with observed spectra.
  • Tracking spectra across epochs can diagnose exploding/vanishing gradients, overfitting, and implicit regularization.
  • Exact full spectra cost O(n³); scalable approximations (power iteration, Lanczos) run in O(k n²) for large layers.
  • In practice, analyze the symmetric part (W + W^T)/2 or the Gram matrix W^T W, and use histograms to inspect the ESD.

Prerequisites

  • →Linear algebra (vectors, matrices, eigenvalues/SVD) — Spectral analysis relies on eigenvalues and singular values, Gram matrices, and orthogonality.
  • →Probability and Random Matrix Theory (basics) — MP law and bulk/outlier interpretations require understanding of i.i.d. models and limiting distributions.
  • →Gradient-based optimization — Learning-rate stability bounds involve top eigenvalues of Hessians or Lipschitz constants from spectral norms.
  • →Neural network architectures (linear layers, ReLU, CNNs) — You must know how weights map to matrices to form W, W^T W, or symmetric parts for analysis.
  • →Numerical methods — Iterative eigensolvers (power iteration, Lanczos) and stability concerns are essential for scalable analysis.

Detailed Explanation


01 Overview

Spectral analysis of neural networks examines the eigenvalues and singular values of matrices that arise in learning—primarily weight matrices, but also Hessians, Jacobians, and Gram/NTK matrices. During training, these spectra evolve in characteristic ways: early on they can resemble random matrices, while later certain directions (outliers) emerge that reflect the learned patterns in data. By inspecting the empirical spectral distribution (ESD)—the histogram of eigenvalues or singular values—we gain a window into stability (e.g., whether gradients explode), expressivity (which directions dominate), and generalization (how heavy-tailed the spectrum is). Random Matrix Theory (RMT) provides a powerful reference: purely random weights have predictable spectra (e.g., Marchenko–Pastur for singular values), so deviations from these baselines indicate learned structure or pathologies. Practically, full eigenvalue decompositions are expensive for modern networks, but scalable approximations—tracking just the top singular value (spectral norm) or using Krylov methods—are often sufficient. This perspective is model-agnostic: it applies to linear layers and, with care, to convolutional and attention blocks by reshaping operators into matrices. Overall, spectral analysis offers a principled, quantitative tool to monitor and guide training beyond simple scalar metrics like loss and accuracy.

02 Intuition & Analogies

Think of a weight matrix as a machine that stretches and rotates space. Drop a bunch of arrows (vectors) into this machine: some directions get stretched a lot, others compressed. The stretching factors are the singular values; the most-stretched direction corresponds to the largest singular value. If that top stretch is huge, tiny errors can get blown up—like whispering into a loudspeaker cranked to max—leading to unstable training. Conversely, if everything is squashed, signals and gradients fade away. Now imagine plotting how much stretching happens in all directions as a histogram: that’s the empirical spectral distribution. For a brand-new, randomly initialized network, the histogram typically has a smooth “bulk” predicted by probability theory (RMT). As the model learns structure in data, a few bars start poking out from the bulk: these are outliers, representing directions that the network has tuned to be especially sensitive or expressive. You can think of the bulk as background noise and the outliers as the melody—it’s where the learning lives. Over training, watching those bars move helps you see if your learning rate is reasonable (top stretch not exploding), if regularization is working (not too many huge bars), and if you’re starting to overfit (heavy tail growth). For non-symmetric matrices (most weights), eigenvalues can be complex and tricky; instead, look at the symmetric part or at W^T W, whose eigenvalues are the squared singular values—clean, real, and easy to interpret.

03 Formal Definition

Given a matrix W ∈ R^{m×n}, its singular values {σ_i}, i = 1, 
, min(m, n), are the square roots of the eigenvalues of W^T W. The spectral norm is ∄W∄₂ = max_i σ_i, which equals the operator norm induced by the Euclidean norm. For a square matrix W ∈ R^{n×n}, the eigenvalues {λ_i}, i = 1, 
, n, solve det(W − λI) = 0 and may be complex if W is not symmetric. The empirical spectral distribution (ESD) of a Hermitian matrix A ∈ R^{n×n} with eigenvalues λ_1, 
, λ_n is ÎŒ_A = (1/n) ∑_{i=1}^{n} ÎŽ_{λ_i}, i.e., the uniform measure on its eigenvalues. For non-symmetric W, we typically analyze A = W^T W (positive semidefinite), so ÎŒ_A is supported on [0, ∞) and corresponds to squared singular values. Random Matrix Theory predicts baseline ESDs for matrices with i.i.d. entries: the Marchenko–Pastur law describes the limit of the ESD of (1/n) X^T X for rectangular Gaussian X as the dimensions grow with fixed aspect ratio. Training modifies W so that its Gram spectra deviate from these baselines, often producing outliers beyond the bulk edge. The spectral radius ρ(W) = max_i |λ_i| and the condition number Îș₂(W) = σ_max/σ_min are key stability indicators. In optimization, step-size stability for gradient descent on a quadratic f(x) = (1/2) x^T H x − b^T x requires 0 < η < 2/λ_max(H).
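As a concrete check of these definitions, here is a small sketch (added for illustration; the 2×2 closed form is standard textbook material, not from this page) that recovers singular values from the eigenvalues of the Gram matrix W^T W:

```cpp
#include <cassert>
#include <cmath>

// Singular values of a 2x2 matrix W = [[a,b],[c,d]] via the eigenvalues
// of the Gram matrix G = W^T W, using the closed form for a 2x2
// symmetric matrix: lambda = tr/2 +- sqrt(tr^2/4 - det).
struct Sv2 { double smax, smin; };

Sv2 singular_values_2x2(double a, double b, double c, double d) {
    // G = W^T W = [[a*a + c*c, a*b + c*d], [a*b + c*d, b*b + d*d]]
    double g11 = a * a + c * c, g12 = a * b + c * d, g22 = b * b + d * d;
    double tr = g11 + g22;              // = lambda1 + lambda2
    double det = g11 * g22 - g12 * g12; // = lambda1 * lambda2
    double disc = std::sqrt(std::max(0.0, tr * tr / 4.0 - det));
    double l1 = tr / 2.0 + disc;        // larger eigenvalue of G
    double l2 = tr / 2.0 - disc;        // smaller eigenvalue of G
    return { std::sqrt(l1), std::sqrt(std::max(0.0, l2)) };
}
```

For W = [[3, 0], [4, 5]], the Gram matrix has trace 50 and determinant 225, so σ₁² = 45 and σ₂² = 5; note that σ₁² + σ₂² equals ∄W∄_F² = 9 + 16 + 25 = 50, as the Frobenius identity requires.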

04 When to Use

Use spectral analysis when you want to diagnose or improve training stability and generalization. Concretely: (1) Before training, inspect ∄W∄₂ for new architectures or initializations; very large spectral norms suggest gradients may explode, motivating rescaling or spectral-norm regularization. (2) During training, track the top singular value of key layers; sudden growth can signal that the learning rate is too high. (3) To study generalization, compare the ESD of W^T W to the Marchenko–Pastur bulk: outliers indicate learned structure; heavy tails can correlate with over-parameterization and implicit regularization effects. (4) For second-order diagnostics, examine the Hessian’s top eigenvalues to gauge curvature; a large λ_max(H) suggests reducing the learning rate or using adaptive/second-order methods. (5) In pruning or compression, small singular values indicate near-redundant directions that can be truncated with limited performance impact. (6) For safety/robustness, large spectral norms in early layers imply strong input amplification and potential adversarial vulnerability; spectral normalization can help. (7) In sequence models and RNNs, keeping the spectral radius of recurrent matrices near 1 mitigates vanishing/exploding gradients.
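Point (1) can be sketched as follows. This is an illustrative helper (the name spectral_normalize and the target value of 1.0 are assumptions, not from this page): estimate ∄W∄₂ by power iteration on the implicit Gram operator v → W^T(W v), then rescale the layer if it exceeds the target.

```cpp
#include <cmath>
#include <random>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Estimate ||W||_2 by power iteration on the implicit operator v -> W^T (W v).
double spectral_norm(const Mat& W, int iters = 100) {
    int m = (int)W.size(), n = (int)W[0].size();
    std::mt19937 rng(0);
    std::normal_distribution<double> N(0.0, 1.0);
    std::vector<double> v(n);
    for (double& vi : v) vi = N(rng);   // random start vector
    double sigma = 0.0;
    for (int t = 0; t < iters; ++t) {
        std::vector<double> z(m, 0.0), u(n, 0.0);
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) z[i] += W[i][j] * v[j];  // z = W v
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) u[j] += W[i][j] * z[i];  // u = W^T z
        double zn = 0.0, vn = 0.0, un = 0.0;
        for (double zi : z) zn += zi * zi;
        for (double vi : v) vn += vi * vi;
        for (double uj : u) un += uj * uj;
        sigma = std::sqrt(zn / vn);     // Rayleigh estimate: ||W v|| / ||v||
        un = std::sqrt(std::max(un, 1e-300));
        for (int j = 0; j < n; ++j) v[j] = u[j] / un;  // normalized next iterate
    }
    return sigma;
}

// Hypothetical spectral normalization: rescale W so that ||W||_2 <= target.
void spectral_normalize(Mat& W, double target = 1.0) {
    double s = spectral_norm(W);
    if (s > target)
        for (auto& row : W)
            for (double& w : row) w *= target / s;
}
```

For example, for W = diag(3, 1) the estimate converges to 3, and after spectral_normalize(W, 1.0) the layer's spectral norm is 1.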

⚠Common Mistakes

  • Confusing eigenvalues with singular values for non-symmetric weights. Eigenvalues can be complex and do not directly measure amplification; use singular values via W^T W instead.
  • Looking only at the largest singular value and ignoring the distribution. The bulk shape, outliers, and tails carry distinct information about learned structure and regularization.
  • Misapplying random-matrix baselines. Marchenko–Pastur assumes i.i.d. entries; convolutions, normalization layers, and weight sharing break these assumptions, so treat RMT as a qualitative guide, not a strict test.
  • Forming W^T W explicitly for huge layers. This squares the condition number and costs O(m n²) time and O(n²) memory; prefer implicit products A v = W^T (W v) for iterative methods.
  • Using power iteration without normalization or with too few iterations, leading to poor estimates of the spectral norm; monitor the Rayleigh quotient for convergence.
  • Interpreting negative eigenvalues of the symmetric part (W + W^T)/2 as forward-pass instability; they relate to non-normality and asymmetric components, so prefer singular values when measuring amplification.
  • Ignoring scaling. Compare spectra across epochs using consistent normalization (e.g., by layer width), or apparent changes may just reflect size, not learning.
  • Relying on tiny samples for histograms. Use enough bins and samples; for large matrices, adopt stochastic methods (Lanczos quadrature, Hutchinson) to estimate spectral densities reliably.
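The first mistake deserves a concrete example (a small sketch added here, using standard 2×2 closed forms): a nilpotent shear matrix has spectral radius zero yet amplifies some inputs strongly, which only the singular values detect.

```cpp
#include <cmath>
#include <complex>
#include <algorithm>

// For a 2x2 matrix W = [[a,b],[c,d]]:
// spectral radius from the characteristic polynomial lambda^2 - tr*lambda + det,
// sigma_max from the eigenvalues of the Gram matrix W^T W.
double spectral_radius_2x2(double a, double b, double c, double d) {
    std::complex<double> tr(a + d, 0.0), det(a * d - b * c, 0.0);
    std::complex<double> disc = std::sqrt(tr * tr - 4.0 * det);  // may be imaginary
    std::complex<double> l1 = (tr + disc) / 2.0, l2 = (tr - disc) / 2.0;
    return std::max(std::abs(l1), std::abs(l2));
}

double sigma_max_2x2(double a, double b, double c, double d) {
    // G = W^T W is symmetric 2x2; its top eigenvalue is tr/2 + sqrt(tr^2/4 - det).
    double g11 = a * a + c * c, g12 = a * b + c * d, g22 = b * b + d * d;
    double tr = g11 + g22, det = g11 * g22 - g12 * g12;
    double disc = std::sqrt(std::max(0.0, tr * tr / 4.0 - det));
    return std::sqrt(tr / 2.0 + disc);
}
```

The shear W = [[0, 10], [0, 0]] has both eigenvalues equal to 0 (ρ(W) = 0) but maps (0, 1) to (10, 0), so σ_max = 10: the amplification is invisible to the eigenvalues. A 90° rotation, by contrast, has complex eigenvalues ±i with ρ = σ_max = 1.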

Key Formulas

Empirical Spectral Distribution (ESD)

ÎŒ_A = (1/n) ∑_{i=1}^{n} ÎŽ_{λ_i}

Explanation: The ESD is the uniform measure over eigenvalues of a Hermitian matrix A. In practice, we estimate it by a histogram of eigenvalues.

Spectral Norm

∄W∄₂ = σ_max(W) = √(λ_max(W^T W))

Explanation: The largest singular value equals the square root of the largest eigenvalue of the Gram matrix W^T W. It measures the maximum amplification by W.

Spectral Radius

ρ(W) = max_i |λ_i(W)|

Explanation: The spectral radius is the largest magnitude among eigenvalues. For recurrent dynamics, it governs stability of repeated application of W.

Stieltjes Transform

m_A(z) = (1/n) tr((A − zI)^{-1})

Explanation: This complex-valued function characterizes the spectrum of A. Its imaginary part along the real axis recovers the density of eigenvalues.

Marchenko–Pastur Law

p_MP(λ) = √((λ₊ − λ)(λ − λ₋)) / (2πcλ),  λ± = (1 ± √c)²

Explanation: For X ∈ R^{n×m} with i.i.d. unit-variance entries and aspect ratio c = m/n, the eigenvalues of (1/n) X^T X concentrate on [λ₋, λ₊] with the above density. Deviations signal structure beyond randomness.
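A quick numerical sanity check of this baseline (an illustrative sketch; the sizes n = 2000, m = 500 are arbitrary choices): sample X with i.i.d. N(0, 1) entries and verify that the top eigenvalue of (1/n) X^T X lands near the predicted bulk edge λ₊ = (1 + √c)² for c = m/n.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Top eigenvalue of A = (1/n) X^T X for X (n x m) with i.i.d. N(0,1) entries,
// estimated by power iteration with implicit products u -> X^T (X u) / n,
// so the m x m Gram matrix is never formed explicitly.
double mp_top_eigenvalue(int n, int m, unsigned seed, int iters = 200) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> N(0.0, 1.0);
    std::vector<double> X((size_t)n * m);
    for (double& x : X) x = N(rng);
    std::vector<double> u(m);
    for (double& ui : u) ui = N(rng);
    double lam = 0.0;
    for (int t = 0; t < iters; ++t) {
        std::vector<double> z(n, 0.0), w(m, 0.0);
        for (int i = 0; i < n; ++i) {                 // z = X u
            const double* row = &X[(size_t)i * m];
            double s = 0.0;
            for (int j = 0; j < m; ++j) s += row[j] * u[j];
            z[i] = s;
        }
        for (int i = 0; i < n; ++i) {                 // w = X^T z
            const double* row = &X[(size_t)i * m];
            for (int j = 0; j < m; ++j) w[j] += row[j] * z[i];
        }
        double num = 0.0, den = 0.0, wn = 0.0;
        for (int j = 0; j < m; ++j) {
            num += w[j] * u[j];
            den += u[j] * u[j];
            wn += w[j] * w[j];
        }
        lam = num / (n * den);                        // Rayleigh quotient of (1/n) X^T X
        wn = std::sqrt(wn);
        for (int j = 0; j < m; ++j) u[j] = w[j] / wn; // normalized next iterate
    }
    return lam;
}
```

With n = 2000 and m = 500, c = 0.25 and λ₊ = 1.5² = 2.25; the estimate should land within a few percent of the edge (finite-size fluctuations follow the Tracy–Widom scale n^{-2/3}).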

Gradient Descent Stability (Quadratic)

0 < η < 2/λ_max(H)

Explanation: For minimizing a quadratic with Hessian H using gradient descent, the step size must be below 2/λ_max(H) to ensure convergence.
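The bound is easy to verify on a one-dimensional quadratic, where H reduces to the scalar curvature λ and the threshold is η = 2/λ (a toy sketch):

```cpp
#include <cmath>

// Gradient descent on f(x) = 0.5 * lam * x^2, so grad f(x) = lam * x.
// The iterate obeys x_{k+1} = (1 - eta * lam) * x_k, which contracts
// exactly when |1 - eta * lam| < 1, i.e. 0 < eta < 2 / lam.
double gd_final_abs(double lam, double eta, double x0, int steps) {
    double x = x0;
    for (int k = 0; k < steps; ++k) x -= eta * (lam * x);
    return std::fabs(x);
}
```

With λ = 4 the threshold is η = 0.5: a step size of 0.45 converges geometrically, while 0.55 diverges.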

Rayleigh Quotient and Power Iteration

r_k = (v_k^T A v_k)/(v_k^T v_k),  v_{k+1} = A v_k / ∄A v_k∄

Explanation: The Rayleigh quotient r_k approximates the dominant eigenvalue of a symmetric matrix A as the normalized iterates v_k converge to the top eigenvector.
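A minimal instance of this recursion (a sketch on a 2×2 symmetric matrix whose eigenvalues, 3 and 1, are known in closed form):

```cpp
#include <cmath>

// Power iteration on the symmetric matrix A = [[2,1],[1,2]], whose
// eigenvalues are 3 (eigenvector (1,1)/sqrt(2)) and 1 (eigenvector (1,-1)/sqrt(2)).
// Returns the Rayleigh quotient after `iters` steps.
double power_iteration_2x2(int iters) {
    double v0 = 1.0, v1 = 0.0;              // start away from the top eigenvector
    double r = 0.0;
    for (int k = 0; k < iters; ++k) {
        double w0 = 2.0 * v0 + 1.0 * v1;    // w = A v
        double w1 = 1.0 * v0 + 2.0 * v1;
        r = (v0 * w0 + v1 * w1) / (v0 * v0 + v1 * v1);  // Rayleigh quotient
        double n = std::sqrt(w0 * w0 + w1 * w1);        // v_{k+1} = w / ||w||
        v0 = w0 / n;
        v1 = w1 / n;
    }
    return r;
}
```

The error contracts like (λ₂/λ₁)^k = (1/3)^k in the eigenvector and quadratically faster in the Rayleigh quotient, so a few dozen iterations reach machine precision here.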

Trace–Eigenvalue Identity

∑_{i=1}^{n} λ_i(A) = tr(A)

Explanation: The sum of eigenvalues equals the trace, which is easy to compute. Useful for sanity checks and moment estimates.

Frobenius Norm and Singular Values

∄W∄_F² = tr(W^T W) = ∑_i σ_i²

Explanation: The squared Frobenius norm equals the sum of squared singular values, relating entrywise scale to spectral scale.

Condition Number

Îș₂(W) = σ_max(W)/σ_min(W)

Explanation: A large condition number indicates ill-conditioning and potential optimization slowdowns or instability.

Complexity Analysis

Computing the full spectrum of an n×n matrix via dense eigendecomposition (e.g., the QR algorithm) costs O(n³) time and O(n²) space. For a rectangular m×n weight W with m ≈ n ≈ d, forming the Gram matrix W^T W already costs O(m n²) time and O(n²) space, and the subsequent eigendecomposition is O(n³). This makes exact ESD computation infeasible for large layers (e.g., d ≄ 10⁎). Iterative methods substantially reduce costs when only partial information is needed. Power iteration for the spectral norm of W uses repeated products v ← W^T(W v), each costing O(m n); with k iterations, time is O(k m n) and space is O(m + n), avoiding explicit Gram formation. Convergence is geometric with rate |λ₂/λ₁| for symmetric matrices, so a moderate k (e.g., 20–100) often suffices. Lanczos methods approximate multiple extremal eigenvalues, or even the spectral density, through a small tridiagonal surrogate in O(k m n) time and O(k n) space (storing k basis vectors). When estimating ESDs, stochastic variants (e.g., Hutchinson trace estimation with Lanczos quadrature) trade a small bias for massive gains in scalability. For mini-batch tracking during training, amortizing k over epochs further reduces overhead. In summary: use O(n³) full decompositions only for small layers or research prototypes; otherwise, prefer O(k m n) iterative methods with implicit products, ensuring numerical stability through normalization and occasional re-orthogonalization.
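The implicit-product trick at the heart of these estimates can be made concrete with a simple multiply-add count (a counting sketch with rough constants, not a benchmark):

```cpp
// Multiply-add counts (not wall-clock time) for estimating sigma_max of an
// m x n weight matrix W:
//   implicit: k iterations of v -> W^T (W v), about 2*m*n multiply-adds each;
//   explicit: first form G = W^T W (about m*n*n multiply-adds), then run
//             k iterations on the n x n Gram matrix (n*n each).
long long flops_implicit(long long m, long long n, long long k) {
    return 2 * k * m * n;
}

long long flops_explicit(long long m, long long n, long long k) {
    return m * n * n + k * n * n;
}
```

For a square layer with d = 4096 and k = 50, the implicit route needs about 1.7 × 10⁹ multiply-adds while forming the Gram matrix alone needs about 6.9 × 10¹⁰, which is why the iterative methods above always use v ← W^T(W v).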

Code Examples

Tracking the largest singular value during training via implicit power iteration (no external libraries)
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    int r, c;
    vector<double> a; // row-major storage
    Matrix() : r(0), c(0) {}
    Matrix(int r_, int c_, double val = 0.0) : r(r_), c(c_), a(r_ * c_, val) {}
    double &operator()(int i, int j) { return a[i * c + j]; }
    const double &operator()(int i, int j) const { return a[i * c + j]; }
};

// Multiply A (r x c) by vector x (size c) -> y (size r)
vector<double> matvec(const Matrix &A, const vector<double> &x) {
    vector<double> y(A.r, 0.0);
    for (int i = 0; i < A.r; ++i) {
        const double *row = &A.a[i * A.c];
        double s = 0.0;
        for (int j = 0; j < A.c; ++j) s += row[j] * x[j];
        y[i] = s;
    }
    return y;
}

// Multiply A^T by vector x (size r) -> y (size c)
vector<double> matTvec(const Matrix &A, const vector<double> &x) {
    vector<double> y(A.c, 0.0);
    for (int i = 0; i < A.r; ++i) {
        double xi = x[i];
        const double *row = &A.a[i * A.c];
        for (int j = 0; j < A.c; ++j) y[j] += row[j] * xi;
    }
    return y;
}

// Matrix-matrix multiply: C = A * B
Matrix matmul(const Matrix &A, const Matrix &B) {
    assert(A.c == B.r);
    Matrix C(A.r, B.c, 0.0);
    for (int i = 0; i < A.r; ++i) {
        for (int k = 0; k < A.c; ++k) {
            double aik = A(i, k);
            for (int j = 0; j < B.c; ++j) {
                C(i, j) += aik * B(k, j);
            }
        }
    }
    return C;
}

Matrix transpose(const Matrix &A) {
    Matrix T(A.c, A.r, 0.0);
    for (int i = 0; i < A.r; ++i)
        for (int j = 0; j < A.c; ++j)
            T(j, i) = A(i, j);
    return T;
}

// Spectral norm via power iteration on the implicit Gram operator v -> A^T(A v).
// Returns an estimate of ||A||_2.
double spectral_norm_power_iteration(const Matrix &A, int iters = 50, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> N(0.0, 1.0);
    // Work in A.c-dimensional space (columns of A)
    vector<double> v(A.c);
    for (double &vi : v) vi = N(rng);
    auto normalize = [](vector<double> &x) {
        double n = 0.0;
        for (double xi : x) n += xi * xi;
        n = sqrt(max(n, 1e-30));
        for (double &xi : x) xi /= n;
        return n;
    };
    normalize(v);
    double rayleigh = 0.0;
    for (int t = 0; t < iters; ++t) {
        vector<double> z = matvec(A, v);       // z = A v
        vector<double> v_next = matTvec(A, z); // v_next = A^T (A v)
        // Rayleigh quotient for A^T A: r = (v^T A^T A v)/(v^T v) = ||A v||^2 / ||v||^2
        double num = 0.0; for (double zi : z) num += zi * zi;
        double den = 0.0; for (double vi : v) den += vi * vi;
        rayleigh = (den > 0 ? num / den : 0.0);
        normalize(v_next);
        v.swap(v_next);
    }
    return sqrt(max(0.0, rayleigh));
}

// Simple 2-layer linear network: y_hat = X W1 W2; MSE loss; gradient descent
int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int m = 256;      // samples
    int d = 64;       // input dim
    int h = 64;       // hidden dim
    int o = 10;       // output dim
    double lr = 0.05; // learning rate
    int epochs = 50;

    mt19937 rng(1);
    normal_distribution<double> N(0.0, 1.0);

    Matrix X(m, d), W1(d, h), W2(h, o), Y(m, o);

    // Random data and target with a planted low-rank structure
    for (double &v : X.a) v = N(rng) / sqrt(d);
    Matrix U_true(d, 3), V_true(3, o);
    for (double &v : U_true.a) v = N(rng) / sqrt(d);
    for (double &v : V_true.a) v = N(rng) / sqrt(o);
    Matrix W_true = matmul(U_true, V_true);
    Matrix XW_true = matmul(X, W_true);
    // Add small noise to targets
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < o; ++j)
            Y(i, j) = XW_true(i, j) + 0.05 * N(rng);

    // Initialize weights
    for (double &v : W1.a) v = N(rng) * sqrt(2.0 / d);
    for (double &v : W2.a) v = N(rng) * sqrt(2.0 / h);

    auto mse = [&](const Matrix &P) {
        double s = 0.0;
        for (double x : P.a) s += x * x;
        return 0.5 * s / m;
    };

    cout.setf(std::ios::fixed);
    cout << setprecision(6);
    for (int e = 0; e < epochs; ++e) {
        // Forward: H = X W1; Yhat = H W2
        Matrix H = matmul(X, W1);
        Matrix Yhat = matmul(H, W2);
        // Error and loss
        Matrix E(m, o);
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < o; ++j)
                E(i, j) = Yhat(i, j) - Y(i, j);
        double loss = mse(E);

        // Gradients: dW2 = (H^T E)/m ; dW1 = (X^T (E W2^T))/m
        Matrix HT = transpose(H);
        Matrix dW2 = matmul(HT, E);
        for (double &v : dW2.a) v /= m;
        Matrix XT = transpose(X);
        Matrix EW2T = matmul(E, transpose(W2)); // E * W2^T, shape m x h
        Matrix dW1 = matmul(XT, EW2T);
        for (double &v : dW1.a) v /= m;

        // Update
        for (size_t i = 0; i < W1.a.size(); ++i) W1.a[i] -= lr * dW1.a[i];
        for (size_t i = 0; i < W2.a.size(); ++i) W2.a[i] -= lr * dW2.a[i];

        // Track spectral norms of W1, W2, and product W_eff = W1 W2 (via power iteration)
        double s1 = spectral_norm_power_iteration(W1, 40, 100 + e);
        double s2 = spectral_norm_power_iteration(W2, 40, 200 + e);
        Matrix Weff = matmul(W1, W2);
        double se = spectral_norm_power_iteration(Weff, 40, 300 + e);

        cout << "epoch " << setw(2) << e
             << " loss=" << loss
             << " ||W1||2=" << s1
             << " ||W2||2=" << s2
             << " ||W1W2||2=" << se << "\n";
    }
    return 0;
}

This program trains a small 2-layer linear network with mean-squared error. After each epoch, it estimates the spectral norms of W1, W2, and their product W1W2 using power iteration without forming Gram matrices. The spectral norm tracks the maximum amplification of each layer, which is closely tied to stability and gradient behavior. The planted low-rank target induces outlier singular values as training progresses.

Time: Each epoch's forward/backward pass is O(m d h + m h o + d h o). Each spectral-norm estimate with k iterations costs O(k m n) for an m×n matrix; overall, each epoch is dominated by the matrix multiplies and the O(k m n) power iterations.
Space: O(m d + d h + h o) to store data and weights, plus O(m + max(d, h, o)) for temporary vectors used in power iteration.
Full small-matrix spectrum and ESD histogram using Eigen (for research-scale demos)
#include <bits/stdc++.h>
#include <Eigen/Dense>
using namespace std;
using namespace Eigen;

// Compile with: g++ -O2 -std=c++17 main.cpp -I /path/to/eigen

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int m = 512, d = 64, h = 64, o = 16; // small so eigendecomposition is feasible
    double lr = 0.05;
    int epochs = 30;

    std::mt19937 rng(42);
    std::normal_distribution<double> N(0.0, 1.0);

    auto randn = [&](int r, int c) {
        MatrixXd X(r, c);
        for (int i = 0; i < r; i++)
            for (int j = 0; j < c; j++) X(i, j) = N(rng);
        return X;
    };

    MatrixXd X = randn(m, d) / sqrt(d);
    MatrixXd W1 = randn(d, h) * sqrt(2.0 / d);
    RowVectorXd b1 = RowVectorXd::Zero(h);
    MatrixXd W2 = randn(h, o) * sqrt(2.0 / h);
    RowVectorXd b2 = RowVectorXd::Zero(o);

    // Create targets with a planted low-rank signal through a ReLU layer
    MatrixXd U = randn(d, 3) / sqrt(d);
    MatrixXd V = randn(3, o) / sqrt(o);
    MatrixXd Z = (X * U).array().max(0.0).matrix() * V; // nonlinearity
    MatrixXd Y = Z + 0.05 * randn(m, o);

    auto relu = [](const MatrixXd &A) -> MatrixXd { return A.array().max(0.0).matrix(); };

    for (int e = 0; e < epochs; ++e) {
        // Forward
        MatrixXd Hpre = X * W1;
        Hpre.rowwise() += b1;
        MatrixXd H = relu(Hpre);
        MatrixXd Yhat = H * W2;
        Yhat.rowwise() += b2;
        MatrixXd E = Yhat - Y;
        double loss = 0.5 * E.squaredNorm() / m;

        // Backward
        MatrixXd dYhat = E / m;
        MatrixXd dW2 = H.transpose() * dYhat;
        RowVectorXd db2 = dYhat.colwise().sum();
        MatrixXd dH = dYhat * W2.transpose();
        MatrixXd dHpre = (dH.array() * (Hpre.array() > 0.0).cast<double>()).matrix();
        MatrixXd dW1 = X.transpose() * dHpre;
        RowVectorXd db1 = dHpre.colwise().sum();

        // Update
        W2 -= lr * dW2; b2 -= lr * db2;
        W1 -= lr * dW1; b1 -= lr * db1;

        cout.setf(std::ios::fixed);
        cout << setprecision(6);
        cout << "epoch " << setw(2) << e << " loss=" << loss;

        // Exact singular values of W1 (small d, h) and eigenvalues of symmetric part
        JacobiSVD<MatrixXd> svd(W1, ComputeThinU | ComputeThinV);
        VectorXd s = svd.singularValues();
        double smax = s(0);
        cout << " ||W1||2=" << smax;

        // Symmetric part eigenvalues (real): S = (W1 + W1^T)/2
        MatrixXd S = 0.5 * (W1 + W1.transpose());
        SelfAdjointEigenSolver<MatrixXd> es(S);
        VectorXd evals = es.eigenvalues();
        double lam_max = evals.maxCoeff();
        cout << " lam_max(S)=" << lam_max << "\n";
    }

    // Final: ESD histogram of W1^T W1 (squared singular values)
    MatrixXd G = W1.transpose() * W1; // positive semidefinite
    SelfAdjointEigenSolver<MatrixXd> esG(G);
    VectorXd lams = esG.eigenvalues(); // these are sigma_i^2

    // Histogram
    int bins = 20;
    double mn = lams.minCoeff();
    double mx = lams.maxCoeff();
    vector<int> hist(bins, 0);
    for (int i = 0; i < (int)lams.size(); ++i) {
        int b = (mx == mn) ? 0 : (int)((lams(i) - mn) / (mx - mn) * bins);
        if (b == bins) b = bins - 1;
        hist[b]++;
    }
    cout << "\nESD of W1^T W1 (sigma^2) histogram:\n";
    for (int b = 0; b < bins; ++b) {
        double left = mn + (mx - mn) * b / bins;
        double right = mn + (mx - mn) * (b + 1) / bins;
        cout << fixed << setprecision(4) << "[" << left << "," << right << "): ";
        int cnt = hist[b];
        int stars = min(60, cnt);
        for (int k = 0; k < stars; ++k) cout << '*';
        cout << " (" << cnt << ")\n";
    }

    return 0;
}

This example trains a small 1-hidden-layer ReLU network (so that exact decompositions are feasible) and then computes: (1) the exact singular values of W1, (2) eigenvalues of its symmetric part S=(W1+W1^T)/2 each epoch, and (3) a final histogram of eigenvalues of W1^T W1 (squared singular values), which is an ESD estimate. This gives a compact, practical workflow to inspect spectra and their evolution during training.

Time: Per epoch, O(m d h + m h o) for forward/backward. SVD/eigendecompositions on d×h or h×h matrices scale like O(min(d, h)² · max(d, h)). The final ESD via eigendecomposition of the h×h Gram matrix costs O(h³).
Space: O(m d + d h + h o) for data and parameters; O(h²) extra for the Gram matrix and eigensolvers.
#spectral analysis#eigenvalues#singular values#spectral norm#spectral radius#empirical spectral distribution#marchenko–pastur#power iteration#lanczos#neural network training#hessian spectrum#random matrix theory#stability#generalization#esd histogram