How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Advanced

Spectral Analysis of Neural Networks

Key Points

  • Spectral analysis studies the distribution of eigenvalues and singular values of neural network weight matrices during training.
  • The empirical spectral distribution (ESD) often shows a random-matrix “bulk” plus a few outlier eigenvalues that capture learned structure.
  • The largest singular value (spectral norm) controls gradient amplification and stable learning-rate choices.
  • For non-symmetric weight matrices, singular values are usually more informative than eigenvalues; use W^T W to get a symmetric, positive semidefinite matrix.
  • Random Matrix Theory (RMT) provides useful baselines such as the Marchenko–Pastur law to compare with observed spectra.
  • Tracking spectra across epochs can diagnose exploding/vanishing gradients, overfitting, and implicit regularization.
  • Exact full spectra cost O(n³); scalable approximations (power iteration, Lanczos) run in O(k n²) for large layers.
  • In practice, analyze the symmetric part (W + W^T)/2 or the Gram matrix W^T W, and use histograms to inspect the ESD.

Prerequisites

  • →Linear algebra (vectors, matrices, eigenvalues/SVD) — Spectral analysis relies on eigenvalues and singular values, Gram matrices, and orthogonality.
  • →Probability and Random Matrix Theory (basics) — MP law and bulk/outlier interpretations require understanding of i.i.d. models and limiting distributions.
  • →Gradient-based optimization — Learning-rate stability bounds involve top eigenvalues of Hessians or Lipschitz constants from spectral norms.
  • →Neural network architectures (linear layers, ReLU, CNNs) — You must know how weights map to matrices to form W, W^T W, or symmetric parts for analysis.
  • →Numerical methods — Iterative eigensolvers (power iteration, Lanczos) and stability concerns are essential for scalable analysis.

Detailed Explanation


01 Overview

Spectral analysis of neural networks examines the eigenvalues and singular values of matrices that arise in learning—primarily weight matrices, but also Hessians, Jacobians, and Gram/NTK matrices. During training, these spectra evolve in characteristic ways: early on they can resemble random matrices, while later certain directions (outliers) emerge that reflect the learned patterns in data. By inspecting the empirical spectral distribution (ESD)—the histogram of eigenvalues or singular values—we gain a window into stability (e.g., whether gradients explode), expressivity (which directions dominate), and generalization (how heavy-tailed the spectrum is). Random Matrix Theory (RMT) provides a powerful reference: purely random weights have predictable spectra (e.g., Marchenko–Pastur for singular values), so deviations from these baselines indicate learned structure or pathologies. Practically, full eigenvalue decompositions are expensive for modern networks, but scalable approximations—tracking just the top singular value (spectral norm) or using Krylov methods—are often sufficient. This perspective is model-agnostic: it applies to linear layers and, with care, to convolutional and attention blocks by reshaping operators into matrices. Overall, spectral analysis offers a principled, quantitative tool to monitor and guide training beyond simple scalar metrics like loss and accuracy.

02 Intuition & Analogies

Think of a weight matrix as a machine that stretches and rotates space. Drop a bunch of arrows (vectors) into this machine: some directions get stretched a lot, others compressed. The stretching factors are the singular values; the most-stretched direction corresponds to the largest singular value. If that top stretch is huge, tiny errors can get blown up—like whispering into a loudspeaker cranked to max—leading to unstable training. Conversely, if everything is squashed, signals and gradients fade away. Now imagine plotting how much stretching happens in all directions as a histogram: that’s the empirical spectral distribution. For a brand-new, randomly initialized network, the histogram typically has a smooth “bulk” predicted by probability theory (RMT). As the model learns structure in data, a few bars start poking out from the bulk: these are outliers, representing directions that the network has tuned to be especially sensitive or expressive. You can think of the bulk as background noise and the outliers as the melody—it’s where the learning lives. Over training, watching those bars move helps you see if your learning rate is reasonable (top stretch not exploding), if regularization is working (not too many huge bars), and if you’re starting to overfit (heavy tail growth). For non-symmetric matrices (most weights), eigenvalues can be complex and tricky; instead, look at the symmetric part or at W^T W, whose eigenvalues are the squared singular values—clean, real, and easy to interpret.

03 Formal Definition

Given a matrix W ∈ R^{m×n}, its singular values {σ_i}, i = 1, 
, min(m, n), are the square roots of the eigenvalues of W^T W. The spectral norm is ∄W∄₂ = max_i σ_i, which equals the operator norm induced by the Euclidean norm. For a square matrix W ∈ R^{n×n}, the eigenvalues {λ_i}, i = 1, 
, n, solve det(W − λI) = 0 and may be complex if W is not symmetric. The empirical spectral distribution (ESD) of a Hermitian matrix A ∈ R^{n×n} with eigenvalues λ_1, 
, λ_n is ÎŒ_A = (1/n) ∑_{i=1}^{n} ÎŽ_{λ_i}, i.e., the uniform measure on its eigenvalues. For non-symmetric W, we typically analyze A = W^T W (positive semidefinite), so ÎŒ_A is supported on [0, ∞) and corresponds to squared singular values. Random Matrix Theory predicts baseline ESDs for matrices with i.i.d. entries: the Marchenko–Pastur law describes the limit of the ESD of (1/n) X^T X for rectangular Gaussian X as the dimensions grow with fixed aspect ratio. Training modifies W so that its Gram spectra deviate from these baselines, often producing outliers beyond the bulk edge. The spectral radius ρ(W) = max_i |λ_i| and the condition number Îș₂(W) = σ_max/σ_min are key stability indicators. In optimization, step-size stability for gradient descent on a quadratic f(x) = (1/2) x^T H x − b^T x requires 0 < η < 2/λ_max(H).
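As a concrete check of these definitions, here is a small sketch (added for illustration; the 2×2 closed form is standard textbook material, not from this page) that recovers singular values from the eigenvalues of the Gram matrix W^T W:

```cpp
#include <cassert>
#include <cmath>

// Singular values of a 2x2 matrix W = [[a,b],[c,d]] via the eigenvalues
// of the Gram matrix G = W^T W, using the closed form for a 2x2
// symmetric matrix: lambda = tr/2 +- sqrt(tr^2/4 - det).
struct Sv2 { double smax, smin; };

Sv2 singular_values_2x2(double a, double b, double c, double d) {
    // G = W^T W = [[a*a + c*c, a*b + c*d], [a*b + c*d, b*b + d*d]]
    double g11 = a * a + c * c, g12 = a * b + c * d, g22 = b * b + d * d;
    double tr = g11 + g22;              // = lambda1 + lambda2
    double det = g11 * g22 - g12 * g12; // = lambda1 * lambda2
    double disc = std::sqrt(std::max(0.0, tr * tr / 4.0 - det));
    double l1 = tr / 2.0 + disc;        // larger eigenvalue of G
    double l2 = tr / 2.0 - disc;        // smaller eigenvalue of G
    return { std::sqrt(l1), std::sqrt(std::max(0.0, l2)) };
}
```

For W = [[3, 0], [4, 5]], the Gram matrix has trace 50 and determinant 225, so σ₁² = 45 and σ₂² = 5; note that σ₁² + σ₂² equals ∄W∄_F² = 9 + 16 + 25 = 50, as the Frobenius identity requires.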

04 When to Use

Use spectral analysis when you want to diagnose or improve training stability and generalization. Concretely: (1) Before training, inspect ∄W∄₂ for new architectures or initializations; very large spectral norms suggest gradients may explode, motivating rescaling or spectral-norm regularization. (2) During training, track the top singular value of key layers; sudden growth can signal that the learning rate is too high. (3) To study generalization, compare the ESD of W^T W to the Marchenko–Pastur bulk: outliers indicate learned structure; heavy tails can correlate with over-parameterization and implicit regularization effects. (4) For second-order diagnostics, examine the Hessian’s top eigenvalues to gauge curvature; a large λ_max(H) suggests reducing the learning rate or using adaptive/second-order methods. (5) In pruning or compression, small singular values indicate near-redundant directions that can be truncated with limited performance impact. (6) For safety/robustness, large spectral norms in early layers imply strong input amplification and potential adversarial vulnerability; spectral normalization can help. (7) In sequence models and RNNs, keeping the spectral radius of recurrent matrices near 1 mitigates vanishing/exploding gradients.
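Point (1) can be sketched as follows. This is an illustrative helper (the name spectral_normalize and the target value of 1.0 are assumptions, not from this page): estimate ∄W∄₂ by power iteration on the implicit Gram operator v → W^T(W v), then rescale the layer if it exceeds the target.

```cpp
#include <cmath>
#include <random>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Estimate ||W||_2 by power iteration on the implicit operator v -> W^T (W v).
double spectral_norm(const Mat& W, int iters = 100) {
    int m = (int)W.size(), n = (int)W[0].size();
    std::mt19937 rng(0);
    std::normal_distribution<double> N(0.0, 1.0);
    std::vector<double> v(n);
    for (double& vi : v) vi = N(rng);   // random start vector
    double sigma = 0.0;
    for (int t = 0; t < iters; ++t) {
        std::vector<double> z(m, 0.0), u(n, 0.0);
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) z[i] += W[i][j] * v[j];  // z = W v
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) u[j] += W[i][j] * z[i];  // u = W^T z
        double zn = 0.0, vn = 0.0, un = 0.0;
        for (double zi : z) zn += zi * zi;
        for (double vi : v) vn += vi * vi;
        for (double uj : u) un += uj * uj;
        sigma = std::sqrt(zn / vn);     // Rayleigh estimate: ||W v|| / ||v||
        un = std::sqrt(std::max(un, 1e-300));
        for (int j = 0; j < n; ++j) v[j] = u[j] / un;  // normalized next iterate
    }
    return sigma;
}

// Hypothetical spectral normalization: rescale W so that ||W||_2 <= target.
void spectral_normalize(Mat& W, double target = 1.0) {
    double s = spectral_norm(W);
    if (s > target)
        for (auto& row : W)
            for (double& w : row) w *= target / s;
}
```

For example, for W = diag(3, 1) the estimate converges to 3, and after spectral_normalize(W, 1.0) the layer's spectral norm is 1.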

⚠Common Mistakes

  • Confusing eigenvalues with singular values for non-symmetric weights. Eigenvalues can be complex and do not directly measure amplification; use singular values via W^T W instead.
  • Looking only at the largest singular value and ignoring the distribution. The bulk shape, outliers, and tails carry distinct information about learned structure and regularization.
  • Misapplying random-matrix baselines. Marchenko–Pastur assumes i.i.d. entries; convolutions, normalization layers, and weight sharing break these assumptions, so treat RMT as a qualitative guide, not a strict test.
  • Forming W^T W explicitly for huge layers. This squares the condition number and costs O(m n²) time and O(n²) memory; prefer implicit products A v = W^T (W v) for iterative methods.
  • Using power iteration without normalization or with too few iterations, leading to poor estimates of the spectral norm; monitor the Rayleigh quotient for convergence.
  • Interpreting negative eigenvalues of the symmetric part (W + W^T)/2 as forward-pass instability; they relate to non-normality and asymmetric components, so prefer singular values when measuring amplification.
  • Ignoring scaling. Compare spectra across epochs using consistent normalization (e.g., by layer width), or apparent changes may just reflect size, not learning.
  • Relying on tiny samples for histograms. Use enough bins and samples; for large matrices, adopt stochastic methods (Lanczos quadrature, Hutchinson) to estimate spectral densities reliably.
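The first mistake deserves a concrete example (a small sketch added here, using standard 2×2 closed forms): a nilpotent shear matrix has spectral radius zero yet amplifies some inputs strongly, which only the singular values detect.

```cpp
#include <cmath>
#include <complex>
#include <algorithm>

// For a 2x2 matrix W = [[a,b],[c,d]]:
// spectral radius from the characteristic polynomial lambda^2 - tr*lambda + det,
// sigma_max from the eigenvalues of the Gram matrix W^T W.
double spectral_radius_2x2(double a, double b, double c, double d) {
    std::complex<double> tr(a + d, 0.0), det(a * d - b * c, 0.0);
    std::complex<double> disc = std::sqrt(tr * tr - 4.0 * det);  // may be imaginary
    std::complex<double> l1 = (tr + disc) / 2.0, l2 = (tr - disc) / 2.0;
    return std::max(std::abs(l1), std::abs(l2));
}

double sigma_max_2x2(double a, double b, double c, double d) {
    // G = W^T W is symmetric 2x2; its top eigenvalue is tr/2 + sqrt(tr^2/4 - det).
    double g11 = a * a + c * c, g12 = a * b + c * d, g22 = b * b + d * d;
    double tr = g11 + g22, det = g11 * g22 - g12 * g12;
    double disc = std::sqrt(std::max(0.0, tr * tr / 4.0 - det));
    return std::sqrt(tr / 2.0 + disc);
}
```

The shear W = [[0, 10], [0, 0]] has both eigenvalues equal to 0 (ρ(W) = 0) but maps (0, 1) to (10, 0), so σ_max = 10: the amplification is invisible to the eigenvalues. A 90° rotation, by contrast, has complex eigenvalues ±i with ρ = σ_max = 1.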

Key Formulas

Empirical Spectral Distribution (ESD)

ÎŒ_A = (1/n) ∑_{i=1}^{n} ÎŽ_{λ_i}

Explanation: The ESD is the uniform measure over eigenvalues of a Hermitian matrix A. In practice, we estimate it by a histogram of eigenvalues.

Spectral Norm

∄W∄₂ = σ_max(W) = √(λ_max(W^T W))

Explanation: The largest singular value equals the square root of the largest eigenvalue of the Gram matrix W^T W. It measures the maximum amplification by W.

Spectral Radius

ρ(W) = max_i |λ_i(W)|

Explanation: The spectral radius is the largest magnitude among eigenvalues. For recurrent dynamics, it governs stability of repeated application of W.

Stieltjes Transform

m_A(z) = (1/n) tr((A − zI)^{-1})

Explanation: This complex-valued function characterizes the spectrum of A. Its imaginary part along the real axis recovers the density of eigenvalues.

Marchenko–Pastur Law

p_MP(λ) = √((λ₊ − λ)(λ − λ₋)) / (2πcλ),  λ± = (1 ± √c)²

Explanation: For X ∈ R^{n×m} with i.i.d. unit-variance entries and aspect ratio c = m/n, the eigenvalues of (1/n) X^T X concentrate on [λ₋, λ₊] with the above density. Deviations signal structure beyond randomness.
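A quick numerical sanity check of this baseline (an illustrative sketch; the sizes n = 2000, m = 500 are arbitrary choices): sample X with i.i.d. N(0, 1) entries and verify that the top eigenvalue of (1/n) X^T X lands near the predicted bulk edge λ₊ = (1 + √c)² for c = m/n.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Top eigenvalue of A = (1/n) X^T X for X (n x m) with i.i.d. N(0,1) entries,
// estimated by power iteration with implicit products u -> X^T (X u) / n,
// so the m x m Gram matrix is never formed explicitly.
double mp_top_eigenvalue(int n, int m, unsigned seed, int iters = 200) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> N(0.0, 1.0);
    std::vector<double> X((size_t)n * m);
    for (double& x : X) x = N(rng);
    std::vector<double> u(m);
    for (double& ui : u) ui = N(rng);
    double lam = 0.0;
    for (int t = 0; t < iters; ++t) {
        std::vector<double> z(n, 0.0), w(m, 0.0);
        for (int i = 0; i < n; ++i) {                 // z = X u
            const double* row = &X[(size_t)i * m];
            double s = 0.0;
            for (int j = 0; j < m; ++j) s += row[j] * u[j];
            z[i] = s;
        }
        for (int i = 0; i < n; ++i) {                 // w = X^T z
            const double* row = &X[(size_t)i * m];
            for (int j = 0; j < m; ++j) w[j] += row[j] * z[i];
        }
        double num = 0.0, den = 0.0, wn = 0.0;
        for (int j = 0; j < m; ++j) {
            num += w[j] * u[j];
            den += u[j] * u[j];
            wn += w[j] * w[j];
        }
        lam = num / (n * den);                        // Rayleigh quotient of (1/n) X^T X
        wn = std::sqrt(wn);
        for (int j = 0; j < m; ++j) u[j] = w[j] / wn; // normalized next iterate
    }
    return lam;
}
```

With n = 2000 and m = 500, c = 0.25 and λ₊ = 1.5² = 2.25; the estimate should land within a few percent of the edge (finite-size fluctuations follow the Tracy–Widom scale n^{-2/3}).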

Gradient Descent Stability (Quadratic)

0 < η < 2/λ_max(H)

Explanation: For minimizing a quadratic with Hessian H using gradient descent, the step size must be below 2/λ_max(H) to ensure convergence.
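The bound is easy to verify on a one-dimensional quadratic, where H reduces to the scalar curvature λ and the threshold is η = 2/λ (a toy sketch):

```cpp
#include <cmath>

// Gradient descent on f(x) = 0.5 * lam * x^2, so grad f(x) = lam * x.
// The iterate obeys x_{k+1} = (1 - eta * lam) * x_k, which contracts
// exactly when |1 - eta * lam| < 1, i.e. 0 < eta < 2 / lam.
double gd_final_abs(double lam, double eta, double x0, int steps) {
    double x = x0;
    for (int k = 0; k < steps; ++k) x -= eta * (lam * x);
    return std::fabs(x);
}
```

With λ = 4 the threshold is η = 0.5: a step size of 0.45 converges geometrically, while 0.55 diverges.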

Rayleigh Quotient and Power Iteration

r_k = (v_k^T A v_k)/(v_k^T v_k),  v_{k+1} = A v_k / ∄A v_k∄

Explanation: The Rayleigh quotient r_k approximates the dominant eigenvalue of a symmetric matrix A as the normalized iterates v_k converge to the top eigenvector.
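A minimal instance of this recursion (a sketch on a 2×2 symmetric matrix whose eigenvalues, 3 and 1, are known in closed form):

```cpp
#include <cmath>

// Power iteration on the symmetric matrix A = [[2,1],[1,2]], whose
// eigenvalues are 3 (eigenvector (1,1)/sqrt(2)) and 1 (eigenvector (1,-1)/sqrt(2)).
// Returns the Rayleigh quotient after `iters` steps.
double power_iteration_2x2(int iters) {
    double v0 = 1.0, v1 = 0.0;              // start away from the top eigenvector
    double r = 0.0;
    for (int k = 0; k < iters; ++k) {
        double w0 = 2.0 * v0 + 1.0 * v1;    // w = A v
        double w1 = 1.0 * v0 + 2.0 * v1;
        r = (v0 * w0 + v1 * w1) / (v0 * v0 + v1 * v1);  // Rayleigh quotient
        double n = std::sqrt(w0 * w0 + w1 * w1);        // v_{k+1} = w / ||w||
        v0 = w0 / n;
        v1 = w1 / n;
    }
    return r;
}
```

The error contracts like (λ₂/λ₁)^k = (1/3)^k in the eigenvector and quadratically faster in the Rayleigh quotient, so a few dozen iterations reach machine precision here.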

Trace–Eigenvalue Identity

∑_{i=1}^{n} λ_i(A) = tr(A)

Explanation: The sum of eigenvalues equals the trace, which is easy to compute. Useful for sanity checks and moment estimates.

Frobenius Norm and Singular Values

∄W∄_F² = tr(W^T W) = ∑_i σ_i²

Explanation: The squared Frobenius norm equals the sum of squared singular values, relating entrywise scale to spectral scale.

Condition Number

Îș₂(W) = σ_max(W)/σ_min(W)

Explanation: A large condition number indicates ill-conditioning and potential optimization slowdowns or instability.

Complexity Analysis

Computing the full spectrum of an n×n matrix via dense eigendecomposition (e.g., the QR algorithm) costs O(n³) time and O(n²) space. For a rectangular m×n weight W with m ≈ n ≈ d, forming the Gram matrix W^T W already costs O(m n²) time and O(n²) space, and the subsequent eigendecomposition is O(n³). This makes exact ESD computation infeasible for large layers (e.g., d ≄ 10⁎). Iterative methods substantially reduce costs when only partial information is needed. Power iteration for the spectral norm of W uses repeated products v ← W^T(W v), each costing O(m n); with k iterations, time is O(k m n) and space is O(m + n), avoiding explicit Gram formation. Convergence is geometric with rate |λ₂/λ₁| for symmetric matrices, so a moderate k (e.g., 20–100) often suffices. Lanczos methods approximate multiple extremal eigenvalues, or even the spectral density, through a small tridiagonal surrogate in O(k m n) time and O(k n) space (storing k basis vectors). When estimating ESDs, stochastic variants (e.g., Hutchinson trace estimation with Lanczos quadrature) trade a small bias for massive gains in scalability. For mini-batch tracking during training, amortizing k over epochs further reduces overhead. In summary: use O(n³) full decompositions only for small layers or research prototypes; otherwise, prefer O(k m n) iterative methods with implicit products, ensuring numerical stability through normalization and occasional re-orthogonalization.
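The implicit-product trick at the heart of these estimates can be made concrete with a simple multiply-add count (a counting sketch with rough constants, not a benchmark):

```cpp
// Multiply-add counts (not wall-clock time) for estimating sigma_max of an
// m x n weight matrix W:
//   implicit: k iterations of v -> W^T (W v), about 2*m*n multiply-adds each;
//   explicit: first form G = W^T W (about m*n*n multiply-adds), then run
//             k iterations on the n x n Gram matrix (n*n each).
long long flops_implicit(long long m, long long n, long long k) {
    return 2 * k * m * n;
}

long long flops_explicit(long long m, long long n, long long k) {
    return m * n * n + k * n * n;
}
```

For a square layer with d = 4096 and k = 50, the implicit route needs about 1.7 × 10⁹ multiply-adds while forming the Gram matrix alone needs about 6.9 × 10¹⁰, which is why the iterative methods above always use v ← W^T(W v).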

Code Examples

Tracking the largest singular value during training via implicit power iteration (no external libraries)
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    int r, c;
    vector<double> a; // row-major storage
    Matrix() : r(0), c(0) {}
    Matrix(int r_, int c_, double val = 0.0) : r(r_), c(c_), a(r_ * c_, val) {}
    double &operator()(int i, int j) { return a[i * c + j]; }
    const double &operator()(int i, int j) const { return a[i * c + j]; }
};

// Multiply A (r x c) by vector x (size c) -> y (size r)
vector<double> matvec(const Matrix &A, const vector<double> &x) {
    vector<double> y(A.r, 0.0);
    for (int i = 0; i < A.r; ++i) {
        const double *row = &A.a[i * A.c];
        double s = 0.0;
        for (int j = 0; j < A.c; ++j) s += row[j] * x[j];
        y[i] = s;
    }
    return y;
}

// Multiply A^T by vector x (size r) -> y (size c)
vector<double> matTvec(const Matrix &A, const vector<double> &x) {
    vector<double> y(A.c, 0.0);
    for (int i = 0; i < A.r; ++i) {
        double xi = x[i];
        const double *row = &A.a[i * A.c];
        for (int j = 0; j < A.c; ++j) y[j] += row[j] * xi;
    }
    return y;
}

// Matrix-matrix multiply: C = A * B
Matrix matmul(const Matrix &A, const Matrix &B) {
    assert(A.c == B.r);
    Matrix C(A.r, B.c, 0.0);
    for (int i = 0; i < A.r; ++i) {
        for (int k = 0; k < A.c; ++k) {
            double aik = A(i, k);
            for (int j = 0; j < B.c; ++j) {
                C(i, j) += aik * B(k, j);
            }
        }
    }
    return C;
}

Matrix transpose(const Matrix &A) {
    Matrix T(A.c, A.r, 0.0);
    for (int i = 0; i < A.r; ++i)
        for (int j = 0; j < A.c; ++j)
            T(j, i) = A(i, j);
    return T;
}

// Spectral norm via power iteration on the implicit Gram operator v -> A^T(A v).
// Returns an estimate of ||A||_2.
double spectral_norm_power_iteration(const Matrix &A, int iters = 50, unsigned seed = 42) {
    mt19937 rng(seed);
    normal_distribution<double> N(0.0, 1.0);
    // Work in A.c-dimensional space (columns of A)
    vector<double> v(A.c);
    for (double &vi : v) vi = N(rng);
    auto normalize = [](vector<double> &x) {
        double n = 0.0;
        for (double xi : x) n += xi * xi;
        n = sqrt(max(n, 1e-30));
        for (double &xi : x) xi /= n;
        return n;
    };
    normalize(v);
    double rayleigh = 0.0;
    for (int t = 0; t < iters; ++t) {
        vector<double> z = matvec(A, v);       // z = A v
        vector<double> v_next = matTvec(A, z); // v_next = A^T (A v)
        // Rayleigh quotient for A^T A: r = (v^T A^T A v)/(v^T v) = ||A v||^2 / ||v||^2
        double num = 0.0; for (double zi : z) num += zi * zi;
        double den = 0.0; for (double vi : v) den += vi * vi;
        rayleigh = (den > 0 ? num / den : 0.0);
        normalize(v_next);
        v.swap(v_next);
    }
    return sqrt(max(0.0, rayleigh));
}

// Simple 2-layer linear network: y_hat = X W1 W2; MSE loss; gradient descent
int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int m = 256;      // samples
    int d = 64;       // input dim
    int h = 64;       // hidden dim
    int o = 10;       // output dim
    double lr = 0.05; // learning rate
    int epochs = 50;

    mt19937 rng(1);
    normal_distribution<double> N(0.0, 1.0);

    Matrix X(m, d), W1(d, h), W2(h, o), Y(m, o);

    // Random data and target with a planted low-rank structure
    for (double &v : X.a) v = N(rng) / sqrt(d);
    Matrix U_true(d, 3), V_true(3, o);
    for (double &v : U_true.a) v = N(rng) / sqrt(d);
    for (double &v : V_true.a) v = N(rng) / sqrt(o);
    Matrix W_true = matmul(U_true, V_true);
    Matrix XW_true = matmul(X, W_true);
    // Add small noise to targets
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < o; ++j)
            Y(i, j) = XW_true(i, j) + 0.05 * N(rng);

    // Initialize weights
    for (double &v : W1.a) v = N(rng) * sqrt(2.0 / d);
    for (double &v : W2.a) v = N(rng) * sqrt(2.0 / h);

    auto mse = [&](const Matrix &P) {
        double s = 0.0;
        for (double x : P.a) s += x * x;
        return 0.5 * s / m;
    };

    cout.setf(std::ios::fixed);
    cout << setprecision(6);
    for (int e = 0; e < epochs; ++e) {
        // Forward: H = X W1; Yhat = H W2
        Matrix H = matmul(X, W1);
        Matrix Yhat = matmul(H, W2);
        // Error and loss
        Matrix E(m, o);
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < o; ++j)
                E(i, j) = Yhat(i, j) - Y(i, j);
        double loss = mse(E);

        // Gradients: dW2 = (H^T E)/m ; dW1 = (X^T (E W2^T))/m
        Matrix HT = transpose(H);
        Matrix dW2 = matmul(HT, E);
        for (double &v : dW2.a) v /= m;
        Matrix XT = transpose(X);
        Matrix EW2T = matmul(E, transpose(W2)); // E * W2^T, shape m x h
        Matrix dW1 = matmul(XT, EW2T);
        for (double &v : dW1.a) v /= m;

        // Update
        for (size_t i = 0; i < W1.a.size(); ++i) W1.a[i] -= lr * dW1.a[i];
        for (size_t i = 0; i < W2.a.size(); ++i) W2.a[i] -= lr * dW2.a[i];

        // Track spectral norms of W1, W2, and product W_eff = W1 W2 (via power iteration)
        double s1 = spectral_norm_power_iteration(W1, 40, 100 + e);
        double s2 = spectral_norm_power_iteration(W2, 40, 200 + e);
        Matrix Weff = matmul(W1, W2);
        double se = spectral_norm_power_iteration(Weff, 40, 300 + e);

        cout << "epoch " << setw(2) << e
             << " loss=" << loss
             << " ||W1||2=" << s1
             << " ||W2||2=" << s2
             << " ||W1W2||2=" << se << "\n";
    }
    return 0;
}

This program trains a small 2-layer linear network with mean-squared error. After each epoch, it estimates the spectral norms of W1, W2, and their product W1W2 using power iteration without forming Gram matrices. The spectral norm tracks the maximum amplification of each layer, which is closely tied to stability and gradient behavior. The planted low-rank target induces outlier singular values as training progresses.

Time: Each epoch's forward/backward pass is O(m d h + m h o + d h o). Each spectral-norm estimate with k iterations costs O(k m n) for an m×n matrix; overall, each epoch is dominated by the matrix multiplies and the O(k m n) power iterations.
Space: O(m d + d h + h o) to store data and weights, plus O(m + max(d, h, o)) for temporary vectors used in power iteration.
Full small-matrix spectrum and ESD histogram using Eigen (for research-scale demos)
#include <bits/stdc++.h>
#include <Eigen/Dense>
using namespace std;
using namespace Eigen;

// Compile with: g++ -O2 -std=c++17 main.cpp -I /path/to/eigen

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int m = 512, d = 64, h = 64, o = 16; // small so eigendecomposition is feasible
    double lr = 0.05;
    int epochs = 30;

    std::mt19937 rng(42);
    std::normal_distribution<double> N(0.0, 1.0);

    auto randn = [&](int r, int c) {
        MatrixXd X(r, c);
        for (int i = 0; i < r; i++)
            for (int j = 0; j < c; j++) X(i, j) = N(rng);
        return X;
    };

    MatrixXd X = randn(m, d) / sqrt(d);
    MatrixXd W1 = randn(d, h) * sqrt(2.0 / d);
    RowVectorXd b1 = RowVectorXd::Zero(h);
    MatrixXd W2 = randn(h, o) * sqrt(2.0 / h);
    RowVectorXd b2 = RowVectorXd::Zero(o);

    // Create targets with a planted low-rank signal through a ReLU layer
    MatrixXd U = randn(d, 3) / sqrt(d);
    MatrixXd V = randn(3, o) / sqrt(o);
    MatrixXd Z = (X * U).array().max(0.0).matrix() * V; // nonlinearity
    MatrixXd Y = Z + 0.05 * randn(m, o);

    auto relu = [](const MatrixXd &A) -> MatrixXd { return A.array().max(0.0).matrix(); };

    for (int e = 0; e < epochs; ++e) {
        // Forward
        MatrixXd Hpre = X * W1;
        Hpre.rowwise() += b1;
        MatrixXd H = relu(Hpre);
        MatrixXd Yhat = H * W2;
        Yhat.rowwise() += b2;
        MatrixXd E = Yhat - Y;
        double loss = 0.5 * E.squaredNorm() / m;

        // Backward
        MatrixXd dYhat = E / m;
        MatrixXd dW2 = H.transpose() * dYhat;
        RowVectorXd db2 = dYhat.colwise().sum();
        MatrixXd dH = dYhat * W2.transpose();
        MatrixXd dHpre = (dH.array() * (Hpre.array() > 0.0).cast<double>()).matrix();
        MatrixXd dW1 = X.transpose() * dHpre;
        RowVectorXd db1 = dHpre.colwise().sum();

        // Update
        W2 -= lr * dW2; b2 -= lr * db2;
        W1 -= lr * dW1; b1 -= lr * db1;

        cout.setf(std::ios::fixed);
        cout << setprecision(6);
        cout << "epoch " << setw(2) << e << " loss=" << loss;

        // Exact singular values of W1 (small d, h) and eigenvalues of symmetric part
        JacobiSVD<MatrixXd> svd(W1, ComputeThinU | ComputeThinV);
        VectorXd s = svd.singularValues();
        double smax = s(0);
        cout << " ||W1||2=" << smax;

        // Symmetric part eigenvalues (real): S = (W1 + W1^T)/2
        MatrixXd S = 0.5 * (W1 + W1.transpose());
        SelfAdjointEigenSolver<MatrixXd> es(S);
        VectorXd evals = es.eigenvalues();
        double lam_max = evals.maxCoeff();
        cout << " lam_max(S)=" << lam_max << "\n";
    }

    // Final: ESD histogram of W1^T W1 (squared singular values)
    MatrixXd G = W1.transpose() * W1; // positive semidefinite
    SelfAdjointEigenSolver<MatrixXd> esG(G);
    VectorXd lams = esG.eigenvalues(); // these are sigma_i^2

    // Histogram
    int bins = 20;
    double mn = lams.minCoeff();
    double mx = lams.maxCoeff();
    vector<int> hist(bins, 0);
    for (int i = 0; i < (int)lams.size(); ++i) {
        int b = (mx == mn) ? 0 : (int)((lams(i) - mn) / (mx - mn) * bins);
        if (b == bins) b = bins - 1;
        hist[b]++;
    }
    cout << "\nESD of W1^T W1 (sigma^2) histogram:\n";
    for (int b = 0; b < bins; ++b) {
        double left = mn + (mx - mn) * b / bins;
        double right = mn + (mx - mn) * (b + 1) / bins;
        cout << fixed << setprecision(4) << "[" << left << "," << right << "): ";
        int cnt = hist[b];
        int stars = min(60, cnt);
        for (int k = 0; k < stars; ++k) cout << '*';
        cout << " (" << cnt << ")\n";
    }

    return 0;
}

This example trains a small 1-hidden-layer ReLU network (so that exact decompositions are feasible) and then computes: (1) the exact singular values of W1, (2) eigenvalues of its symmetric part S=(W1+W1^T)/2 each epoch, and (3) a final histogram of eigenvalues of W1^T W1 (squared singular values), which is an ESD estimate. This gives a compact, practical workflow to inspect spectra and their evolution during training.

Time: Per epoch, O(m d h + m h o) for forward/backward. SVD/eigendecompositions on d×h or h×h matrices scale like O(min(d, h)² · max(d, h)). The final ESD via eigendecomposition of the h×h Gram matrix costs O(h³).
Space: O(m d + d h + h o) for data and parameters; O(h²) extra for the Gram matrix and eigensolvers.
#spectral analysis#eigenvalues#singular values#spectral norm#spectral radius#empirical spectral distribution#marchenko–pastur#power iteration#lanczos#neural network training#hessian spectrum#random matrix theory#stability#generalization#esd histogram