
Marchenko-Pastur Distribution

Key Points

  • The Marchenko–Pastur (MP) distribution describes the limiting eigenvalue distribution of sample covariance matrices S = (1/n) XX^T when both the dimension p and the sample size n grow with p/n → γ.
  • Its density has compact support on [a, b] with a = (1 − √γ)² and b = (1 + √γ)², plus a point mass at zero of weight 1 − 1/γ when γ > 1.
  • The MP density inside [a, b] equals f_γ(x) = √((b − x)(x − a)) / (2πγx). Outside this interval the density is zero.
  • If data are whitened (entries i.i.d., mean 0, variance 1), the empirical spectral distribution of S converges to MP(γ); with variance σ² it scales by σ².
  • The largest eigenvalue tends to b and the smallest nonzero eigenvalue tends to a (if γ ≤ 1); deviations at the edges are on the n^{−2/3} Tracy–Widom scale.
  • MP theory underpins high-dimensional PCA, signal detection, and shrinkage: bulk eigenvalues follow MP, while true low-rank signals appear as outliers (spikes).
  • Correct scaling by 1/n and the correct aspect ratio γ = p/n are essential; forgetting either leads to wrong comparisons with MP.
  • Finite-sample spectra fluctuate around MP; simulations at moderate sizes already show good agreement, especially in the bulk.

Prerequisites

  • →Eigenvalues and eigenvectors — Understanding spectra of symmetric matrices is essential to interpret the MP law and PCA.
  • →Random variables and independence — The MP law assumes i.i.d. entries and relies on averaging properties across many samples.
  • →Law of Large Numbers and concentration — Convergence of empirical distributions and stability of spectra require averaging and concentration ideas.
  • →Covariance and sample covariance — MP describes the eigenvalues of the sample covariance matrix specifically.
  • →Stieltjes transform and complex analysis basics — The Stieltjes transform characterizes the MP distribution and enables numerical inversion.
  • →Numerical eigendecomposition methods — Computing empirical spectra requires practical algorithms like Jacobi or QR for symmetric matrices.

Detailed Explanation


01Overview

The Marchenko–Pastur (MP) distribution is a cornerstone of random matrix theory for statistics and machine learning. It predicts the asymptotic behavior of eigenvalues of sample covariance matrices built from high-dimensional data. Suppose you collect n samples of p features and form the p × p sample covariance matrix S = (1/n) XX^T, where X is the p × n centered data matrix. When both p and n grow large such that their ratio p/n approaches a constant γ > 0, the histogram of eigenvalues of S converges to a deterministic curve: the Marchenko–Pastur law with parameter γ. This curve has a simple closed-form density on a compact interval [a, b], where a = (1 − √γ)² and b = (1 + √γ)², capturing the entire “bulk” of the spectrum. If p exceeds n (γ > 1), there is an extra point mass at zero because S becomes rank-deficient. This phenomenon appears widely: in PCA of high-dimensional, noise-only data, the cloud of eigenvalues forms the MP bulk. If the data actually contain low-rank signals (“spikes”), those signals show up as eigenvalues outside the MP support. Thus, MP provides both a baseline for what noise looks like and a diagnostic for detecting structure. Its tools (density, Stieltjes transform, edge behavior) guide thresholding, shrinkage, and performance predictions in modern data analysis.

02Intuition & Analogies

Imagine recording the height of waves on a choppy sea using many sensors. Each sensor measures random noise of the same variance. When you compute the sample covariance across sensors, you’re mixing lots of small random contributions. In low dimensions (few sensors), the covariance eigenvalues bounce around without a neat pattern. But as the number of sensors and the number of recordings both grow large together, the randomness averages out into a stable shape—like how a large jar of mixed candies settles into a predictable color ratio. The aspect ratio γ = p/n is the recipe controlling this shape. If there are far more sensors than time points (γ > 1), many directions in sensor-space do not have enough data to be determined, so you get many zero eigenvalues—hence the point mass at zero. If sensors and samples are balanced, the energy spreads smoothly over a compact interval. The edges of this interval, a and b, act like guardrails: under pure noise, eigenvalues rarely go beyond them (up to finite-sample wiggles). From a PCA viewpoint, imagine your data are pure white noise. The MP law tells you exactly what spread of eigenvalues you should expect just from noise. If you see an eigenvalue poking above the top edge b, it’s like spotting a mountain above the clouds—it likely indicates a real underlying factor (a signal) rather than random fluctuation. This is why the MP distribution is so useful: it separates what is typical under randomness from what is exceptional and potentially meaningful.

03Formal Definition

Let X be a p × n random matrix with entries X_ij that are i.i.d., centered, and have variance Var(X_ij) = σ², with suitable moment conditions (e.g., finite fourth moment). Define the sample covariance matrix S = (1/n) XX^T (p × p). Consider the empirical spectral distribution (ESD) μ_S = (1/p) ∑_{i=1}^{p} δ_{λ_i}, where {λ_i} are the eigenvalues of S. If p, n → ∞ such that p/n → γ ∈ (0, ∞), then μ_S converges weakly, almost surely, to the Marchenko–Pastur distribution MP_{γ,σ²}. For σ² = 1, the MP density is supported on [a, b], with a = (1 − √γ)² and b = (1 + √γ)², and has density f_γ(x) = √((b − x)(x − a)) / (2πγx) · 1_{[a,b]}(x). If γ > 1, there is additionally a point mass at zero of size 1 − 1/γ. For general σ², the distribution scales: the support is [σ²a, σ²b], and the density is f_{γ,σ²}(x) = (1/σ²) f_γ(x/σ²). The Stieltjes transform m(z) = ∫ 1/(x − z) dμ(x) of MP_{γ,1} satisfies an explicit formula and the self-consistent equation z = −1/m(z) + γ/(1 + m(z)) for z in the complex upper half-plane (with the appropriate branch).

04When to Use

  • High-dimensional PCA benchmarking: Use MP as the null model for eigenvalues under white noise; eigenvalues above b suggest possible low-rank signals.
  • Covariance shrinkage and regularization: MP informs how much bulk eigenvalues are inflated by sampling noise, guiding shrinkage targets.
  • Model diagnostics in factor analysis: Compare the empirical spectrum to MP to check whether residuals behave like white noise.
  • Randomized linear algebra and signal detection: Predict typical singular value scales for random projections and noise-only matrices.
  • Wireless communications and information theory: Characterize capacity of MIMO channels via eigenvalue distributions akin to MP.
  • Stress testing algorithms: Simulate matrices with an MP bulk to evaluate robustness of spectral clustering, graph embeddings, and whitening steps.

Use MP when entries are approximately i.i.d., centered, and homoscedastic, and when both p and n are large with a stable aspect ratio γ. If data have strong correlations, heavy tails, or nonstationarities, MP may not apply directly without preprocessing (e.g., whitening, truncation).

⚠️Common Mistakes

  • Wrong scaling: Using S = XX^T instead of S = (1/n) XX^T inflates the entire spectrum by a factor of n. Always divide by n before comparing to MP.
  • Confusing shapes: MP for S = (1/n) XX^T (p × p) and for (1/n) X^T X (n × n) share their nonzero spectra, but their aspect-ratio parameters are reciprocals; keep track of whether you analyze p/n or n/p, and of the atom at zero.
  • Ignoring the atom at zero: When γ > 1 (p > n), a mass 1 − 1/γ sits at zero; missing it skews density comparisons.
  • Non-i.i.d. or unwhitened data: MP assumes independent, equal-variance entries (after centering). Strong correlations or heteroscedasticity invalidate the MP baseline; whiten or model the structure first.
  • Finite-sample overinterpretation: Small excursions beyond [a, b] occur at moderate sizes; edge fluctuations scale as n^{−2/3}. Use statistical thresholds (e.g., Tracy–Widom) rather than eyeballing.
  • Misreading support scaling: If entries have variance σ² ≠ 1, the support is [σ²a, σ²b]; forgetting σ² leads to mismatched overlays.

Key Formulas

MP support

a = (1 − √γ)²,  b = (1 + √γ)²

Explanation: These are the lower and upper edges of the Marchenko–Pastur density for unit variance. All bulk eigenvalues lie between a and b in the large-dimensional limit.

MP density

f_γ(x) = √((b − x)(x − a)) / (2πγx) · 1_{[a,b]}(x)

Explanation: This is the limiting density of eigenvalues for S = (1/n) XX^T with unit-variance entries. It is zero outside [a, b] and integrates to 1 when γ ≤ 1 (otherwise the atom at 0 completes it).

Atom at zero

μ({0}) = max{0, 1 − 1/γ}

Explanation: When γ > 1 (more features than samples), the covariance matrix has a fraction 1 − 1/γ of zero eigenvalues. When γ ≤ 1 there is no atom at zero.

Stieltjes transform (closed form)

m_γ(z) = ∫ 1/(x − z) dμ_γ(x) = (1 − γ − z + √((z − a)(z − b))) / (2γz)

Explanation: This analytic formula holds for z in the complex upper half-plane with the square-root branch chosen so that Im m(z) > 0. It characterizes MP and allows recovery of the density via its imaginary part.

Self-consistent equation

z = −1/m(z) + γ/(1 + m(z))

Explanation: The Stieltjes transform of MP solves this quadratic relationship. It is often used in proofs and numerical fixed-point schemes for related models.

Inversion of Stieltjes transform

f(x) = lim_{η↓0} (1/π) Im m(x + iη)

Explanation: The density can be recovered from the imaginary part of the Stieltjes transform approached from the upper half-plane. This provides a robust way to approximate densities numerically.

Variance scaling

μ_{γ,σ²}(A) = μ_{γ,1}(A/σ²)

Explanation: If entries have variance σ², the MP distribution scales by σ². The support becomes [σ²a, σ²b] and eigenvalues are multiplied by σ².

Edge convergence

λ_max(S) → b a.s.,  λ_min⁺(S) → a a.s.

Explanation: The largest eigenvalue converges almost surely to b, and the smallest nonzero eigenvalue converges to a when γ ≤ 1. This describes the limiting edges of the spectrum.

Mean of MP (unit variance)

∫ x dμ_γ(x) = 1

Explanation: The MP distribution has mean 1 when entries are unit variance. This matches the fact that E[tr(S)/p] = 1 for whitened data.

ESD convergence

ESD(S) = (1/p) ∑_{i=1}^{p} δ_{λ_i(S)} → μ_γ a.s.

Explanation: The empirical spectral distribution of the sample covariance converges weakly, almost surely, to the MP law as p, n → ∞ with p/n → γ.

Complexity Analysis

Let X be p × n. Forming the sample covariance S = (1/n) XX^T naively costs O(p²n) time and O(p²) extra space if you stream columns, or O(pn) space if you store X to reuse it. In the provided code we materialize X and then accumulate S, so memory is O(pn + p²). For dense data with p ≈ n, this dominates computation. Computing all eigenvalues of the symmetric p × p matrix S with a simple Jacobi method costs roughly O(p³) per full sweep; a few sweeps (tens to hundreds) are often enough for moderate accuracy, leading to a practical time of c · O(p³). Professional libraries (LAPACK’s dsyev or Eigen) achieve O(p³) with better constants using tridiagonal reduction and QR. Thus, end-to-end, the simulation is O(p²n + p³) time and O(pn + p²) space. Evaluating the MP density or Stieltjes transform on a grid of N points is cheap: O(N) time and O(1) space per point (constant work per x). Using the Stieltjes transform to approximate the density via f(x) ≈ (1/π) Im m(x + iη) with fixed η adds only constant-factor costs due to complex arithmetic. In practice: when n is much larger than p, forming S (O(p²n)) is the bottleneck; when p is comparable to or larger than n, the eigendecomposition (O(p³)) dominates. For large-scale applications, randomized methods or iterative eigensolvers compute only the top k eigenvalues in roughly O(p²n + kp²) time, but here we deliberately compute the full spectrum for didactic purposes.

Code Examples

Simulate sample covariance eigenvalues and compare to Marchenko–Pastur density
#include <bits/stdc++.h>
using namespace std;

// Index helper for row-major p x p matrices
inline size_t idx(size_t i, size_t j, size_t p) { return i * p + j; }

// Build S = (1/n) * X * X^T from X (p x n), with X stored row-major as vector<double>
vector<double> sample_covariance(const vector<double>& X, size_t p, size_t n) {
    vector<double> S(p * p, 0.0);
    // Accumulate S_ij = (1/n) sum_k X_{i,k} X_{j,k}
    for (size_t k = 0; k < n; ++k) {
        // For column k, access X_{i,k} at X[i*n + k]
        for (size_t i = 0; i < p; ++i) {
            double xi = X[i * n + k];
            for (size_t j = 0; j <= i; ++j) { // compute lower triangle, mirror later
                S[idx(i, j, p)] += xi * X[j * n + k];
            }
        }
    }
    // Scale by 1/n and symmetrize
    double invn = 1.0 / static_cast<double>(n);
    for (size_t i = 0; i < p; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            S[idx(i, j, p)] *= invn;
            S[idx(j, i, p)] = S[idx(i, j, p)];
        }
    }
    return S;
}

// Symmetric Jacobi eigenvalue algorithm to compute eigenvalues of S (no eigenvectors)
vector<double> jacobi_eigenvalues(vector<double> A, size_t p, int max_sweeps = 100, double tol = 1e-10) {
    // A is p x p symmetric, row-major; it is overwritten in place
    vector<double> diag(p);
    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        // One sweep over all i < j pairs
        double max_off = 0.0;
        for (size_t i = 0; i + 1 < p; ++i) {
            for (size_t j = i + 1; j < p; ++j) {
                double aii = A[idx(i, i, p)];
                double ajj = A[idx(j, j, p)];
                double aij = A[idx(i, j, p)];
                max_off = max(max_off, fabs(aij));
                if (fabs(aij) < tol) continue;
                // Compute Jacobi rotation parameters
                double tau = (ajj - aii) / (2.0 * aij);
                double t = (tau >= 0.0) ? 1.0 / (tau + sqrt(1.0 + tau * tau))
                                        : 1.0 / (tau - sqrt(1.0 + tau * tau));
                double c = 1.0 / sqrt(1.0 + t * t);
                double s = t * c;
                // Apply rotation to rows and columns i and j
                for (size_t k = 0; k < p; ++k) {
                    if (k != i && k != j) {
                        double aik = A[idx(i, k, p)];
                        double ajk = A[idx(j, k, p)];
                        double tik = c * aik - s * ajk;
                        double tjk = s * aik + c * ajk;
                        A[idx(i, k, p)] = tik;
                        A[idx(k, i, p)] = tik; // symmetry
                        A[idx(j, k, p)] = tjk;
                        A[idx(k, j, p)] = tjk; // symmetry
                    }
                }
                // Update the 2x2 diagonal block; the (i, j) entry is annihilated
                double aii_new = c * c * aii - 2.0 * s * c * aij + s * s * ajj;
                double ajj_new = s * s * aii + 2.0 * s * c * aij + c * c * ajj;
                A[idx(i, i, p)] = aii_new;
                A[idx(j, j, p)] = ajj_new;
                A[idx(i, j, p)] = 0.0;
                A[idx(j, i, p)] = 0.0;
            }
        }
        if (max_off < tol) break; // Converged
    }
    for (size_t i = 0; i < p; ++i) diag[i] = A[idx(i, i, p)];
    return diag;
}

// Marchenko–Pastur PDF for unit variance
double mp_pdf(double x, double gamma) {
    double a = (1.0 - sqrt(gamma)) * (1.0 - sqrt(gamma));
    double b = (1.0 + sqrt(gamma)) * (1.0 + sqrt(gamma));
    if (x <= a || x >= b) return 0.0;
    double val = sqrt((b - x) * (x - a));
    return val / (2.0 * M_PI * gamma * x);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Parameters
    size_t p = 200; // number of features (rows)
    size_t n = 400; // number of samples (columns)
    double gamma = static_cast<double>(p) / static_cast<double>(n);

    // 1) Generate X with i.i.d. N(0,1) entries
    std::mt19937_64 rng(12345);
    std::normal_distribution<double> nd(0.0, 1.0);
    vector<double> X(p * n);
    for (size_t i = 0; i < p; ++i)
        for (size_t k = 0; k < n; ++k)
            X[i * n + k] = nd(rng);

    // 2) Compute sample covariance S = (1/n) X X^T
    vector<double> S = sample_covariance(X, p, n);

    // 3) Compute eigenvalues via Jacobi (educational; for speed use LAPACK/Eigen)
    vector<double> evals = jacobi_eigenvalues(S, p, /*max_sweeps=*/60, /*tol=*/1e-9);

    // 4) Build an empirical density histogram and compare to the MP pdf
    double a = (1.0 - sqrt(gamma)) * (1.0 - sqrt(gamma));
    double b = (1.0 + sqrt(gamma)) * (1.0 + sqrt(gamma));

    // Histogram range: wide enough to cover both the eigenvalues and [a, b]
    double minv = *min_element(evals.begin(), evals.end());
    double maxv = *max_element(evals.begin(), evals.end());
    double lo = max(0.0, min(minv, a));
    double hi = max(maxv, b);

    int bins = 60;
    vector<int> counts(bins, 0);
    double width = (hi - lo) / bins;
    for (double lam : evals) {
        int bin = (lam <= lo) ? 0 : (lam >= hi ? bins - 1 : static_cast<int>((lam - lo) / width));
        counts[bin]++;
    }

    // Print: bin_center, empirical_density, mp_pdf
    cout << fixed << setprecision(6);
    cout << "# bin_center\tempirical_density\tmp_pdf\n";
    for (int i = 0; i < bins; ++i) {
        double center = lo + (i + 0.5) * width;
        double empirical_density = static_cast<double>(counts[i]) / (static_cast<double>(p) * width);
        double theory = mp_pdf(center, gamma);
        cout << center << "\t" << empirical_density << "\t" << theory << "\n";
    }

    // Optional: report mass near zero vs. the predicted atom when gamma > 1
    if (gamma > 1.0) {
        double atom_theory = 1.0 - 1.0 / gamma;
        int near_zero = 0;
        for (double lam : evals) if (lam < 1e-8) near_zero++;
        double atom_emp = static_cast<double>(near_zero) / static_cast<double>(p);
        cerr << "# Atom at 0 (theory vs empirical): " << atom_theory << " vs " << atom_emp << "\n";
    }

    return 0;
}

This program generates a p × n matrix X with i.i.d. standard normal entries, forms the sample covariance S = (1/n)XX^T, and computes all eigenvalues via a simple symmetric Jacobi method. It then constructs a histogram-based empirical density and prints it alongside the theoretical Marchenko–Pastur density at the same bin centers. For gamma > 1, it also reports the expected atom at zero and the observed fraction of (near-)zero eigenvalues. Use a plotting tool to visualize the two columns (empirical vs. theoretical) to see the MP bulk emerge.

Time: O(p²n) to form S and approximately O(p³) for the Jacobi eigensolver (over a modest number of sweeps).
Space: O(pn + p²) to store X and S; O(p²) workspace during Jacobi.
Stieltjes transform of Marchenko–Pastur and density recovery
#include <bits/stdc++.h>
using namespace std;

// Compute MP support endpoints for unit variance
inline pair<double,double> mp_support(double gamma) {
    double s = sqrt(gamma);
    return { (1.0 - s) * (1.0 - s), (1.0 + s) * (1.0 + s) };
}

// Closed-form MP Stieltjes transform m(z) = (1 - gamma - z + sqrt((z - a)(z - b))) / (2 gamma z)
// Branch chosen so that Im sqrt(...) >= 0 when Im z > 0.
complex<double> mp_stieltjes(complex<double> z, double gamma) {
    auto [a, b] = mp_support(gamma);
    complex<double> root = sqrt((z - a) * (z - b));
    if (imag(root) < 0) root = -root; // ensure correct analytic branch in upper half-plane
    return (1.0 - gamma - z + root) / (2.0 * gamma * z);
}

// Closed-form MP density for unit variance
double mp_pdf(double x, double gamma) {
    auto [a, b] = mp_support(gamma);
    if (x <= a || x >= b) return 0.0;
    double val = sqrt((b - x) * (x - a));
    return val / (2.0 * M_PI * gamma * x);
}

// Approximate density via Stieltjes inversion: f(x) ~= (1/pi) * Im m(x + i*eta)
// Smaller eta -> more accurate but more oscillatory; choose a moderate eta.
double mp_pdf_from_stieltjes(double x, double gamma, double eta) {
    complex<double> z(x, eta);
    complex<double> m = mp_stieltjes(z, gamma);
    return imag(m) / M_PI;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double gamma = 0.5; // aspect ratio p/n
    auto [a, b] = mp_support(gamma);
    int N = 25;         // number of evaluation points across the support
    double eta = 1e-3;  // imaginary shift for Stieltjes inversion

    cout << fixed << setprecision(8);
    cout << "# x\tclosed_form_pdf\tstieltjes_pdf(eta=" << eta << ")\n";
    for (int i = 0; i <= N; ++i) {
        double x = a + (b - a) * (static_cast<double>(i) / N);
        double f_closed = mp_pdf(x, gamma);
        double f_st = mp_pdf_from_stieltjes(x, gamma, eta);
        cout << x << "\t" << f_closed << "\t" << f_st << "\n";
    }

    // Also show behavior slightly outside the support (should be near zero)
    double x_left = max(0.0, a - 0.1 * (b - a));
    double x_right = b + 0.1 * (b - a);
    cout << "# Outside support checks:\n";
    cout << x_left << "\t" << mp_pdf(x_left, gamma) << "\t" << mp_pdf_from_stieltjes(x_left, gamma, eta) << "\n";
    cout << x_right << "\t" << mp_pdf(x_right, gamma) << "\t" << mp_pdf_from_stieltjes(x_right, gamma, eta) << "\n";

    return 0;
}

This program implements the closed-form Stieltjes transform of the Marchenko–Pastur law and uses the inversion formula f(x) ≈ (1/π) Im m(x + iη) to recover the density numerically. It prints the closed-form density and the Stieltjes-based approximation at grid points across the support and a couple of points outside. Reducing η improves accuracy but may increase numerical oscillations near the edges; values around 1e−3 to 1e−2 are practical for double precision.

Time: O(N) for N evaluation points (constant-time complex arithmetic per point).
Space: O(1) additional space beyond outputs.
#marchenko-pastur#random matrix theory#sample covariance#eigenvalue distribution#wishart#pca#stieltjes transform#bulk spectrum#support interval#tracy-widom#resolvent#spectral density#high-dimensional statistics