Deep Learning Generalization Theory

Key Points

  • Deep learning generalization theory tries to explain why overparameterized networks can fit (interpolate) training data yet still perform well on new data.
  • Classical capacity measures like VC dimension or basic Rademacher complexity are usually too loose for modern deep nets because they only count parameters, not the solutions optimization actually finds.
  • Gradient descent with common losses has an implicit regularization: it prefers minimum-norm or max-margin solutions among all interpolators, which often generalize better.
  • The flatness of minima connects to generalization via PAC-Bayes: broader (flatter) regions allow tighter bounds because small parameter noise does not change predictions much.
  • In very wide networks, training behaves like kernel regression with the Neural Tangent Kernel (NTK), making generalization analyzable by kernel methods.
  • Double descent describes test error that first decreases, then increases near the interpolation threshold, and decreases again as models become extremely overparameterized.
  • Lottery ticket and sparsity results suggest a small, well-initialized subnetwork can often explain most of the learned generalization.
  • Margins, compression, and information-theoretic views (information bottleneck) provide complementary insights, but the full picture is still an active research area.

Prerequisites

  • Probability and Statistics: Generalization is defined in terms of expectations and concentration; PAC-Bayes and margin bounds rely on probabilistic reasoning.
  • Linear Algebra: Minimum-norm solutions, pseudoinverses, kernels, and margins are linear-algebraic.
  • Gradient Descent and Convex Optimization: Implicit regularization arises from optimization dynamics; understanding step sizes and convergence is essential.
  • Statistical Learning Theory: Classical PAC, VC dimension, and Rademacher complexity provide baseline generalization concepts to contrast with modern results.
  • Kernel Methods: The NTK regime reduces deep network training to kernel regression, requiring familiarity with Gram matrices and ridge regression.

Detailed Explanation

01 Overview

Deep Learning Generalization Theory seeks to answer a paradox: modern neural networks often have far more parameters than training examples and can perfectly fit the training data, yet they still predict well on unseen data. Traditional statistical learning theory suggests that large-capacity models should overfit unless regularized explicitly, but this prediction is at odds with everyday practice in deep learning. This field studies what actually controls generalization in modern regimes. Core ideas include implicit regularization—the observation that optimization methods like gradient descent favor particular low-complexity solutions even without explicit penalties. The geometry of the loss landscape, especially the flatness or sharpness of minima, appears linked to generalization and can be formalized by PAC-Bayes bounds. For very wide networks, the Neural Tangent Kernel (NTK) connects training dynamics to kernel methods, enabling precise predictions in the so-called lazy-training regime. Phenomena like double descent show that test error can improve again after the model reaches the capacity to interpolate, challenging classical bias-variance intuition. Complementary views—margin-based analyses, lottery ticket sparsity, and information bottleneck compression—provide additional, sometimes competing, explanations. Together, these ideas form a toolkit to reason about when and why deep models generalize, though no single theory completely explains all observations.

02 Intuition & Analogies

Imagine trying to fit a flexible sheet over a set of pegs on a board: with enough flexibility (many parameters), you can touch every peg exactly. Classical theory warns that a too-flexible sheet will wiggle wildly elsewhere, making poor predictions. Yet, in practice, deep nets often drape smoothly between pegs. Why? One reason is that the sheet isn’t moved arbitrarily—it’s pulled by consistent, gentle tugs (gradient descent) from a particular starting pose (initialization). These tugs preferentially settle the sheet into a smooth configuration that touches the pegs with as little bending as possible, even though many wiggly configurations exist. This is implicit regularization: the search procedure itself has a preference for simple shapes. Flat minima add another layer to the analogy: if the sheet rests on a broad, flat valley, slight nudges (like noise or perturbations) don’t change the fit much, suggesting robustness and better generalization. Sharp valleys, in contrast, change rapidly and may overfit quirks. In very wide networks, another simplification occurs: the sheet is so stiff around its initial pose that training is almost linear—behaving like using a fixed template (kernel) to interpolate, which is well understood mathematically (the NTK view). As models get larger, a surprising double descent can happen: the fit improves, gets worse right when you just barely can touch all pegs (interpolation threshold), and then improves again as the sheet becomes so flexible that it can find very smooth fits that still pass through all pegs. Finally, the lottery ticket idea says you might have a small piece of the sheet that was perfectly shaped from the start; training just finds and amplifies it.

03 Formal Definition

Generalization is the ability of a learned predictor f to achieve low population risk \(R(f)\) on the data distribution \(P\), given that it is trained to minimize empirical risk \(\hat{R}(f)\) on a finite sample: \(R(f) = \mathbb{E}_{(x,y)\sim P}[\ell(f(x), y)]\) and \(\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i)\). Classical learning theory controls the generalization gap \(R(f) - \hat{R}(f)\) via capacity measures like VC dimension or Rademacher complexity of the hypothesis class \(\mathcal{H}\), producing bounds of the form \(R(f) \le \hat{R}(f) + O(\mathfrak{R}_n(\mathcal{H}))\). In deep learning, the apparent capacity of \(\mathcal{H}\) is enormous, yet generalization occurs. Modern theories refine the effective hypothesis class by incorporating training dynamics and solution geometry. For squared loss in linear models with underdetermined systems, continuous-time gradient flow from zero initialization converges to the minimum-\(\ell_2\)-norm interpolant \(w^{*} = X^\top (X X^\top)^{-1} y\). For separable data and logistic loss, gradient descent's direction converges to the maximum-margin classifier, the minimizer of \(\|w\|_2\) subject to \(y_i\, w^\top x_i \ge 1\) for all \(i\), even without explicit regularization. PAC-Bayes bounds upper-bound the expected risk of a stochastic predictor by a term depending on the empirical risk and a KL divergence between a posterior over parameters and a prior, capturing flatness through local posteriors. In wide networks, the Neural Tangent Kernel linearizes training dynamics, yielding kernel regression with kernel \(\Theta(x,x') = \nabla_\theta f_{\theta_0}(x)^\top \nabla_\theta f_{\theta_0}(x')\), enabling precise generalization analysis under overparameterization.
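A minimal sketch of why gradient flow on squared loss selects the minimum-\(\ell_2\)-norm interpolant, assuming \(X \in \mathbb{R}^{n \times d}\) has full row rank and initialization \(w(0) = 0\) (a compressed derivation, not a full proof):

\[
\dot{w}(t) = -\nabla_w \tfrac{1}{2}\|Xw(t) - y\|_2^2 = -X^\top\big(Xw(t) - y\big)
\;\Longrightarrow\; w(t) \in \operatorname{row}(X) \text{ for all } t,
\]
\[
\text{so at convergence } w_\infty = X^\top \alpha \text{ with } X w_\infty = y
\;\Longrightarrow\; \alpha = (XX^\top)^{-1} y
\;\Longrightarrow\; w_\infty = X^\top (XX^\top)^{-1} y.
\]

Any other interpolant differs from \(w_\infty\) only by a component orthogonal to \(\operatorname{row}(X)\), which can only increase the norm, so \(w_\infty\) is the minimum-\(\ell_2\)-norm solution.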

04 When to Use

Use these ideas to reason about model size, training procedures, and regularization choices in regimes where classical parameter-count arguments fail. If you are training extremely wide networks or using very small explicit regularization yet seeing good test performance, NTK and implicit bias perspectives explain stability and can inform learning-rate and initialization choices. When your data are (nearly) linearly separable in the learned representation, margin-based theory motivates using losses like logistic or exponential and monitoring margins, rather than only accuracy. PAC-Bayes and flat-minima reasoning are useful when you consider noise injection, stochastic weight averaging, or ensembling; they justify why averaging or finding broader basins can improve generalization. When scaling width far beyond sample size, double descent warns that performance can worsen near the interpolation threshold, suggesting either more overparameterization (counterintuitive but often effective), small ridge regularization, or early stopping. If compute is tight, lottery ticket insights suggest pruning or sparse training can preserve accuracy, especially when fine-tuning large models. Finally, compression and information bottleneck views guide architecture choices that encourage succinct internal representations—for example, bottleneck layers or dropout—to reduce overfitting when label noise is present.
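As one concrete illustration of the weight-averaging idea mentioned above, the sketch below averages parameter snapshots from the tail of a noisy training run; the quadratic toy loss, noise level, and snapshot schedule are illustrative assumptions rather than any specific published recipe. Under gradient noise, the averaged iterate typically sits closer to the center of the basin than the last iterate, which is the intuition stochastic weight averaging exploits at larger scale.

#include <bits/stdc++.h>
using namespace std;

// Sketch: tail-averaging of weight iterates (SWA flavor) on a noisy quadratic loss.
// The loss L(w) = 0.5 * sum_j c_j * w_j^2 and the injected gradient noise are toy
// assumptions chosen only to make the effect of averaging visible.
int main(){
    mt19937_64 gen(7);
    normal_distribution<double> noise(0.0, 0.5);
    int d = 20, steps = 2000, tail_start = 1000;
    vector<double> c(d), w(d, 1.0), w_avg(d, 0.0);
    for(int j = 0; j < d; ++j) c[j] = 0.1 + j * 0.1;   // per-coordinate curvature

    auto loss = [&](const vector<double>& v){
        double L = 0; for(int j = 0; j < d; ++j) L += 0.5 * c[j] * v[j] * v[j]; return L;
    };

    double lr = 0.05; int n_avg = 0;
    for(int t = 0; t < steps; ++t){
        for(int j = 0; j < d; ++j){
            double g = c[j] * w[j] + noise(gen);         // noisy gradient
            w[j] -= lr * g;
        }
        if (t >= tail_start){                            // running mean of tail iterates
            ++n_avg;
            for(int j = 0; j < d; ++j) w_avg[j] += (w[j] - w_avg[j]) / n_avg;
        }
    }
    cout << fixed << setprecision(6);
    cout << "loss(final iterate)    = " << loss(w)     << "\n";
    cout << "loss(averaged iterate) = " << loss(w_avg) << "\n"; // typically smaller under noise
    return 0;
}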

⚠️Common Mistakes

  • Equating parameter count with overfitting: In deep learning, more parameters can improve generalization after the interpolation threshold; use validation curves or margins, not raw counts.
  • Ignoring optimization’s role: Bounds on the entire hypothesis class are loose; what matters is the solution gradient descent finds. Misinterpreting theory by neglecting implicit bias leads to incorrect regularization strategies.
  • Overreliance on flatness as a scalar: Flat vs. sharp depends on parameterization and scale; compare using scale-invariant measures (e.g., normalized sharpness) or PAC-Bayes KL, not raw Hessian norms.
  • Confusing separability with simplicity: Large margins indicate robustness, but separable data with small margins can still overfit; monitor normalized margins and calibration, not just zero training error.
  • Treating NTK as universal: NTK applies best in very wide, small-learning-rate regimes. In feature-learning regimes (finite width, larger steps), linearization fails; expect representation drift and different generalization.
  • Misreading double descent: The interpolation peak can be mitigated by tiny ridge regularization or data augmentation; do not conclude that all interpolating models generalize poorly.
  • Ignoring label noise: Interpolation with noisy labels harms generalization; employ early stopping, small weight decay, or robust losses. Theories explaining benign interpolation often assume clean labels.
  • Cherry-picking lottery tickets: Not every network has an easily findable winning ticket for every task; pruning schedules and initializations matter. Validate sparsity claims on held-out data.

Key Formulas

Population and Empirical Risk

\(R(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x), y)\big], \qquad \hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i)\)

Explanation: Population risk is the expected loss on the true distribution; empirical risk is the average loss on the training set. Generalization is about making \(R(f)\) close to \(\hat{R}(f)\).

Rademacher Complexity Bound

\(R(f) \;\le\; \hat{R}(f) + 2\,\mathfrak{R}_n(\mathcal{H}) + O\!\left(\sqrt{\tfrac{\log(1/\delta)}{n}}\right)\)

Explanation: With probability at least \(1-\delta\), the population risk is bounded by the empirical risk plus a complexity term and a concentration term. For deep nets, \(\mathfrak{R}_n(\mathcal{H})\) can be large, making the bound loose.

Minimum-Norm Interpolant (Underdetermined)

\(w^{*} = X^\top (X X^\top)^{-1} y = \arg\min \{\, \|w\|_2 : Xw = y \,\}\)

Explanation: For linear regression with more parameters than samples and full row rank, the minimum-\(\ell_2\)-norm solution that interpolates the data is given by this pseudoinverse expression. Gradient flow from zero converges to this solution.

Max-Margin Classifier

\(\hat{w} = \arg\min_{w} \tfrac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i\, w^\top x_i \ge 1 \;\; \forall i\)

Explanation: Among all linear separators, the one with the largest margin has minimum \(\ell_2\) norm under the unit-margin constraint. Logistic/exponential-loss gradient descent on separable data converges in direction to \(\hat{w}/\|\hat{w}\|_2\).

First-Order Linearization

\(f_\theta(x) \approx f_{\theta_0}(x) + \nabla_\theta f_{\theta_0}(x)^\top (\theta - \theta_0)\)

Explanation: In wide networks with small updates, the model behaves like its first-order Taylor expansion around initialization. This underlies the NTK analysis.

Neural Tangent Kernel

\(\Theta(x, x') = \nabla_\theta f_{\theta_0}(x)^\top \nabla_\theta f_{\theta_0}(x')\)

Explanation: The NTK is the inner product of network gradients at initialization. Training dynamics reduce to kernel regression with this kernel in the lazy-training regime.
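To make the definition concrete, here is a minimal sketch that computes the empirical NTK Gram matrix of a tiny one-hidden-layer network \(f(x) = \tfrac{1}{\sqrt{m}}\sum_k a_k \tanh(w_k^\top x)\) at random initialization; the architecture, width, and random inputs are illustrative assumptions. In the lazy-training regime, this Gram matrix stays approximately constant during training.

#include <bits/stdc++.h>
using namespace std;

// Sketch: empirical NTK of a tiny one-hidden-layer network
//   f(x) = (1/sqrt(m)) * sum_k a_k * tanh(w_k . x)
// Width m, input dimension d, and the random inputs are illustrative assumptions.
int main(){
    mt19937_64 gen(3);
    normal_distribution<double> nd(0.0, 1.0);
    int d = 5, m = 512, n = 4;                           // input dim, width, #points

    // Random initialization theta_0 = (W, a)
    vector<vector<double>> W(m, vector<double>(d));
    vector<double> a(m);
    for(int k = 0; k < m; ++k){ a[k] = nd(gen); for(int j = 0; j < d; ++j) W[k][j] = nd(gen); }

    // A few random inputs
    vector<vector<double>> X(n, vector<double>(d));
    for(int i = 0; i < n; ++i) for(int j = 0; j < d; ++j) X[i][j] = nd(gen);

    // Gradient of f w.r.t. all parameters at theta_0, flattened into one vector
    auto grad = [&](const vector<double>& x){
        vector<double> g; g.reserve(m + m * d);
        double s = 1.0 / sqrt((double)m);
        for(int k = 0; k < m; ++k){
            double z = 0; for(int j = 0; j < d; ++j) z += W[k][j] * x[j];
            double t = tanh(z), dt = 1.0 - t * t;
            g.push_back(s * t);                                        // d f / d a_k
            for(int j = 0; j < d; ++j) g.push_back(s * a[k] * dt * x[j]); // d f / d W_kj
        }
        return g;
    };

    // Empirical NTK Gram matrix: Theta(x_i, x_j) = <grad f(x_i), grad f(x_j)>
    cout << fixed << setprecision(4);
    vector<vector<double>> G(n);
    for(int i = 0; i < n; ++i) G[i] = grad(X[i]);
    for(int i = 0; i < n; ++i){
        for(int j = 0; j < n; ++j){
            double s = 0; for(size_t p = 0; p < G[i].size(); ++p) s += G[i][p] * G[j][p];
            cout << setw(9) << s << (j + 1 == n ? '\n' : ' ');
        }
    }
    return 0;
}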

PAC-Bayes Bound (One Form)

\(\mathbb{E}_{w \sim Q}[R(w)] \;\le\; \mathbb{E}_{w \sim Q}[\hat{R}(w)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(n/\delta)}{2(n-1)}}\)

Explanation: For any prior P and posterior Q over parameters, the expected true risk is bounded by the expected empirical risk plus a term depending on the KL divergence and sample size. Choosing Q concentrated in flat regions yields smaller KL and tighter bounds.
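As a small numeric illustration of how such a bound is evaluated, the sketch below plugs made-up values for the empirical risk, KL divergence, sample size, and confidence level into the form above; these numbers are assumptions chosen only to show the scale of the complexity term.

#include <bits/stdc++.h>
using namespace std;

// Sketch: plugging numbers into the PAC-Bayes form above:
//   E_Q[R] <= E_Q[R_hat] + sqrt( (KL(Q||P) + ln(n/delta)) / (2(n-1)) )
// The empirical risk, KL divergence, n, and delta below are illustrative assumptions.
int main(){
    double emp_risk = 0.05;    // expected empirical risk under the posterior Q
    double kl       = 25.0;    // KL(Q || P); smaller for "flatter" posteriors
    double n        = 50000;   // number of training samples
    double delta    = 0.05;    // confidence parameter

    double slack = sqrt((kl + log(n / delta)) / (2.0 * (n - 1.0)));
    cout << fixed << setprecision(6);
    cout << "complexity term = " << slack << "\n";
    cout << "risk bound      = " << emp_risk + slack << "\n"; // roughly 0.05 + 0.02 here
    return 0;
}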

Quadratic Approximation and Sharpness

\(L(\theta^{*} + \delta) \approx L(\theta^{*}) + \tfrac{1}{2}\,\delta^\top H\, \delta, \qquad H = \nabla^2_\theta L(\theta^{*})\)

Explanation: A Taylor expansion of the loss around a minimum (where the gradient vanishes) shows how curvature (Hessian H) affects sensitivity to perturbations. Flatter minima have smaller eigenvalues of H, implying robustness.
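In practice, sharpness is often summarized by the largest Hessian eigenvalue. The sketch below estimates it by power iteration on a toy quadratic loss with a hand-built Hessian; the matrix and dimensions are assumptions for illustration, and for a real network one would use Hessian-vector products from automatic differentiation or finite differences of gradients instead.

#include <bits/stdc++.h>
using namespace std;

// Sketch: estimating the top Hessian eigenvalue (a sharpness proxy) by power iteration.
// The toy Hessian H = B^T B is an illustrative assumption with exact Hessian-vector products.
int main(){
    mt19937_64 gen(5);
    normal_distribution<double> nd(0.0, 1.0);
    int d = 8;

    // Build a symmetric positive semi-definite H = B^T B
    vector<vector<double>> B(d, vector<double>(d)), H(d, vector<double>(d, 0.0));
    for(auto& row : B) for(double& x : row) x = nd(gen);
    for(int i = 0; i < d; ++i)
        for(int j = 0; j < d; ++j)
            for(int k = 0; k < d; ++k) H[i][j] += B[k][i] * B[k][j];

    auto Hv = [&](const vector<double>& v){              // Hessian-vector product
        vector<double> out(d, 0.0);
        for(int i = 0; i < d; ++i) for(int j = 0; j < d; ++j) out[i] += H[i][j] * v[j];
        return out;
    };

    // Power iteration: v <- Hv / ||Hv|| converges to the top eigenvector of H
    vector<double> v(d); for(double& x : v) x = nd(gen);
    double lambda = 0.0;
    for(int it = 0; it < 200; ++it){
        vector<double> w = Hv(v);
        double nrm = 0; for(double x : w) nrm += x * x; nrm = sqrt(nrm);
        for(int i = 0; i < d; ++i) v[i] = w[i] / nrm;
        lambda = nrm;                                     // ||Hv|| -> lambda_max once v converges
    }
    cout << "estimated lambda_max(H) = " << lambda << "\n";
    return 0;
}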

Double Descent Heuristic

Explanation: When the number of features/parameters m crosses the sample size n, test error can peak near m ≈ n and then decrease again for m ≫ n, especially with tiny or no regularization.

Mutual Information

\(I(X;T) = \mathbb{E}_{p(x,t)}\!\left[\log \frac{p(x,t)}{p(x)\,p(t)}\right], \qquad \text{IB objective: } \min\; I(X;T) - \beta\, I(T;Y)\)

Explanation: Information bottleneck views study how hidden representations T retain label-relevant information while discarding input details. Reductions in I(X;T) can correlate with better generalization in some settings.

Complexity Analysis

The code examples use first-order methods and closed-form solvers for linear or kernelized models, reflecting the computational trade-offs common in generalization studies. For overparameterized linear regression trained by gradient descent (Example 1), each epoch costs O(n d) time for matrix–vector products, with O(d) memory for parameters and O(n d) to store the design matrix X. Comparing to the minimum-norm pseudoinverse uses the n×n Gram matrix G = X X^T; forming G costs O(n^2 d), inverting G by Gaussian elimination is O(n^3), and computing w* = X^T G^{-1} y is O(n d), with O(n^2) extra space for G and its inverse. This is efficient when n ≪ d, the typical overparameterized regime. For logistic regression on separable data (Example 2), one gradient step requires O(n d) time and O(d) memory; the number of steps to reach a target margin depends on the margin and step size. While the weight norm diverges, monitoring the normalized margin adds negligible cost. Evaluation on a test set of size n_test adds O(n_test d). Random feature regression (Example 3) constructs Z ∈ R^{n×m} in O(n m d_in) time, then solves (Z^T Z + λI) w = Z^T y. Forming Z^T Z is O(n m^2), and solving via Gaussian elimination is O(m^3) time and O(m^2) memory. When m ≈ n (the interpolation threshold), costs scale cubically in m; for m ≫ n, iterative solvers or Woodbury identities can reduce cost. These complexities illustrate that kernel/linearized analyses are computationally tractable in the overparameterized but wide-feature regime, while deep nonlinear feature learning typically requires more expensive backpropagation not shown here.

Code Examples

Implicit bias to minimum-norm: Gradient descent vs. pseudoinverse in overparameterized linear regression
#include <bits/stdc++.h>
using namespace std;

// Utility: random number generator
struct RNG {
    mt19937_64 gen;
    normal_distribution<double> nd;
    uniform_real_distribution<double> ud;
    RNG(unsigned seed=42): gen(seed), nd(0.0,1.0), ud(0.0,1.0) {}
    double gauss() { return nd(gen); }
    double uni() { return ud(gen); }
};

using Vec = vector<double>;
using Mat = vector<Vec>;

Mat zeros(int r, int c){ return Mat(r, Vec(c, 0.0)); }
Vec zeros_vec(int n){ return Vec(n, 0.0); }

Mat transpose(const Mat& A){
    int r=A.size(), c=A[0].size();
    Mat AT(c, Vec(r));
    for(int i=0;i<r;++i) for(int j=0;j<c;++j) AT[j][i]=A[i][j];
    return AT;
}

Vec matvec(const Mat& A, const Vec& x){
    int r=A.size(), c=A[0].size();
    Vec y(r,0.0);
    for(int i=0;i<r;++i){
        double s=0.0; for(int j=0;j<c;++j) s+=A[i][j]*x[j]; y[i]=s;
    }
    return y;
}

Mat matmul(const Mat& A, const Mat& B){
    int r=A.size(), k=A[0].size(), c=B[0].size();
    Mat C(r, Vec(c, 0.0));
    for(int i=0;i<r;++i){
        for(int t=0;t<k;++t){
            double a = A[i][t];
            if (a==0) continue;
            for(int j=0;j<c;++j) C[i][j] += a * B[t][j];
        }
    }
    return C;
}

Vec add(const Vec& a, const Vec& b){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]+b[i]; return c; }
Vec sub(const Vec& a, const Vec& b){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]-b[i]; return c; }
Vec scale(const Vec& a, double s){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]*s; return c; }
double dot(const Vec& a, const Vec& b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
double norm2(const Vec& a){ return sqrt(max(0.0, dot(a,a))); }

// Gaussian elimination with partial pivoting to invert a small matrix
bool invertMatrix(Mat A, Mat &Ainv){
    int n = A.size();
    Ainv = zeros(n,n);
    for(int i=0;i<n;++i) Ainv[i][i]=1.0;
    for(int col=0; col<n; ++col){
        // pivot
        int piv = col;
        for(int r=col+1;r<n;++r) if (fabs(A[r][col])>fabs(A[piv][col])) piv=r;
        if (fabs(A[piv][col]) < 1e-12) return false; // singular
        if (piv!=col){ swap(A[piv],A[col]); swap(Ainv[piv],Ainv[col]); }
        double diag = A[col][col];
        for(int j=0;j<n;++j){ A[col][j]/=diag; Ainv[col][j]/=diag; }
        for(int r=0;r<n;++r){
            if (r==col) continue;
            double f=A[r][col];
            if (fabs(f)<1e-18) continue;
            for(int j=0;j<n;++j){ A[r][j]-=f*A[col][j]; Ainv[r][j]-=f*Ainv[col][j]; }
        }
    }
    return true;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    int n = 30;   // samples
    int d = 100;  // features (overparameterized: d >> n)
    // Generate random design X (n x d)
    Mat X(n, Vec(d));
    for(int i=0;i<n;++i) for(int j=0;j<d;++j) X[i][j] = rng.gauss()/sqrt(d);

    // Create a ground-truth sparse w_true and noiseless targets y
    Vec w_true(d, 0.0);
    for(int j=0;j<5;++j){ int idx = j*(d/5); w_true[idx] = (j%2? -2.0: 2.0); }
    Vec y = matvec(X, w_true); // perfect fit possible

    // Gradient descent from zero on squared loss: minimize (1/2n)||X w - y||^2
    Vec w(d, 0.0);
    double lr = 1.0;   // step size; safe for our scaled X
    int epochs = 2000;
    Mat XT = transpose(X); // computed once; reused for gradients and the pseudoinverse below
    for(int e=0;e<epochs;++e){
        Vec r = sub(matvec(X, w), y); // residual (n)
        // gradient: (1/n) X^T (X w - y)
        Vec g(d, 0.0);
        for(int j=0;j<d;++j){ g[j] = 0.0; for(int i=0;i<n;++i) g[j] += XT[j][i] * r[i]; g[j] /= n; }
        // update
        for(int j=0;j<d;++j) w[j] -= lr * g[j];
        if ((e+1)%500==0){
            double mse=0; for(double ri: r) mse+=ri*ri; mse/=n;
            cerr << "epoch " << (e+1) << ": train MSE=" << mse << ", ||w||2=" << norm2(w) << "\n";
        }
    }

    // Compute minimum-norm interpolant via pseudoinverse: w_mn = X^T (X X^T)^{-1} y
    Mat G = matmul(X, XT); // n x n Gram matrix
    Mat Ginv;
    bool ok = invertMatrix(G, Ginv);
    if(!ok){ cerr << "Gram matrix not invertible; try regenerating data.\n"; return 0; }
    // Compute alpha = Ginv * y
    Vec alpha(n,0.0);
    for(int i=0;i<n;++i){ double s=0; for(int j=0;j<n;++j) s += Ginv[i][j]*y[j]; alpha[i]=s; }
    // w_mn = X^T * alpha
    Vec w_mn(d,0.0);
    for(int j=0;j<d;++j){ double s=0; for(int i=0;i<n;++i) s += XT[j][i]*alpha[i]; w_mn[j]=s; }

    // Compare norms and training error
    Vec r_gd = sub(matvec(X, w), y);
    Vec r_mn = sub(matvec(X, w_mn), y);
    auto mse_of = [&](const Vec& r){ double s=0; for(double v: r) s+=v*v; return s/n; };
    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "GD:   MSE=" << mse_of(r_gd) << ", ||w||2=" << norm2(w) << "\n";
    cout << "PInv: MSE=" << mse_of(r_mn) << ", ||w_mn||2=" << norm2(w_mn) << "\n";
    cout << "||w - w_mn||2=" << norm2(sub(w, w_mn)) << "\n";

    return 0;
}

We construct an overparameterized linear regression problem (d ≫ n) with noiseless labels, run gradient descent from zero, and compare the learned weights to the minimum-norm interpolant computed via the pseudoinverse formula w* = X^T (X X^T)^{-1} y. In this regime, gradient descent converges to the minimum-ℓ2-norm interpolator among infinitely many solutions, empirically shown by near-identical norms and parameters.

Time: Per epoch O(n d) for gradient computation; total O(epochs × n d). Pseudoinverse via Gram inversion costs O(n^2 d + n^3 + n d).
Space: O(n d) to store X, O(d) for parameters, and O(n^2) for the Gram matrix and its inverse.

Implicit max-margin: Logistic regression on separable data grows margin without explicit regularization
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 g;
    normal_distribution<double> nd;
    RNG(unsigned s=1):g(s),nd(0,1){}
    double gauss(){ return nd(g); }
    double uni(){ return uniform_real_distribution<double>(0,1)(g); }
};

using Vec = vector<double>;
using Mat = vector<Vec>;

double dot(const Vec& a, const Vec& b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
Vec add(const Vec& a, const Vec& b){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]+b[i]; return c; }
Vec scale(const Vec& a, double s){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]*s; return c; }
double norm2(const Vec& a){ double s=0; for(double v:a) s+=v*v; return sqrt(max(0.0,s)); }

struct Dataset { Mat X; vector<int> y; };

Dataset make_separable(int n_per_class, int d, unsigned seed=7){
    RNG rng(seed);
    Mat X; vector<int> y;
    // Two Gaussians along the first coordinate; remaining coordinates are noise
    for(int i=0;i<n_per_class;i++){
        Vec v(d,0.0);
        v[0] = 3.0 + 0.5*rng.gauss();
        for(int j=1;j<d;++j) v[j] = 0.3*rng.gauss();
        X.push_back(v); y.push_back(+1);
    }
    for(int i=0;i<n_per_class;i++){
        Vec v(d,0.0);
        v[0] = -3.0 + 0.5*rng.gauss();
        for(int j=1;j<d;++j) v[j] = 0.3*rng.gauss();
        X.push_back(v); y.push_back(-1);
    }
    return {X,y};
}

double logistic_loss(const Vec& w, const Mat& X, const vector<int>& y){
    double L=0; int n=X.size();
    for(int i=0;i<n;++i){ double z = y[i]*dot(w, X[i]); L += log(1.0 + exp(-z)); }
    return L/n;
}

Vec logistic_grad(const Vec& w, const Mat& X, const vector<int>& y){
    int n=X.size(), d=X[0].size(); Vec g(d,0.0);
    for(int i=0;i<n;++i){
        double z = y[i]*dot(w, X[i]);
        double s = 1.0/(1.0 + exp(z)); // = sigmoid(-z)
        double coeff = -y[i]*s;        // gradient multiplier
        for(int j=0;j<d;++j) g[j] += coeff * X[i][j];
    }
    for(int j=0;j<d;++j) g[j] /= n;
    return g;
}

double min_normalized_margin(const Vec& w, const Mat& X, const vector<int>& y){
    double nw = norm2(w); if (nw==0) return 0.0;
    double m = 1e9; int n=X.size();
    for(int i=0;i<n;++i){ m = min(m, y[i]*dot(w, X[i])/nw); }
    return m;
}

pair<double,double> accuracy_and_margin(const Vec& w, const Mat& X, const vector<int>& y){
    int n=X.size(); int correct=0;
    for(int i=0;i<n;++i){ double s=dot(w,X[i]); if ((s>=0 && y[i]==+1) || (s<0 && y[i]==-1)) correct++; }
    double acc = (double)correct/n; double m = min_normalized_margin(w, X, y);
    return {acc,m};
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    int n_per_class = 100, d = 50;
    auto train = make_separable(n_per_class, d, 11);
    auto test  = make_separable(n_per_class, d, 12);

    Vec w(d,0.0);
    double lr0 = 1.0; int epochs=2000;
    for(int e=0;e<epochs;++e){
        Vec g = logistic_grad(w, train.X, train.y);
        // simple diminishing step size to stabilize
        double lr = lr0 / sqrt(1.0 + e*0.1);
        for(int j=0;j<d;++j) w[j] -= lr * g[j];
        if ((e+1)%400==0){
            auto [acc_tr, m_tr] = accuracy_and_margin(w, train.X, train.y);
            auto [acc_te, m_te] = accuracy_and_margin(w, test.X, test.y);
            double L = logistic_loss(w, train.X, train.y);
            cerr<<"epoch "<<(e+1)<<" loss="<<L<<" train_acc="<<acc_tr<<" test_acc="<<acc_te
                <<" norm="<<norm2(w)<<" min_norm_margin="<<m_tr<<"\n";
        }
    }

    auto [acc_tr, m_tr] = accuracy_and_margin(w, train.X, train.y);
    auto [acc_te, m_te] = accuracy_and_margin(w, test.X, test.y);
    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "Final train_acc=" << acc_tr << ", test_acc=" << acc_te
         << ", ||w||2=" << norm2(w) << ", min_norm_margin_train=" << m_tr << "\n";
    return 0;
}

We generate linearly separable data and train logistic regression with no explicit regularization from zero initialization. The training loss decreases toward zero, the weight norm grows, and the normalized margin increases—illustrating the implicit max-margin bias of gradient descent on separable data. Despite interpolation (zero loss in the limit), test accuracy is strong because the learned direction approaches the maximum-margin separator.

Time: Each epoch is O(n d) for gradient computation; total O(epochs × n d). Evaluation adds O(n_test d).
Space: O(d) for parameters and O(n d) to store the dataset.

Double descent via random feature regression (lazy training / NTK flavor)
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 g;
    normal_distribution<double> nd;
    uniform_real_distribution<double> ud;
    RNG(unsigned s=123):g(s),nd(0,1),ud(0,1){}
    double gauss(){ return nd(g); }
    double uni(){ return ud(g); }
};

using Vec = vector<double>;
using Mat = vector<Vec>;

Mat zeros(int r,int c){ return Mat(r, Vec(c,0.0)); }

// Solve linear system A x = b by Gaussian elimination with partial pivoting
bool solve_linear(Mat A, Vec b, Vec &x){
    int n=A.size();
    for(int i=0;i<n;++i){
        // pivot
        int piv=i; for(int r=i+1;r<n;++r) if (fabs(A[r][i])>fabs(A[piv][i])) piv=r;
        if (fabs(A[piv][i]) < 1e-12) return false;
        if (piv!=i){ swap(A[piv],A[i]); swap(b[piv],b[i]); }
        double diag = A[i][i];
        for(int j=i;j<n;++j) A[i][j]/=diag;
        b[i]/=diag;
        for(int r=0;r<n;++r){
            if (r==i) continue;
            double f=A[r][i];
            if (fabs(f)<1e-18) continue;
            for(int j=i;j<n;++j) A[r][j]-=f*A[i][j];
            b[r]-=f*b[i];
        }
    }
    x=b; return true;
}

struct Data { Mat X; Vec y; };

Data make_nonlinear_reg(int n, int d_in, unsigned seed=9){
    RNG rng(seed);
    Mat X(n, Vec(d_in)); Vec y(n);
    for(int i=0;i<n;++i){
        for(int j=0;j<d_in;++j) X[i][j] = rng.gauss();
        // Nonlinear target: y = sin(x0) + 0.5*cos(x1) + 0.2*x2^2 + noise
        y[i] = sin(X[i][0]) + 0.5*cos(X[i][1]) + 0.2*X[i][2]*X[i][2] + 0.1*rng.gauss();
    }
    return {X,y};
}

// Random Fourier Features: z = sqrt(2/m) * cos(W x + b)
Mat rff_features(const Mat& X, int m, RNG &rng){
    int n=X.size(), d_in=X[0].size();
    Mat Z(n, Vec(m,0.0));
    // Sample W ~ N(0, I) and b ~ Uniform(0, 2pi)
    vector<Vec> W(m, Vec(d_in)); Vec b(m);
    for(int k=0;k<m;++k){ for(int j=0;j<d_in;++j) W[k][j]=rng.gauss(); b[k]= 2*M_PI*rng.uni(); }
    double s = sqrt(2.0/m);
    for(int i=0;i<n;++i){
        for(int k=0;k<m;++k){ double t=0; for(int j=0;j<d_in;++j) t += W[k][j]*X[i][j]; Z[i][k] = s * cos(t + b[k]); }
    }
    return Z;
}

Vec ridge_regress(const Mat& Z, const Vec& y, double lambda){
    int n=Z.size(), m=Z[0].size();
    // Compute A = Z^T Z + lambda I (m x m), and rhs = Z^T y
    Mat A(m, Vec(m, 0.0)); Vec rhs(m, 0.0);
    for(int i=0;i<n;++i){
        for(int k=0;k<m;++k){ rhs[k] += Z[i][k]*y[i]; }
        for(int k=0;k<m;++k){ double zik = Z[i][k]; if (fabs(zik)<1e-18) continue; for(int l=0;l<m;++l) A[k][l] += zik * Z[i][l]; }
    }
    for(int k=0;k<m;++k) A[k][k] += lambda;
    Vec w;
    bool ok = solve_linear(A, rhs, w);
    if(!ok){ w.assign(m, 0.0); }
    return w;
}

Vec matvec(const Mat& A, const Vec& x){
    int r=A.size(), c=A[0].size(); Vec y(r,0.0);
    for(int i=0;i<r;++i){ double s=0; for(int j=0;j<c;++j) s+=A[i][j]*x[j]; y[i]=s; }
    return y;
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    int n_train=120, n_test=1000, d_in=10;
    auto train = make_nonlinear_reg(n_train, d_in, 21);
    auto test  = make_nonlinear_reg(n_test, d_in, 22);

    vector<int> Ms = {20, 60, 120, 200, 400, 800}; // vary #features; interpolation near m ~ n
    double lambda = 1e-6; // tiny ridge to stabilize

    cout.setf(std::ios::fixed); cout<<setprecision(6);
    for(int m: Ms){
        // Re-seed before each feature construction so train and test share the same W, b:
        // rff_features advances the generator, so a fresh RNG with the same seed is required.
        RNG rng_tr(100 + m);
        Mat Ztr = rff_features(train.X, m, rng_tr);
        RNG rng_te(100 + m);
        Mat Zte = rff_features(test.X, m, rng_te);
        Vec w = ridge_regress(Ztr, train.y, lambda);
        auto pred = matvec(Zte, w);
        // compute MSE
        auto mse = [&](const Vec& p, const Vec& y){ double s=0; for(size_t i=0;i<p.size();++i){ double e=p[i]-y[i]; s+=e*e; } return s/p.size(); };
        // Also compute train MSE
        auto pred_tr = matvec(Ztr, w);
        double mse_tr = mse(pred_tr, train.y);
        double mse_te = mse(pred, test.y);
        cout << "m="<<setw(4)<<m<<" train_MSE="<<mse_tr<<" test_MSE="<<mse_te<<"\n";
    }
    return 0;
}

We approximate a kernel method (lazy training flavor) with Random Fourier Features and solve ridge regression while varying the number of features m. As m sweeps through the interpolation threshold around m ≈ n, the printed test MSE typically rises then falls (double descent), while train MSE decreases. This mirrors modern observations in overparameterized models and highlights the stabilizing effect of slight ridge regularization.

Time: Feature construction O(n m d_in); forming Z^T Z is O(n m^2); solving the m×m system is O(m^3). Prediction is O(n_test m).
Space: O(n m) for the feature matrix and O(m^2) for the normal equations.