Deep Learning Generalization Theory
Key Points
- Deep learning generalization theory tries to explain why overparameterized networks can fit (interpolate) training data yet still perform well on new data.
- Classical capacity measures like VC dimension or basic Rademacher complexity are usually too loose for modern deep nets because they only count parameters, not the solutions optimization actually finds.
- Gradient descent with common losses has an implicit regularization: it prefers minimum-norm or max-margin solutions among all interpolators, which often generalize better.
- The flatness of minima connects to generalization via PAC-Bayes: broader (flatter) regions allow tighter bounds because small parameter noise does not change predictions much.
- In very wide networks, training behaves like kernel regression with the Neural Tangent Kernel (NTK), making generalization analyzable by kernel methods.
- Double descent describes test error that first decreases, then increases near the interpolation threshold, and decreases again as models become extremely overparameterized.
- Lottery ticket and sparsity results suggest that a small, well-initialized subnetwork often accounts for most of a trained network's generalization performance.
- Margins, compression, and information-theoretic views (information bottleneck) provide complementary insights, but the full picture is still an active research area.
Prerequisites
- Probability and Statistics — Generalization is defined in terms of expectations and concentration; PAC-Bayes and margin bounds rely on probabilistic reasoning.
- Linear Algebra — Minimum-norm solutions, pseudoinverses, kernels, and margins are linear-algebraic.
- Gradient Descent and Convex Optimization — Implicit regularization arises from optimization dynamics; understanding step sizes and convergence is essential.
- Statistical Learning Theory — Classical PAC, VC dimension, and Rademacher complexity provide baseline generalization concepts to contrast with modern results.
- Kernel Methods — The NTK regime reduces deep network training to kernel regression, requiring familiarity with Gram matrices and ridge regression.
Detailed Explanation
Overview
Deep Learning Generalization Theory seeks to answer a paradox: modern neural networks often have far more parameters than training examples and can perfectly fit the training data, yet they still predict well on unseen data. Traditional statistical learning theory suggests that large-capacity models should overfit unless regularized explicitly, but this prediction is at odds with everyday practice in deep learning. This field studies what actually controls generalization in modern regimes. Core ideas include implicit regularization—the observation that optimization methods like gradient descent favor particular low-complexity solutions even without explicit penalties. The geometry of the loss landscape, especially the flatness or sharpness of minima, appears linked to generalization and can be formalized by PAC-Bayes bounds. For very wide networks, the Neural Tangent Kernel (NTK) connects training dynamics to kernel methods, enabling precise predictions in the so-called lazy-training regime. Phenomena like double descent show that test error can improve again after the model reaches the capacity to interpolate, challenging classical bias-variance intuition. Complementary views—margin-based analyses, lottery ticket sparsity, and information bottleneck compression—provide additional, sometimes competing, explanations. Together, these ideas form a toolkit to reason about when and why deep models generalize, though no single theory completely explains all observations.
Intuition & Analogies
Imagine trying to fit a flexible sheet over a set of pegs on a board: with enough flexibility (many parameters), you can touch every peg exactly. Classical theory warns that a too-flexible sheet will wiggle wildly elsewhere, making poor predictions. Yet, in practice, deep nets often drape smoothly between pegs. Why? One reason is that the sheet isn’t moved arbitrarily—it’s pulled by consistent, gentle tugs (gradient descent) from a particular starting pose (initialization). These tugs preferentially settle the sheet into a smooth configuration that touches the pegs with as little bending as possible, even though many wiggly configurations exist. This is implicit regularization: the search procedure itself has a preference for simple shapes. Flat minima add another layer to the analogy: if the sheet rests on a broad, flat valley, slight nudges (like noise or perturbations) don’t change the fit much, suggesting robustness and better generalization. Sharp valleys, in contrast, change rapidly and may overfit quirks. In very wide networks, another simplification occurs: the sheet is so stiff around its initial pose that training is almost linear—behaving like using a fixed template (kernel) to interpolate, which is well understood mathematically (the NTK view). As models get larger, a surprising double descent can happen: the fit improves, gets worse right when you just barely can touch all pegs (interpolation threshold), and then improves again as the sheet becomes so flexible that it can find very smooth fits that still pass through all pegs. Finally, the lottery ticket idea says you might have a small piece of the sheet that was perfectly shaped from the start; training just finds and amplifies it.
Formal Definition
Fix a data distribution \(\mathcal{D}\) over input-label pairs and a loss \(\ell\). The population risk of a predictor \(f\) is \(R(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(f(x),y)]\), the empirical risk on a sample of size \(n\) is \(\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i),y_i)\), and the generalization gap is \(R(f)-\hat{R}(f)\). A model interpolates when its training error is zero; generalization theory asks which interpolating or near-interpolating solutions found by training keep \(R(f)\) small. The Key Formulas section below makes the main tools precise.
When to Use
Use these ideas to reason about model size, training procedures, and regularization choices in regimes where classical parameter-count arguments fail. If you are training extremely wide networks or using very small explicit regularization yet seeing good test performance, NTK and implicit bias perspectives explain stability and can inform learning-rate and initialization choices. When your data are (nearly) linearly separable in the learned representation, margin-based theory motivates using losses like logistic or exponential and monitoring margins, rather than only accuracy. PAC-Bayes and flat-minima reasoning are useful when you consider noise injection, stochastic weight averaging, or ensembling; they justify why averaging or finding broader basins can improve generalization. When scaling width far beyond sample size, double descent warns that performance can worsen near the interpolation threshold, suggesting either more overparameterization (counterintuitive but often effective), small ridge regularization, or early stopping. If compute is tight, lottery ticket insights suggest pruning or sparse training can preserve accuracy, especially when fine-tuning large models. Finally, compression and information bottleneck views guide architecture choices that encourage succinct internal representations—for example, bottleneck layers or dropout—to reduce overfitting when label noise is present.
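As a concrete illustration of the flat-minima reasoning behind noise injection and weight averaging, the following minimal sketch perturbs parameters with Gaussian noise and measures the average increase of a toy loss; the two quadratic "losses" and all sizes here are made-up stand-ins, not taken from any particular model or library.

#include <bits/stdc++.h>
using namespace std;

// Average training-loss increase when weights are perturbed by Gaussian noise.
// A small increase suggests a broad (flat) basin; a large one suggests a sharp basin.
double avg_loss_increase(const function<double(const vector<double>&)>& loss,
                         const vector<double>& theta, double sigma,
                         int trials, mt19937_64& gen) {
    normal_distribution<double> nd(0.0, sigma);
    double base = loss(theta), acc = 0.0;
    for (int t = 0; t < trials; ++t) {
        vector<double> pert = theta;
        for (double& p : pert) p += nd(gen);
        acc += loss(pert) - base;
    }
    return acc / trials;
}

int main() {
    mt19937_64 gen(0);
    // Two toy losses with the same minimizer (the origin) and the same minimum value (0),
    // but different curvature: "flat" vs. "sharp". Purely illustrative.
    auto flat  = [](const vector<double>& th) { double s = 0; for (double v : th) s += 0.05 * v * v; return s; };
    auto sharp = [](const vector<double>& th) { double s = 0; for (double v : th) s += 5.0  * v * v; return s; };

    vector<double> theta(20, 0.0); // both "trained" solutions sit at the origin
    double sigma = 0.1;
    cout << "flat basin : avg loss increase = " << avg_loss_increase(flat,  theta, sigma, 2000, gen) << "\n";
    cout << "sharp basin: avg loss increase = " << avg_loss_increase(sharp, theta, sigma, 2000, gen) << "\n";
    return 0;
}

The gap between the two printed numbers is exactly the kind of robustness signal that PAC-Bayes and weight-averaging arguments exploit.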
⚠️Common Mistakes
- Equating parameter count with overfitting: In deep learning, more parameters can improve generalization after the interpolation threshold; use validation curves or margins, not raw counts.
- Ignoring optimization’s role: Bounds on the entire hypothesis class are loose; what matters is the solution gradient descent finds. Misinterpreting theory by neglecting implicit bias leads to incorrect regularization strategies.
- Overreliance on flatness as a scalar: Flat vs. sharp depends on parameterization and scale; compare using scale-invariant measures (e.g., normalized sharpness) or PAC-Bayes KL, not raw Hessian norms (see the sketch after this list).
- Confusing separability with simplicity: Large margins indicate robustness, but separable data with small margins can still overfit; monitor normalized margins and calibration, not just zero training error.
- Treating NTK as universal: NTK applies best in very wide, small-learning-rate regimes. In feature-learning regimes (finite width, larger steps), linearization fails; expect representation drift and different generalization.
- Misreading double descent: The interpolation peak can be mitigated by tiny ridge regularization or data augmentation; do not conclude that all interpolating models generalize poorly.
- Ignoring label noise: Interpolation with noisy labels harms generalization; employ early stopping, small weight decay, or robust losses. Theories explaining benign interpolation often assume clean labels.
- Cherry-picking lottery tickets: Not every network has an easily findable winning ticket for every task; pruning schedules and initializations matter. Validate sparsity claims on held-out data.
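The sketch below illustrates the flatness pitfall from the list above: a one-hidden-layer ReLU network is rescaled layer-wise so that it computes exactly the same function, yet a naive perturbation-based sharpness probe reports a very different value. The network, data, and probe are toy constructions chosen only to demonstrate the scale dependence.

#include <bits/stdc++.h>
using namespace std;

// One-hidden-layer ReLU net on scalar inputs: f(x) = sum_k a[k] * max(0, w[k] * x).
double predict(const vector<double>& w, const vector<double>& a, double x) {
    double out = 0.0;
    for (size_t k = 0; k < w.size(); ++k) out += a[k] * max(0.0, w[k] * x);
    return out;
}

double mse(const vector<double>& w, const vector<double>& a,
           const vector<double>& X, const vector<double>& Y) {
    double s = 0.0;
    for (size_t i = 0; i < X.size(); ++i) { double e = predict(w, a, X[i]) - Y[i]; s += e * e; }
    return s / X.size();
}

// Average loss increase under Gaussian parameter noise of std sigma (a crude sharpness probe).
double sharpness_probe(vector<double> w, vector<double> a,
                       const vector<double>& X, const vector<double>& Y,
                       double sigma, int trials, mt19937_64& gen) {
    normal_distribution<double> nd(0.0, sigma);
    double base = mse(w, a, X, Y), acc = 0.0;
    for (int t = 0; t < trials; ++t) {
        vector<double> wp = w, ap = a;
        for (double& v : wp) v += nd(gen);
        for (double& v : ap) v += nd(gen);
        acc += mse(wp, ap, X, Y) - base;
    }
    return acc / trials;
}

int main() {
    mt19937_64 gen(1);
    normal_distribution<double> nd(0.0, 1.0);
    int m = 16; // hidden units (arbitrary toy size)
    vector<double> w(m), a(m);
    for (int k = 0; k < m; ++k) { w[k] = nd(gen); a[k] = nd(gen) / m; }

    // Small synthetic dataset labeled by the network itself, so it fits the data exactly.
    vector<double> X, Y;
    for (int i = 0; i < 50; ++i) { double x = nd(gen); X.push_back(x); Y.push_back(predict(w, a, x)); }

    // Rescale layers by alpha and 1/alpha: by positive homogeneity of ReLU the function is
    // unchanged, but the loss surface around the rescaled parameters looks much sharper.
    double alpha = 10.0;
    vector<double> w2 = w, a2 = a;
    for (int k = 0; k < m; ++k) { w2[k] *= alpha; a2[k] /= alpha; }

    double gap = 0;
    for (double x : X) gap = max(gap, fabs(predict(w, a, x) - predict(w2, a2, x)));
    cout << "same function? max output gap = " << gap << "\n";
    cout << "probe (original params): " << sharpness_probe(w,  a,  X, Y, 0.05, 2000, gen) << "\n";
    cout << "probe (rescaled params): " << sharpness_probe(w2, a2, X, Y, 0.05, 2000, gen) << "\n";
    return 0;
}

Identical predictions, very different "sharpness": this is why scale-invariant or PAC-Bayes-style comparisons are preferable to raw curvature numbers.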
Key Formulas
Population and Empirical Risk
\( R(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(f(x), y)], \qquad \hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i) \)
Explanation: Population risk is expected loss on the true distribution; empirical risk is average loss on the training set. Generalization is about making \(R(f)\) close to \(\hat{R}(f)\).
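A minimal numeric sketch of these two quantities, assuming a toy linear-Gaussian data generator; the "population" risk is approximated by a very large independent sample.

#include <bits/stdc++.h>
using namespace std;

// Toy data: y = 2x + noise. Predictor f(x) = w*x with w fit on a tiny training set.
// Empirical risk = average squared loss on the training set; population risk is
// approximated by a large fresh sample from the same generator.
int main() {
    mt19937_64 gen(7);
    normal_distribution<double> noise(0.0, 0.5), xs(0.0, 1.0);

    auto sample = [&](int n, vector<double>& X, vector<double>& Y) {
        X.resize(n); Y.resize(n);
        for (int i = 0; i < n; ++i) { X[i] = xs(gen); Y[i] = 2.0 * X[i] + noise(gen); }
    };

    vector<double> Xtr, Ytr, Xpop, Ypop;
    sample(10, Xtr, Ytr);        // small training set
    sample(200000, Xpop, Ypop);  // large sample standing in for the population

    // Least-squares slope on the training set: w = <x,y> / <x,x>.
    double sxy = 0, sxx = 0;
    for (int i = 0; i < (int)Xtr.size(); ++i) { sxy += Xtr[i] * Ytr[i]; sxx += Xtr[i] * Xtr[i]; }
    double w = sxy / sxx;

    auto risk = [&](const vector<double>& X, const vector<double>& Y) {
        double s = 0; for (size_t i = 0; i < X.size(); ++i) { double e = w * X[i] - Y[i]; s += e * e; }
        return s / X.size();
    };
    cout << "empirical risk R_hat(f)          = " << risk(Xtr, Ytr) << "\n";
    cout << "estimated population risk R(f)   = " << risk(Xpop, Ypop) << "\n";
    cout << "generalization gap (estimate)    = " << risk(Xpop, Ypop) - risk(Xtr, Ytr) << "\n";
    return 0;
}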
Rademacher Complexity Bound
\( R(f) \le \hat{R}(f) + 2\,\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}} \quad \text{for all } f \in \mathcal{F} \)
Explanation: With probability at least \(1-\delta\) (for losses bounded in \([0,1]\)), the population risk is bounded by empirical risk plus a complexity term and a concentration term. For deep nets, \(\mathfrak{R}_n(\mathcal{F})\) can be large, making the bound loose.
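For intuition about the complexity term, the following sketch Monte Carlo estimates the empirical Rademacher complexity of norm-bounded linear predictors on a random sample, using the closed-form supremum for that particular class; the dimensions and norm bound B are arbitrary choices.

#include <bits/stdc++.h>
using namespace std;

// Estimate of the empirical Rademacher complexity of { x -> w.x : ||w||_2 <= B } on a
// fixed sample x_1..x_n. For this class the inner supremum has a closed form:
//   sup_w (1/n) sum_i sigma_i w.x_i = (B/n) * || sum_i sigma_i x_i ||_2.
int main() {
    mt19937_64 gen(3);
    normal_distribution<double> nd(0.0, 1.0);
    bernoulli_distribution coin(0.5);

    int n = 200, d = 30;
    double B = 1.0;
    vector<vector<double>> X(n, vector<double>(d));
    for (auto& row : X) for (double& v : row) v = nd(gen);

    int draws = 5000;
    double acc = 0.0;
    for (int t = 0; t < draws; ++t) {
        vector<double> s(d, 0.0);
        for (int i = 0; i < n; ++i) {
            double sigma = coin(gen) ? 1.0 : -1.0;       // Rademacher sign
            for (int j = 0; j < d; ++j) s[j] += sigma * X[i][j];
        }
        double norm = 0.0;
        for (double v : s) norm += v * v;
        acc += (B / n) * sqrt(norm);                     // closed-form supremum for this sign draw
    }
    cout << "estimated empirical Rademacher complexity: " << acc / draws << "\n";
    return 0;
}

Rerunning with larger n shows the estimate shrinking roughly like 1/sqrt(n), which is what drives the complexity term in the bound for this simple class.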
Minimum-Norm Interpolant (Underdetermined)
\( w^{\star} = X^{\top}(X X^{\top})^{-1} y = \arg\min \{ \|w\|_2 : Xw = y \} \)
Explanation: For linear regression with more parameters than samples and full row rank, the minimum-\(\ell_2\)-norm solution that interpolates the data is given by this pseudoinverse expression. Gradient flow from zero initialization converges to this solution.
Max-Margin Classifier
\( w^{\star} = \arg\min_{w} \|w\|_2 \quad \text{s.t.} \quad y_i\, w^{\top} x_i \ge 1 \ \ \forall i \)
Explanation: Among all linear separators, the one with the largest margin has minimum \(\ell_2\) norm under the unit-margin constraint. Logistic/exponential-loss gradient descent on separable data converges in direction to \(w^{\star}/\|w^{\star}\|_2\).
First-Order Linearization
\( f(x;\theta) \approx f(x;\theta_0) + \nabla_{\theta} f(x;\theta_0)^{\top} (\theta - \theta_0) \)
Explanation: In wide networks with small updates, the model behaves like its first-order Taylor expansion around initialization. This underlies the NTK analysis.
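A small check of this expansion, assuming a toy one-hidden-layer tanh network with finite-difference gradients: the gap between the network and its linearization shrinks as the parameter step shrinks. In genuinely wide networks the key point is that training itself keeps the parameters close to initialization, so the linearization stays accurate throughout.

#include <bits/stdc++.h>
using namespace std;

// f(x; theta) for a one-hidden-layer tanh network with scalar input/output.
// theta packs [w (m), a (m)]: f(x) = (1/sqrt(m)) * sum_k a[k] * tanh(w[k] * x).
double f(const vector<double>& theta, double x) {
    int m = theta.size() / 2;
    double out = 0.0;
    for (int k = 0; k < m; ++k) out += theta[m + k] * tanh(theta[k] * x);
    return out / sqrt((double)m);
}

// Central finite-difference gradient of f with respect to theta at a fixed input x.
vector<double> grad(const vector<double>& theta, double x, double h = 1e-5) {
    vector<double> g(theta.size());
    for (size_t j = 0; j < theta.size(); ++j) {
        vector<double> tp = theta, tm = theta;
        tp[j] += h; tm[j] -= h;
        g[j] = (f(tp, x) - f(tm, x)) / (2 * h);
    }
    return g;
}

int main() {
    mt19937_64 gen(5);
    normal_distribution<double> nd(0.0, 1.0);
    int m = 64;
    vector<double> theta0(2 * m);
    for (double& v : theta0) v = nd(gen);

    double x = 0.7;
    vector<double> g = grad(theta0, x);

    // Compare f(theta0 + eps*dir) with the linearization f(theta0) + eps * g.dir for shrinking eps.
    vector<double> dir(2 * m);
    for (double& v : dir) v = nd(gen);
    for (double eps : {1.0, 0.1, 0.01}) {
        vector<double> theta = theta0;
        double lin = f(theta0, x);
        for (int j = 0; j < 2 * m; ++j) { theta[j] += eps * dir[j]; lin += g[j] * eps * dir[j]; }
        cout << "step " << eps << ": |f(theta) - linearization| = " << fabs(f(theta, x) - lin) << "\n";
    }
    return 0;
}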
Neural Tangent Kernel
\( K(x, x') = \nabla_{\theta} f(x;\theta_0)^{\top} \nabla_{\theta} f(x';\theta_0) \)
Explanation: The NTK is the inner product of network gradients at initialization. Training dynamics reduce to kernel regression with this kernel in the lazy-training regime.
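The following sketch computes the empirical (finite-width) NTK Gram matrix for the same kind of toy one-hidden-layer tanh network, using analytic per-parameter gradients; the width and input grid are arbitrary choices for illustration.

#include <bits/stdc++.h>
using namespace std;

// Empirical NTK of f(x) = (1/sqrt(m)) * sum_k a[k] * tanh(w[k] * x) at fixed (w, a):
// K(x, x') = grad_theta f(x) . grad_theta f(x'), with analytic per-parameter gradients.
double ntk(const vector<double>& w, const vector<double>& a, double x, double xp) {
    int m = w.size();
    double s = 0.0;
    for (int k = 0; k < m; ++k) {
        double t = tanh(w[k] * x), tp = tanh(w[k] * xp);
        double da  = t,  dap = tp;               // d f / d a_k (up to the 1/sqrt(m) factor)
        double dw  = a[k] * (1 - t  * t)  * x;   // d f / d w_k
        double dwp = a[k] * (1 - tp * tp) * xp;
        s += da * dap + dw * dwp;
    }
    return s / m;                                // the two 1/sqrt(m) factors combine into 1/m
}

int main() {
    mt19937_64 gen(9);
    normal_distribution<double> nd(0.0, 1.0);
    int m = 2000;                                // width; larger m makes the kernel concentrate
    vector<double> w(m), a(m);
    for (double& v : w) v = nd(gen);
    for (double& v : a) v = nd(gen);

    vector<double> xs = {-1.0, -0.5, 0.0, 0.5, 1.0};
    cout.setf(ios::fixed); cout << setprecision(4);
    cout << "empirical NTK Gram matrix at initialization:\n";
    for (double x : xs) {
        for (double xp : xs) cout << setw(9) << ntk(w, a, x, xp);
        cout << "\n";
    }
    // In the lazy-training regime, kernel (ridge) regression with this Gram matrix
    // predicts what gradient-descent training of the wide network would produce.
    return 0;
}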
PAC-Bayes Bound (One Form)
\( \mathbb{E}_{\theta\sim Q}[R(\theta)] \le \mathbb{E}_{\theta\sim Q}[\hat{R}(\theta)] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln(2\sqrt{n}/\delta)}{2(n-1)}} \)
Explanation: For any prior P and posterior Q over parameters, with probability at least \(1-\delta\), the expected true risk is bounded by the expected empirical risk plus a term depending on the KL divergence and sample size. Choosing Q concentrated in flat regions yields smaller KL and tighter bounds.
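A minimal numeric sketch of this bound, assuming isotropic Gaussian prior and posterior so that the KL term has a closed form; the dimension, sample size, and empirical risk below are placeholder numbers, not measurements.

#include <bits/stdc++.h>
using namespace std;

// KL( N(mu_q, s_q^2 I) || N(mu_p, s_p^2 I) ) for diagonal Gaussians of dimension d.
double kl_diag_gauss(const vector<double>& mu_q, double s_q,
                     const vector<double>& mu_p, double s_p) {
    double kl = 0.0;
    for (size_t j = 0; j < mu_q.size(); ++j) {
        double diff = mu_q[j] - mu_p[j];
        kl += log(s_p / s_q) + (s_q * s_q + diff * diff) / (2 * s_p * s_p) - 0.5;
    }
    return kl;
}

int main() {
    // Placeholder numbers: a 1000-parameter model, prior centered at initialization,
    // posterior centered at the trained weights. emp_risk is held fixed for simplicity;
    // in practice it also grows as Q broadens unless the minimum is flat, which is
    // exactly the flatness connection.
    int d = 1000, n = 50000;
    double delta = 0.05, emp_risk = 0.02;
    mt19937_64 gen(11);
    normal_distribution<double> nd(0.0, 0.1);

    vector<double> mu_p(d, 0.0), mu_q(d);
    for (int j = 0; j < d; ++j) mu_q[j] = nd(gen); // trained weights drifted a little from the prior mean

    for (double sigma : {0.01, 0.05, 0.1}) {       // broader posteriors give a smaller KL here
        double kl = kl_diag_gauss(mu_q, sigma, mu_p, 0.1);
        double slack = sqrt((kl + log(2.0 * sqrt((double)n) / delta)) / (2.0 * (n - 1)));
        cout << "posterior std " << sigma << ": KL = " << kl
             << ", bound on E_Q[R] <= " << emp_risk + slack << "\n";
    }
    return 0;
}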
Quadratic Approximation and Sharpness
\( L(\theta^{\star} + \varepsilon) \approx L(\theta^{\star}) + \tfrac{1}{2}\,\varepsilon^{\top} H \varepsilon, \qquad H = \nabla^{2} L(\theta^{\star}) \)
Explanation: A Taylor expansion of the loss around a minimum (where the gradient vanishes) shows how curvature (Hessian H) affects sensitivity to perturbations. Flatter minima have smaller eigenvalues of H, implying robustness.
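A short sketch of this quadratic picture on a toy Hessian: power iteration estimates \(\lambda_{\max}(H)\), and the worst-case increase \(\tfrac{1}{2}\rho^{2}\lambda_{\max}\) at perturbation radius \(\rho\) upper-bounds what random perturbations of that radius produce. The matrix here is a random positive semi-definite stand-in, not a real network Hessian.

#include <bits/stdc++.h>
using namespace std;
using Vec = vector<double>;
using Mat = vector<Vec>;

Vec matvec(const Mat& H, const Vec& v) {
    Vec y(H.size(), 0.0);
    for (size_t i = 0; i < H.size(); ++i)
        for (size_t j = 0; j < v.size(); ++j) y[i] += H[i][j] * v[j];
    return y;
}

// Power iteration for the largest eigenvalue of a symmetric PSD matrix H.
double lambda_max(const Mat& H, int iters = 200) {
    Vec v(H.size(), 1.0);
    double lam = 0.0;
    for (int t = 0; t < iters; ++t) {
        Vec Hv = matvec(H, v);
        double nrm = 0.0; for (double x : Hv) nrm += x * x; nrm = sqrt(nrm);
        for (double& x : Hv) x /= nrm;
        Vec HHv = matvec(H, Hv);
        lam = 0.0; for (size_t i = 0; i < v.size(); ++i) lam += Hv[i] * HHv[i]; // Rayleigh quotient
        v = Hv;
    }
    return lam;
}

int main() {
    // Toy symmetric PSD "Hessian": H = A^T A for a random A, so the quadratic model applies.
    mt19937_64 gen(13);
    normal_distribution<double> nd(0.0, 1.0);
    int d = 8;
    Mat A(d, Vec(d)), H(d, Vec(d, 0.0));
    for (auto& row : A) for (double& x : row) x = nd(gen);
    for (int i = 0; i < d; ++i)
        for (int j = 0; j < d; ++j)
            for (int k = 0; k < d; ++k) H[i][j] += A[k][i] * A[k][j];

    double lam = lambda_max(H);
    double rho = 0.1;
    double worst = 0.5 * rho * rho * lam, largest_seen = 0.0;
    for (int t = 0; t < 5000; ++t) {
        Vec eps(d); double nrm = 0.0;
        for (double& x : eps) { x = nd(gen); nrm += x * x; }
        nrm = sqrt(nrm);
        for (double& x : eps) x *= rho / nrm;            // scale perturbation to radius rho
        Vec He = matvec(H, eps);
        double inc = 0.0; for (int i = 0; i < d; ++i) inc += 0.5 * eps[i] * He[i];
        largest_seen = max(largest_seen, inc);
    }
    cout << "0.5*rho^2*lambda_max = " << worst << "  (upper bound on the quadratic increase)\n";
    cout << "largest increase over random perturbations = " << largest_seen << "\n";
    return 0;
}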
Double Descent Heuristic
Explanation: When the number of features/parameters m crosses the sample size n, test error can peak and then decrease again for \(m \gg n\), especially with tiny or no regularization.
Mutual Information
\( I(X;T) = \sum_{x,t} p(x,t)\,\log\frac{p(x,t)}{p(x)\,p(t)} \)
Explanation: Information bottleneck views study how hidden representations T retain label-relevant information while discarding input details. Reductions in I(X;T) can correlate with better generalization in some settings.
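A plug-in estimate of this quantity for discrete variables, on made-up data where a 1-bit "representation" T of X is corrupted with increasing noise; more corruption means smaller I(X;T).

#include <bits/stdc++.h>
using namespace std;

// Plug-in estimate of I(X;T) = sum_{x,t} p(x,t) * log( p(x,t) / (p(x) p(t)) )
// from a joint count table over discrete X and T.
double mutual_information(const vector<vector<long long>>& counts) {
    int nx = counts.size(), nt = counts[0].size();
    double total = 0;
    vector<double> px(nx, 0.0), pt(nt, 0.0);
    for (int i = 0; i < nx; ++i) for (int j = 0; j < nt; ++j) total += counts[i][j];
    for (int i = 0; i < nx; ++i) for (int j = 0; j < nt; ++j) { px[i] += counts[i][j] / total; pt[j] += counts[i][j] / total; }
    double mi = 0.0;
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < nt; ++j) {
            double pxt = counts[i][j] / total;
            if (pxt > 0) mi += pxt * log(pxt / (px[i] * pt[j]));
        }
    return mi; // in nats
}

int main() {
    // T is a noisy, compressed 1-bit summary of a 4-valued X: more noise means smaller I(X;T).
    mt19937_64 gen(17);
    uniform_int_distribution<int> xdist(0, 3);
    for (double flip : {0.0, 0.2, 0.5}) {
        bernoulli_distribution noise(flip);
        vector<vector<long long>> counts(4, vector<long long>(2, 0));
        for (int s = 0; s < 200000; ++s) {
            int x = xdist(gen);
            int t = (x >= 2) ? 1 : 0;           // deterministic "representation"
            if (noise(gen)) t = 1 - t;          // corrupt it with probability flip
            counts[x][t]++;
        }
        cout << "flip prob " << flip << ": I(X;T) ~= " << mutual_information(counts) << " nats\n";
    }
    return 0;
}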
Complexity Analysis
For the examples below: full-batch gradient descent on a linear model costs O(n·d) per epoch, so O(epochs·n·d) in total. The minimum-norm interpolant via the Gram matrix costs O(n²·d) to form X Xᵀ plus O(n³) to invert it. Ridge regression on m random features costs O(n·m²) to form ZᵀZ plus O(m³) to solve, so sweeping m toward and past n dominates the double-descent experiment. Memory is O(n·d), O(n²), and O(m²), respectively.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Utility: random number generator
struct RNG {
    mt19937_64 gen;
    normal_distribution<double> nd;
    uniform_real_distribution<double> ud;
    RNG(unsigned seed=42): gen(seed), nd(0.0,1.0), ud(0.0,1.0) {}
    double gauss() { return nd(gen); }
    double uni() { return ud(gen); }
};

using Vec = vector<double>;
using Mat = vector<Vec>;

Mat zeros(int r, int c){ return Mat(r, Vec(c, 0.0)); }
Vec zeros_vec(int n){ return Vec(n, 0.0); }

Mat transpose(const Mat& A){
    int r=A.size(), c=A[0].size();
    Mat AT(c, Vec(r));
    for(int i=0;i<r;++i) for(int j=0;j<c;++j) AT[j][i]=A[i][j];
    return AT;
}

Vec matvec(const Mat& A, const Vec& x){
    int r=A.size(), c=A[0].size();
    Vec y(r,0.0);
    for(int i=0;i<r;++i){
        double s=0.0; for(int j=0;j<c;++j) s+=A[i][j]*x[j]; y[i]=s;
    }
    return y;
}

Mat matmul(const Mat& A, const Mat& B){
    int r=A.size(), k=A[0].size(), c=B[0].size();
    Mat C(r, Vec(c, 0.0));
    for(int i=0;i<r;++i){
        for(int t=0;t<k;++t){
            double a = A[i][t];
            if (a==0) continue;
            for(int j=0;j<c;++j) C[i][j] += a * B[t][j];
        }
    }
    return C;
}

Vec add(const Vec& a, const Vec& b){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]+b[i]; return c; }
Vec sub(const Vec& a, const Vec& b){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]-b[i]; return c; }
Vec scale(const Vec& a, double s){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]*s; return c; }
double dot(const Vec& a, const Vec& b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
double norm2(const Vec& a){ return sqrt(max(0.0, dot(a,a))); }

// Gaussian elimination with partial pivoting to invert a small matrix
bool invertMatrix(Mat A, Mat &Ainv){
    int n = A.size();
    Ainv = zeros(n,n);
    for(int i=0;i<n;++i) Ainv[i][i]=1.0;
    for(int col=0; col<n; ++col){
        // pivot
        int piv = col;
        for(int r=col+1;r<n;++r) if (fabs(A[r][col])>fabs(A[piv][col])) piv=r;
        if (fabs(A[piv][col]) < 1e-12) return false; // singular
        if (piv!=col){ swap(A[piv],A[col]); swap(Ainv[piv],Ainv[col]); }
        double diag = A[col][col];
        for(int j=0;j<n;++j){ A[col][j]/=diag; Ainv[col][j]/=diag; }
        for(int r=0;r<n;++r){
            if (r==col) continue;
            double f=A[r][col];
            if (fabs(f)<1e-18) continue;
            for(int j=0;j<n;++j){ A[r][j]-=f*A[col][j]; Ainv[r][j]-=f*Ainv[col][j]; }
        }
    }
    return true;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    int n = 30;  // samples
    int d = 100; // features (overparameterized: d >> n)
    // Generate random design X (n x d)
    Mat X(n, Vec(d));
    for(int i=0;i<n;++i) for(int j=0;j<d;++j) X[i][j] = rng.gauss()/sqrt(d);

    // Create a ground-truth sparse w_true and noiseless targets y
    Vec w_true(d, 0.0);
    for(int j=0;j<5;++j){ int idx = j*(d/5); w_true[idx] = (j%2 ? -2.0 : 2.0); }
    Vec y = matvec(X, w_true); // perfect fit possible

    // Gradient descent from zero on squared loss: minimize (1/2n)||X w - y||^2
    Vec w(d, 0.0);
    double lr = 1.0;   // step size; safe for our scaled X
    int epochs = 2000;
    for(int e=0;e<epochs;++e){
        Vec r = sub(matvec(X, w), y); // residual (n)
        // gradient: (1/n) X^T (X w - y)
        Mat XT = transpose(X);
        Vec g(d, 0.0);
        for(int j=0;j<d;++j){ g[j] = 0.0; for(int i=0;i<n;++i) g[j] += XT[j][i] * r[i]; g[j] /= n; }
        // update
        for(int j=0;j<d;++j) w[j] -= lr * g[j];
        if ((e+1)%500==0){
            double mse=0; for(double ri: r) mse+=ri*ri; mse/=n;
            cerr << "epoch " << (e+1) << ": train MSE=" << mse << ", ||w||2=" << norm2(w) << "\n";
        }
    }

    // Compute minimum-norm interpolant via pseudoinverse: w_mn = X^T (X X^T)^{-1} y
    Mat XT = transpose(X);
    Mat G = matmul(X, XT); // n x n
    Mat Ginv;
    bool ok = invertMatrix(G, Ginv);
    if(!ok){ cerr << "Gram matrix not invertible; try regenerating data.\n"; return 0; }
    // Compute alpha = Ginv * y
    Vec alpha(n,0.0);
    for(int i=0;i<n;++i){ double s=0; for(int j=0;j<n;++j) s += Ginv[i][j]*y[j]; alpha[i]=s; }
    // w_mn = X^T * alpha
    Vec w_mn(d,0.0);
    for(int j=0;j<d;++j){ double s=0; for(int i=0;i<n;++i) s += XT[j][i]*alpha[i]; w_mn[j]=s; }

    // Compare norms and training error
    Vec r_gd = sub(matvec(X, w), y);
    Vec r_mn = sub(matvec(X, w_mn), y);
    auto mse_of = [&](const Vec& r){ double s=0; for(double v: r) s+=v*v; return s/n; };
    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "GD: MSE=" << mse_of(r_gd) << ", ||w||2=" << norm2(w) << "\n";
    cout << "PInv: MSE=" << mse_of(r_mn) << ", ||w_mn||2=" << norm2(w_mn) << "\n";
    cout << "||w - w_mn||2=" << norm2(sub(w, w_mn)) << "\n";

    return 0;
}
We construct an overparameterized linear regression problem (d ≫ n) with noiseless labels, run gradient descent from zero, and compare the learned weights to the minimum-norm interpolant computed via the pseudoinverse formula w* = X^T (X X^T)^{-1} y. In this regime, gradient descent converges to the minimum-ℓ2-norm interpolator among the infinitely many solutions, which the program confirms empirically: the two weight vectors have near-identical norms and a near-zero parameter-wise difference.
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 g;
    normal_distribution<double> nd;
    RNG(unsigned s=1):g(s),nd(0,1){}
    double gauss(){return nd(g);}
    double uni(){return uniform_real_distribution<double>(0,1)(g);}
};
using Vec = vector<double>; using Mat = vector<Vec>;

double dot(const Vec& a, const Vec& b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
Vec add(const Vec& a, const Vec& b){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]+b[i]; return c; }
Vec scale(const Vec& a, double s){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]*s; return c; }
double norm2(const Vec& a){ double s=0; for(double v:a) s+=v*v; return sqrt(max(0.0,s)); }

struct Dataset { Mat X; vector<int> y; };

Dataset make_separable(int n_per_class, int d, unsigned seed=7){
    RNG rng(seed);
    Mat X; vector<int> y;
    // Two Gaussians along first coordinate; rest are noise
    for(int i=0;i<n_per_class;i++){
        Vec v(d,0.0);
        v[0] = 3.0 + 0.5*rng.gauss();
        for(int j=1;j<d;++j) v[j] = 0.3*rng.gauss();
        X.push_back(v); y.push_back(+1);
    }
    for(int i=0;i<n_per_class;i++){
        Vec v(d,0.0);
        v[0] = -3.0 + 0.5*rng.gauss();
        for(int j=1;j<d;++j) v[j] = 0.3*rng.gauss();
        X.push_back(v); y.push_back(-1);
    }
    return {X,y};
}

double logistic_loss(const Vec& w, const Mat& X, const vector<int>& y){
    double L=0; int n=X.size();
    for(int i=0;i<n;++i){ double z = y[i]*dot(w, X[i]); L += log(1.0 + exp(-z)); }
    return L/n;
}

Vec logistic_grad(const Vec& w, const Mat& X, const vector<int>& y){
    int n=X.size(), d=X[0].size(); Vec g(d,0.0);
    for(int i=0;i<n;++i){
        double z = y[i]*dot(w, X[i]);
        double s = 1.0/(1.0 + exp(z)); // = sigmoid(-z)
        double coeff = -y[i]*s;        // gradient multiplier
        for(int j=0;j<d;++j) g[j] += coeff * X[i][j];
    }
    for(int j=0;j<d;++j) g[j] /= n;
    return g;
}

double min_normalized_margin(const Vec& w, const Mat& X, const vector<int>& y){
    double nw = norm2(w); if (nw==0) return 0.0;
    double m = 1e9; int n=X.size();
    for(int i=0;i<n;++i){ m = min(m, y[i]*dot(w, X[i])/nw); }
    return m;
}

pair<double,double> accuracy_and_margin(const Vec& w, const Mat& X, const vector<int>& y){
    int n=X.size(); int correct=0;
    for(int i=0;i<n;++i){ double s=dot(w,X[i]); if ((s>=0 && y[i]==+1) || (s<0 && y[i]==-1)) correct++; }
    double acc = (double)correct/n; double m = min_normalized_margin(w, X, y);
    return {acc,m};
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    int n_per_class = 100, d = 50;
    auto train = make_separable(n_per_class, d, 11);
    auto test  = make_separable(n_per_class, d, 12);

    Vec w(d,0.0);
    double lr0 = 1.0; int epochs=2000;
    for(int e=0;e<epochs;++e){
        Vec g = logistic_grad(w, train.X, train.y);
        // simple diminishing step size to stabilize
        double lr = lr0 / sqrt(1.0 + e*0.1);
        for(int j=0;j<d;++j) w[j] -= lr * g[j];
        if ((e+1)%400==0){
            auto [acc_tr, m_tr] = accuracy_and_margin(w, train.X, train.y);
            auto [acc_te, m_te] = accuracy_and_margin(w, test.X, test.y);
            double L = logistic_loss(w, train.X, train.y);
            cerr<<"epoch "<<(e+1)<<" loss="<<L<<" train_acc="<<acc_tr<<" test_acc="<<acc_te
                <<" norm="<<norm2(w)<<" min_norm_margin="<<m_tr<<"\n";
        }
    }

    auto [acc_tr, m_tr] = accuracy_and_margin(w, train.X, train.y);
    auto [acc_te, m_te] = accuracy_and_margin(w, test.X, test.y);
    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "Final train_acc=" << acc_tr << ", test_acc=" << acc_te
         << ", ||w||2=" << norm2(w) << ", min_norm_margin_train=" << m_tr << "\n";
    return 0;
}
We generate linearly separable data and train logistic regression with no explicit regularization from zero initialization. The training loss decreases toward zero, the weight norm grows, and the normalized margin increases—illustrating the implicit max-margin bias of gradient descent on separable data. Despite interpolation (zero loss in the limit), test accuracy is strong because the learned direction approaches the maximum-margin separator.
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 g;
    normal_distribution<double> nd;
    uniform_real_distribution<double> ud;
    RNG(unsigned s=123):g(s),nd(0,1),ud(0,1){}
    double gauss(){return nd(g);}
    double uni(){return ud(g);}
};
using Vec = vector<double>; using Mat = vector<Vec>;

Mat zeros(int r,int c){ return Mat(r, Vec(c,0.0)); }

// Solve linear system A x = b by Gaussian elimination with partial pivoting
bool solve_linear(Mat A, Vec b, Vec &x){
    int n=A.size();
    for(int i=0;i<n;++i){
        // pivot
        int piv=i; for(int r=i+1;r<n;++r) if (fabs(A[r][i])>fabs(A[piv][i])) piv=r;
        if (fabs(A[piv][i]) < 1e-12) return false;
        if (piv!=i){ swap(A[piv],A[i]); swap(b[piv],b[i]); }
        double diag = A[i][i];
        for(int j=i;j<n;++j) A[i][j]/=diag;
        b[i]/=diag;
        for(int r=0;r<n;++r){
            if (r==i) continue;
            double f=A[r][i];
            if (fabs(f)<1e-18) continue;
            for(int j=i;j<n;++j) A[r][j]-=f*A[i][j];
            b[r]-=f*b[i];
        }
    }
    x=b; return true;
}

struct Data { Mat X; Vec y; };

Data make_nonlinear_reg(int n, int d_in, unsigned seed=9){
    RNG rng(seed);
    Mat X(n, Vec(d_in)); Vec y(n);
    for(int i=0;i<n;++i){
        for(int j=0;j<d_in;++j) X[i][j] = rng.gauss();
        // Nonlinear target: y = sin(x0) + 0.5*cos(x1) + 0.2*x2^2 + noise
        y[i] = sin(X[i][0]) + 0.5*cos(X[i][1]) + 0.2*X[i][2]*X[i][2] + 0.1*rng.gauss();
    }
    return {X,y};
}

// Random Fourier Features: z = sqrt(2/m) * cos(W x + b)
Mat rff_features(const Mat& X, int m, RNG &rng){
    int n=X.size(), d_in=X[0].size();
    Mat Z(n, Vec(m,0.0));
    // Sample W ~ N(0, I) and b ~ Uniform(0, 2pi)
    vector<Vec> W(m, Vec(d_in)); Vec b(m);
    for(int k=0;k<m;++k){ for(int j=0;j<d_in;++j) W[k][j]=rng.gauss(); b[k]= 2*M_PI*rng.uni(); }
    double s = sqrt(2.0/m);
    for(int i=0;i<n;++i){
        for(int k=0;k<m;++k){ double t=0; for(int j=0;j<d_in;++j) t += W[k][j]*X[i][j]; Z[i][k] = s * cos(t + b[k]); }
    }
    return Z;
}

Vec ridge_regress(const Mat& Z, const Vec& y, double lambda){
    int n=Z.size(), m=Z[0].size();
    // Compute A = Z^T Z + lambda I (m x m), and rhs = Z^T y
    Mat A(m, Vec(m, 0.0)); Vec rhs(m, 0.0);
    for(int i=0;i<n;++i){
        for(int k=0;k<m;++k){ rhs[k] += Z[i][k]*y[i]; }
        for(int k=0;k<m;++k){ double zik = Z[i][k]; if (fabs(zik)<1e-18) continue; for(int l=0;l<m;++l) A[k][l] += zik * Z[i][l]; }
    }
    for(int k=0;k<m;++k) A[k][k] += lambda;
    Vec w;
    bool ok = solve_linear(A, rhs, w);
    if(!ok){ w.assign(m, 0.0); }
    return w;
}

Vec matvec(const Mat& A, const Vec& x){ int r=A.size(), c=A[0].size(); Vec y(r,0.0); for(int i=0;i<r;++i){ double s=0; for(int j=0;j<c;++j) s+=A[i][j]*x[j]; y[i]=s; } return y; }

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    int n_train=120, n_test=1000, d_in=10;
    auto train = make_nonlinear_reg(n_train, d_in, 21);
    auto test  = make_nonlinear_reg(n_test, d_in, 22);

    vector<int> Ms = {20, 60, 120, 200, 400, 800}; // vary #features; interpolation near m ~ n
    double lambda = 1e-6; // tiny ridge to stabilize

    cout.setf(std::ios::fixed); cout<<setprecision(6);
    for(int m: Ms){
        RNG rng(100 + m);
        Mat Ztr = rff_features(train.X, m, rng);
        RNG rng_te(100 + m);                        // re-seed so test features use the same W, b draws
        Mat Zte = rff_features(test.X, m, rng_te);
        Vec w = ridge_regress(Ztr, train.y, lambda);
        auto pred = matvec(Zte, w);
        // compute MSE
        auto mse = [&](const Vec& p, const Vec& y){ double s=0; for(size_t i=0;i<p.size();++i){ double e=p[i]-y[i]; s+=e*e; } return s/p.size(); };
        // Also compute train MSE
        auto pred_tr = matvec(Ztr, w);
        double mse_tr = mse(pred_tr, train.y);
        double mse_te = mse(pred, test.y);
        cout << "m="<<setw(4)<<m<<" train_MSE="<<mse_tr<<" test_MSE="<<mse_te<<"\n";
    }
    return 0;
}
We approximate a kernel method (lazy training flavor) with Random Fourier Features and solve ridge regression while varying the number of features m. As m sweeps through the interpolation threshold around m ≈ n, the printed test MSE typically rises then falls (double descent), while train MSE decreases. This mirrors modern observations in overparameterized models and highlights the stabilizing effect of slight ridge regularization.