Deep Learning Generalization Theory

Key Points

  • Deep learning generalization theory tries to explain why overparameterized networks can fit (interpolate) training data yet still perform well on new data.
  • Classical capacity measures like VC dimension or basic Rademacher complexity are usually too loose for modern deep nets because they only count parameters, not the solutions optimization actually finds.
  • Gradient descent with common losses has an implicit regularization: it prefers minimum-norm or max-margin solutions among all interpolators, which often generalize better.
  • The flatness of minima connects to generalization via PAC-Bayes: broader (flatter) regions allow tighter bounds because small parameter noise does not change predictions much.
  • In very wide networks, training behaves like kernel regression with the Neural Tangent Kernel (NTK), making generalization analyzable by kernel methods.
  • Double descent describes test error that first decreases, then increases near the interpolation threshold, and decreases again as models become extremely overparameterized.
  • Lottery ticket and sparsity results suggest a small, well-initialized subnetwork can often explain most of the learned generalization.
  • Margins, compression, and information-theoretic views (information bottleneck) provide complementary insights, but the full picture is still an active research area.

Prerequisites

  • Probability and Statistics: Generalization is defined in terms of expectations and concentration; PAC-Bayes and margin bounds rely on probabilistic reasoning.
  • Linear Algebra: Minimum-norm solutions, pseudoinverses, kernels, and margins are linear-algebraic.
  • Gradient Descent and Convex Optimization: Implicit regularization arises from optimization dynamics; understanding step sizes and convergence is essential.
  • Statistical Learning Theory: Classical PAC, VC dimension, and Rademacher complexity provide baseline generalization concepts to contrast with modern results.
  • Kernel Methods: The NTK regime reduces deep network training to kernel regression, requiring familiarity with Gram matrices and ridge regression.

Detailed Explanation

01 Overview

Deep Learning Generalization Theory seeks to answer a paradox: modern neural networks often have far more parameters than training examples and can perfectly fit the training data, yet they still predict well on unseen data. Traditional statistical learning theory suggests that large-capacity models should overfit unless regularized explicitly, but this prediction is at odds with everyday practice in deep learning. This field studies what actually controls generalization in modern regimes. Core ideas include implicit regularization—the observation that optimization methods like gradient descent favor particular low-complexity solutions even without explicit penalties. The geometry of the loss landscape, especially the flatness or sharpness of minima, appears linked to generalization and can be formalized by PAC-Bayes bounds. For very wide networks, the Neural Tangent Kernel (NTK) connects training dynamics to kernel methods, enabling precise predictions in the so-called lazy-training regime. Phenomena like double descent show that test error can improve again after the model reaches the capacity to interpolate, challenging classical bias-variance intuition. Complementary views—margin-based analyses, lottery ticket sparsity, and information bottleneck compression—provide additional, sometimes competing, explanations. Together, these ideas form a toolkit to reason about when and why deep models generalize, though no single theory completely explains all observations.

02 Intuition & Analogies

Imagine trying to fit a flexible sheet over a set of pegs on a board: with enough flexibility (many parameters), you can touch every peg exactly. Classical theory warns that a too-flexible sheet will wiggle wildly elsewhere, making poor predictions. Yet, in practice, deep nets often drape smoothly between pegs. Why? One reason is that the sheet isn’t moved arbitrarily—it’s pulled by consistent, gentle tugs (gradient descent) from a particular starting pose (initialization). These tugs preferentially settle the sheet into a smooth configuration that touches the pegs with as little bending as possible, even though many wiggly configurations exist. This is implicit regularization: the search procedure itself has a preference for simple shapes. Flat minima add another layer to the analogy: if the sheet rests on a broad, flat valley, slight nudges (like noise or perturbations) don’t change the fit much, suggesting robustness and better generalization. Sharp valleys, in contrast, change rapidly and may overfit quirks. In very wide networks, another simplification occurs: the sheet is so stiff around its initial pose that training is almost linear—behaving like using a fixed template (kernel) to interpolate, which is well understood mathematically (the NTK view). As models get larger, a surprising double descent can happen: the fit improves, gets worse right when you just barely can touch all pegs (interpolation threshold), and then improves again as the sheet becomes so flexible that it can find very smooth fits that still pass through all pegs. Finally, the lottery ticket idea says you might have a small piece of the sheet that was perfectly shaped from the start; training just finds and amplifies it.

03 Formal Definition

Generalization is the ability of a learned predictor f to achieve low population risk \(R(f)\) on the data distribution \(P\), given that it is trained to minimize empirical risk \(\hat{R}(f)\) on a finite sample: \(R(f) = \mathbb{E}_{(x,y)\sim P}[\ell(f(x), y)]\) and \(\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i)\). Classical learning theory controls the generalization gap \(R(f) - \hat{R}(f)\) via capacity measures like VC dimension or Rademacher complexity of the hypothesis class \(\mathcal{H}\), producing bounds of the form \(R(f) \le \hat{R}(f) + O(\mathfrak{R}_n(\mathcal{H}))\). In deep learning, the apparent capacity of \(\mathcal{H}\) is enormous, yet generalization occurs. Modern theories refine the effective hypothesis class by incorporating training dynamics and solution geometry. For squared loss in linear models with underdetermined systems, continuous-time gradient flow from zero initialization converges to the minimum-\(\ell_2\)-norm interpolant \(w^{*} = X^\top (X X^\top)^{-1} y\). For separable data and logistic loss, gradient descent's direction converges to the maximum-margin classifier, the minimizer of \(\|w\|_2\) subject to \(y_i\, w^\top x_i \ge 1\) for all \(i\), even without explicit regularization. PAC-Bayes bounds upper-bound the expected risk of a stochastic predictor by a term depending on the empirical risk and a KL divergence between a posterior over parameters and a prior, capturing flatness through local posteriors. In wide networks, the Neural Tangent Kernel linearizes training dynamics, yielding kernel regression with kernel \(\Theta(x,x') = \nabla_\theta f_{\theta_0}(x)^\top \nabla_\theta f_{\theta_0}(x')\), enabling precise generalization analysis under overparameterization.
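A minimal sketch of why gradient flow on squared loss selects the minimum-\(\ell_2\)-norm interpolant, assuming \(X \in \mathbb{R}^{n \times d}\) has full row rank and initialization \(w(0) = 0\) (a compressed derivation, not a full proof):

\[
\dot{w}(t) = -\nabla_w \tfrac{1}{2}\|Xw(t) - y\|_2^2 = -X^\top\big(Xw(t) - y\big)
\;\Longrightarrow\; w(t) \in \operatorname{row}(X) \text{ for all } t,
\]
\[
\text{so at convergence } w_\infty = X^\top \alpha \text{ with } X w_\infty = y
\;\Longrightarrow\; \alpha = (XX^\top)^{-1} y
\;\Longrightarrow\; w_\infty = X^\top (XX^\top)^{-1} y.
\]

Any other interpolant differs from \(w_\infty\) only by a component orthogonal to \(\operatorname{row}(X)\), which can only increase the norm, so \(w_\infty\) is the minimum-\(\ell_2\)-norm solution.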

04 When to Use

Use these ideas to reason about model size, training procedures, and regularization choices in regimes where classical parameter-count arguments fail. If you are training extremely wide networks or using very small explicit regularization yet seeing good test performance, NTK and implicit bias perspectives explain stability and can inform learning-rate and initialization choices. When your data are (nearly) linearly separable in the learned representation, margin-based theory motivates using losses like logistic or exponential and monitoring margins, rather than only accuracy. PAC-Bayes and flat-minima reasoning are useful when you consider noise injection, stochastic weight averaging, or ensembling; they justify why averaging or finding broader basins can improve generalization. When scaling width far beyond sample size, double descent warns that performance can worsen near the interpolation threshold, suggesting either more overparameterization (counterintuitive but often effective), small ridge regularization, or early stopping. If compute is tight, lottery ticket insights suggest pruning or sparse training can preserve accuracy, especially when fine-tuning large models. Finally, compression and information bottleneck views guide architecture choices that encourage succinct internal representations—for example, bottleneck layers or dropout—to reduce overfitting when label noise is present.
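As one concrete illustration of the weight-averaging idea mentioned above, the sketch below averages parameter snapshots from the tail of a noisy training run; the quadratic toy loss, noise level, and snapshot schedule are illustrative assumptions rather than any specific published recipe. Under gradient noise, the averaged iterate typically sits closer to the center of the basin than the last iterate, which is the intuition stochastic weight averaging exploits at larger scale.

#include <bits/stdc++.h>
using namespace std;

// Sketch: tail-averaging of weight iterates (SWA flavor) on a noisy quadratic loss.
// The loss L(w) = 0.5 * sum_j c_j * w_j^2 and the injected gradient noise are toy
// assumptions chosen only to make the effect of averaging visible.
int main(){
    mt19937_64 gen(7);
    normal_distribution<double> noise(0.0, 0.5);
    int d = 20, steps = 2000, tail_start = 1000;
    vector<double> c(d), w(d, 1.0), w_avg(d, 0.0);
    for(int j = 0; j < d; ++j) c[j] = 0.1 + j * 0.1;   // per-coordinate curvature

    auto loss = [&](const vector<double>& v){
        double L = 0; for(int j = 0; j < d; ++j) L += 0.5 * c[j] * v[j] * v[j]; return L;
    };

    double lr = 0.05; int n_avg = 0;
    for(int t = 0; t < steps; ++t){
        for(int j = 0; j < d; ++j){
            double g = c[j] * w[j] + noise(gen);         // noisy gradient
            w[j] -= lr * g;
        }
        if (t >= tail_start){                            // running mean of tail iterates
            ++n_avg;
            for(int j = 0; j < d; ++j) w_avg[j] += (w[j] - w_avg[j]) / n_avg;
        }
    }
    cout << fixed << setprecision(6);
    cout << "loss(final iterate)    = " << loss(w)     << "\n";
    cout << "loss(averaged iterate) = " << loss(w_avg) << "\n"; // typically smaller under noise
    return 0;
}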

⚠️Common Mistakes

  • Equating parameter count with overfitting: In deep learning, more parameters can improve generalization after the interpolation threshold; use validation curves or margins, not raw counts.
  • Ignoring optimization’s role: Bounds on the entire hypothesis class are loose; what matters is the solution gradient descent finds. Misinterpreting theory by neglecting implicit bias leads to incorrect regularization strategies.
  • Overreliance on flatness as a scalar: Flat vs. sharp depends on parameterization and scale; compare using scale-invariant measures (e.g., normalized sharpness) or PAC-Bayes KL, not raw Hessian norms.
  • Confusing separability with simplicity: Large margins indicate robustness, but separable data with small margins can still overfit; monitor normalized margins and calibration, not just zero training error.
  • Treating NTK as universal: NTK applies best in very wide, small-learning-rate regimes. In feature-learning regimes (finite width, larger steps), linearization fails; expect representation drift and different generalization.
  • Misreading double descent: The interpolation peak can be mitigated by tiny ridge regularization or data augmentation; do not conclude that all interpolating models generalize poorly.
  • Ignoring label noise: Interpolation with noisy labels harms generalization; employ early stopping, small weight decay, or robust losses. Theories explaining benign interpolation often assume clean labels.
  • Cherry-picking lottery tickets: Not every network has an easily findable winning ticket for every task; pruning schedules and initializations matter. Validate sparsity claims on held-out data.

Key Formulas

Population and Empirical Risk

\(R(f) = \mathbb{E}_{(x,y)\sim P}\big[\ell(f(x), y)\big], \qquad \hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i)\)

Explanation: Population risk is the expected loss on the true distribution; empirical risk is the average loss on the training set. Generalization is about making \(R(f)\) close to \(\hat{R}(f)\).

Rademacher Complexity Bound

\(R(f) \;\le\; \hat{R}(f) + 2\,\mathfrak{R}_n(\mathcal{H}) + O\!\left(\sqrt{\tfrac{\log(1/\delta)}{n}}\right)\)

Explanation: With probability at least \(1-\delta\), the population risk is bounded by the empirical risk plus a complexity term and a concentration term. For deep nets, \(\mathfrak{R}_n(\mathcal{H})\) can be large, making the bound loose.

Minimum-Norm Interpolant (Underdetermined)

\(w^{*} = X^\top (X X^\top)^{-1} y = \arg\min \{\, \|w\|_2 : Xw = y \,\}\)

Explanation: For linear regression with more parameters than samples and full row rank, the minimum-\(\ell_2\)-norm solution that interpolates the data is given by this pseudoinverse expression. Gradient flow from zero converges to this solution.

Max-Margin Classifier

\(\hat{w} = \arg\min_{w} \tfrac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i\, w^\top x_i \ge 1 \;\; \forall i\)

Explanation: Among all linear separators, the one with the largest margin has minimum \(\ell_2\) norm under the unit-margin constraint. Logistic/exponential-loss gradient descent on separable data converges in direction to \(\hat{w}/\|\hat{w}\|_2\).

First-Order Linearization

\(f_\theta(x) \approx f_{\theta_0}(x) + \nabla_\theta f_{\theta_0}(x)^\top (\theta - \theta_0)\)

Explanation: In wide networks with small updates, the model behaves like its first-order Taylor expansion around initialization. This underlies the NTK analysis.

Neural Tangent Kernel

\(\Theta(x, x') = \nabla_\theta f_{\theta_0}(x)^\top \nabla_\theta f_{\theta_0}(x')\)

Explanation: The NTK is the inner product of network gradients at initialization. Training dynamics reduce to kernel regression with this kernel in the lazy-training regime.
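To make the definition concrete, here is a minimal sketch that computes the empirical NTK Gram matrix of a tiny one-hidden-layer network \(f(x) = \tfrac{1}{\sqrt{m}}\sum_k a_k \tanh(w_k^\top x)\) at random initialization; the architecture, width, and random inputs are illustrative assumptions. In the lazy-training regime, this Gram matrix stays approximately constant during training.

#include <bits/stdc++.h>
using namespace std;

// Sketch: empirical NTK of a tiny one-hidden-layer network
//   f(x) = (1/sqrt(m)) * sum_k a_k * tanh(w_k . x)
// Width m, input dimension d, and the random inputs are illustrative assumptions.
int main(){
    mt19937_64 gen(3);
    normal_distribution<double> nd(0.0, 1.0);
    int d = 5, m = 512, n = 4;                           // input dim, width, #points

    // Random initialization theta_0 = (W, a)
    vector<vector<double>> W(m, vector<double>(d));
    vector<double> a(m);
    for(int k = 0; k < m; ++k){ a[k] = nd(gen); for(int j = 0; j < d; ++j) W[k][j] = nd(gen); }

    // A few random inputs
    vector<vector<double>> X(n, vector<double>(d));
    for(int i = 0; i < n; ++i) for(int j = 0; j < d; ++j) X[i][j] = nd(gen);

    // Gradient of f w.r.t. all parameters at theta_0, flattened into one vector
    auto grad = [&](const vector<double>& x){
        vector<double> g; g.reserve(m + m * d);
        double s = 1.0 / sqrt((double)m);
        for(int k = 0; k < m; ++k){
            double z = 0; for(int j = 0; j < d; ++j) z += W[k][j] * x[j];
            double t = tanh(z), dt = 1.0 - t * t;
            g.push_back(s * t);                                        // d f / d a_k
            for(int j = 0; j < d; ++j) g.push_back(s * a[k] * dt * x[j]); // d f / d W_kj
        }
        return g;
    };

    // Empirical NTK Gram matrix: Theta(x_i, x_j) = <grad f(x_i), grad f(x_j)>
    cout << fixed << setprecision(4);
    vector<vector<double>> G(n);
    for(int i = 0; i < n; ++i) G[i] = grad(X[i]);
    for(int i = 0; i < n; ++i){
        for(int j = 0; j < n; ++j){
            double s = 0; for(size_t p = 0; p < G[i].size(); ++p) s += G[i][p] * G[j][p];
            cout << setw(9) << s << (j + 1 == n ? '\n' : ' ');
        }
    }
    return 0;
}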

PAC-Bayes Bound (One Form)

\(\mathbb{E}_{w \sim Q}[R(w)] \;\le\; \mathbb{E}_{w \sim Q}[\hat{R}(w)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(n/\delta)}{2(n-1)}}\)

Explanation: For any prior P and posterior Q over parameters, the expected true risk is bounded by the expected empirical risk plus a term depending on the KL divergence and sample size. Choosing Q concentrated in flat regions yields smaller KL and tighter bounds.
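As a small numeric illustration of how such a bound is evaluated, the sketch below plugs made-up values for the empirical risk, KL divergence, sample size, and confidence level into the form above; these numbers are assumptions chosen only to show the scale of the complexity term.

#include <bits/stdc++.h>
using namespace std;

// Sketch: plugging numbers into the PAC-Bayes form above:
//   E_Q[R] <= E_Q[R_hat] + sqrt( (KL(Q||P) + ln(n/delta)) / (2(n-1)) )
// The empirical risk, KL divergence, n, and delta below are illustrative assumptions.
int main(){
    double emp_risk = 0.05;    // expected empirical risk under the posterior Q
    double kl       = 25.0;    // KL(Q || P); smaller for "flatter" posteriors
    double n        = 50000;   // number of training samples
    double delta    = 0.05;    // confidence parameter

    double slack = sqrt((kl + log(n / delta)) / (2.0 * (n - 1.0)));
    cout << fixed << setprecision(6);
    cout << "complexity term = " << slack << "\n";
    cout << "risk bound      = " << emp_risk + slack << "\n"; // roughly 0.05 + 0.02 here
    return 0;
}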

Quadratic Approximation and Sharpness

\(L(\theta^{*} + \delta) \approx L(\theta^{*}) + \tfrac{1}{2}\,\delta^\top H\, \delta, \qquad H = \nabla^2_\theta L(\theta^{*})\)

Explanation: A Taylor expansion of the loss around a minimum (where the gradient vanishes) shows how curvature (Hessian H) affects sensitivity to perturbations. Flatter minima have smaller eigenvalues of H, implying robustness.
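In practice, sharpness is often summarized by the largest Hessian eigenvalue. The sketch below estimates it by power iteration on a toy quadratic loss with a hand-built Hessian; the matrix and dimensions are assumptions for illustration, and for a real network one would use Hessian-vector products from automatic differentiation or finite differences of gradients instead.

#include <bits/stdc++.h>
using namespace std;

// Sketch: estimating the top Hessian eigenvalue (a sharpness proxy) by power iteration.
// The toy Hessian H = B^T B is an illustrative assumption with exact Hessian-vector products.
int main(){
    mt19937_64 gen(5);
    normal_distribution<double> nd(0.0, 1.0);
    int d = 8;

    // Build a symmetric positive semi-definite H = B^T B
    vector<vector<double>> B(d, vector<double>(d)), H(d, vector<double>(d, 0.0));
    for(auto& row : B) for(double& x : row) x = nd(gen);
    for(int i = 0; i < d; ++i)
        for(int j = 0; j < d; ++j)
            for(int k = 0; k < d; ++k) H[i][j] += B[k][i] * B[k][j];

    auto Hv = [&](const vector<double>& v){              // Hessian-vector product
        vector<double> out(d, 0.0);
        for(int i = 0; i < d; ++i) for(int j = 0; j < d; ++j) out[i] += H[i][j] * v[j];
        return out;
    };

    // Power iteration: v <- Hv / ||Hv|| converges to the top eigenvector of H
    vector<double> v(d); for(double& x : v) x = nd(gen);
    double lambda = 0.0;
    for(int it = 0; it < 200; ++it){
        vector<double> w = Hv(v);
        double nrm = 0; for(double x : w) nrm += x * x; nrm = sqrt(nrm);
        for(int i = 0; i < d; ++i) v[i] = w[i] / nrm;
        lambda = nrm;                                     // ||Hv|| -> lambda_max once v converges
    }
    cout << "estimated lambda_max(H) = " << lambda << "\n";
    return 0;
}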

Double Descent Heuristic

Explanation: When the number of features/parameters m crosses the sample size n, test error can peak near m ≈ n and then decrease again for m ≫ n, especially with tiny or no regularization.

Mutual Information

\(I(X;T) = \mathbb{E}_{p(x,t)}\!\left[\log \frac{p(x,t)}{p(x)\,p(t)}\right], \qquad \text{IB objective: } \min\; I(X;T) - \beta\, I(T;Y)\)

Explanation: Information bottleneck views study how hidden representations T retain label-relevant information while discarding input details. Reductions in I(X;T) can correlate with better generalization in some settings.

Complexity Analysis

The code examples use first-order methods and closed-form solvers for linear or kernelized models, reflecting the computational trade-offs common in generalization studies. For overparameterized linear regression trained by gradient descent (Example 1), each epoch costs O(n d) time for matrix–vector products, with O(d) memory for parameters and O(n d) to store the design matrix X. Comparing to the minimum-norm pseudoinverse uses the n×n Gram matrix G = X X^T; forming G costs O(n^2 d), inverting G by Gaussian elimination is O(n^3), and computing w* = X^T G^{-1} y is O(n d), with O(n^2) extra space for G and its inverse. This is efficient when n ≪ d, the typical overparameterized regime. For logistic regression on separable data (Example 2), one gradient step requires O(n d) time and O(d) memory; the number of steps to reach a target margin depends on the margin and step size. While the weight norm diverges, monitoring the normalized margin adds negligible cost. Evaluation on a test set of size n_test adds O(n_test d). Random feature regression (Example 3) constructs Z ∈ R^{n×m} in O(n m d_in) time, then solves (Z^T Z + λI) w = Z^T y. Forming Z^T Z is O(n m^2), and solving via Gaussian elimination is O(m^3) time and O(m^2) memory. When m ≈ n (the interpolation threshold), costs scale cubically in m; for m ≫ n, iterative solvers or Woodbury identities can reduce cost. These complexities illustrate that kernel/linearized analyses are computationally tractable in the overparameterized but wide-feature regime, while deep nonlinear feature learning typically requires more expensive backpropagation not shown here.

Code Examples

Implicit bias to minimum-norm: Gradient descent vs. pseudoinverse in overparameterized linear regression
#include <bits/stdc++.h>
using namespace std;

// Utility: random number generator
struct RNG {
    mt19937_64 gen;
    normal_distribution<double> nd;
    uniform_real_distribution<double> ud;
    RNG(unsigned seed=42): gen(seed), nd(0.0,1.0), ud(0.0,1.0) {}
    double gauss() { return nd(gen); }
    double uni() { return ud(gen); }
};

using Vec = vector<double>;
using Mat = vector<Vec>;

Mat zeros(int r, int c){ return Mat(r, Vec(c, 0.0)); }
Vec zeros_vec(int n){ return Vec(n, 0.0); }

Mat transpose(const Mat& A){
    int r=A.size(), c=A[0].size();
    Mat AT(c, Vec(r));
    for(int i=0;i<r;++i) for(int j=0;j<c;++j) AT[j][i]=A[i][j];
    return AT;
}

Vec matvec(const Mat& A, const Vec& x){
    int r=A.size(), c=A[0].size();
    Vec y(r,0.0);
    for(int i=0;i<r;++i){
        double s=0.0; for(int j=0;j<c;++j) s+=A[i][j]*x[j]; y[i]=s;
    }
    return y;
}

Mat matmul(const Mat& A, const Mat& B){
    int r=A.size(), k=A[0].size(), c=B[0].size();
    Mat C(r, Vec(c, 0.0));
    for(int i=0;i<r;++i){
        for(int t=0;t<k;++t){
            double a = A[i][t];
            if (a==0) continue;
            for(int j=0;j<c;++j) C[i][j] += a * B[t][j];
        }
    }
    return C;
}

Vec add(const Vec& a, const Vec& b){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]+b[i]; return c; }
Vec sub(const Vec& a, const Vec& b){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]-b[i]; return c; }
Vec scale(const Vec& a, double s){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]*s; return c; }
double dot(const Vec& a, const Vec& b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
double norm2(const Vec& a){ return sqrt(max(0.0, dot(a,a))); }

// Gaussian elimination with partial pivoting to invert a small matrix
bool invertMatrix(Mat A, Mat &Ainv){
    int n = A.size();
    Ainv = zeros(n,n);
    for(int i=0;i<n;++i) Ainv[i][i]=1.0;
    for(int col=0; col<n; ++col){
        // pivot
        int piv = col;
        for(int r=col+1;r<n;++r) if (fabs(A[r][col])>fabs(A[piv][col])) piv=r;
        if (fabs(A[piv][col]) < 1e-12) return false; // singular
        if (piv!=col){ swap(A[piv],A[col]); swap(Ainv[piv],Ainv[col]); }
        double diag = A[col][col];
        for(int j=0;j<n;++j){ A[col][j]/=diag; Ainv[col][j]/=diag; }
        for(int r=0;r<n;++r){
            if (r==col) continue;
            double f=A[r][col];
            if (fabs(f)<1e-18) continue;
            for(int j=0;j<n;++j){ A[r][j]-=f*A[col][j]; Ainv[r][j]-=f*Ainv[col][j]; }
        }
    }
    return true;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    int n = 30;   // samples
    int d = 100;  // features (overparameterized: d >> n)
    // Generate random design X (n x d)
    Mat X(n, Vec(d));
    for(int i=0;i<n;++i) for(int j=0;j<d;++j) X[i][j] = rng.gauss()/sqrt(d);

    // Create a ground-truth sparse w_true and noiseless targets y
    Vec w_true(d, 0.0);
    for(int j=0;j<5;++j){ int idx = j*(d/5); w_true[idx] = (j%2? -2.0: 2.0); }
    Vec y = matvec(X, w_true); // perfect fit possible

    // Gradient descent from zero on squared loss: minimize (1/2n)||X w - y||^2
    Vec w(d, 0.0);
    double lr = 1.0;   // step size; safe for our scaled X
    int epochs = 2000;
    Mat XT = transpose(X); // computed once; reused for gradients and the pseudoinverse below
    for(int e=0;e<epochs;++e){
        Vec r = sub(matvec(X, w), y); // residual (n)
        // gradient: (1/n) X^T (X w - y)
        Vec g(d, 0.0);
        for(int j=0;j<d;++j){ g[j] = 0.0; for(int i=0;i<n;++i) g[j] += XT[j][i] * r[i]; g[j] /= n; }
        // update
        for(int j=0;j<d;++j) w[j] -= lr * g[j];
        if ((e+1)%500==0){
            double mse=0; for(double ri: r) mse+=ri*ri; mse/=n;
            cerr << "epoch " << (e+1) << ": train MSE=" << mse << ", ||w||2=" << norm2(w) << "\n";
        }
    }

    // Compute minimum-norm interpolant via pseudoinverse: w_mn = X^T (X X^T)^{-1} y
    Mat G = matmul(X, XT); // n x n Gram matrix
    Mat Ginv;
    bool ok = invertMatrix(G, Ginv);
    if(!ok){ cerr << "Gram matrix not invertible; try regenerating data.\n"; return 0; }
    // Compute alpha = Ginv * y
    Vec alpha(n,0.0);
    for(int i=0;i<n;++i){ double s=0; for(int j=0;j<n;++j) s += Ginv[i][j]*y[j]; alpha[i]=s; }
    // w_mn = X^T * alpha
    Vec w_mn(d,0.0);
    for(int j=0;j<d;++j){ double s=0; for(int i=0;i<n;++i) s += XT[j][i]*alpha[i]; w_mn[j]=s; }

    // Compare norms and training error
    Vec r_gd = sub(matvec(X, w), y);
    Vec r_mn = sub(matvec(X, w_mn), y);
    auto mse_of = [&](const Vec& r){ double s=0; for(double v: r) s+=v*v; return s/n; };
    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "GD:   MSE=" << mse_of(r_gd) << ", ||w||2=" << norm2(w) << "\n";
    cout << "PInv: MSE=" << mse_of(r_mn) << ", ||w_mn||2=" << norm2(w_mn) << "\n";
    cout << "||w - w_mn||2=" << norm2(sub(w, w_mn)) << "\n";

    return 0;
}

We construct an overparameterized linear regression problem (d ≫ n) with noiseless labels, run gradient descent from zero, and compare the learned weights to the minimum-norm interpolant computed via the pseudoinverse formula w* = X^T (X X^T)^{-1} y. In this regime, gradient descent converges to the minimum-ℓ2-norm interpolator among infinitely many solutions, empirically shown by near-identical norms and parameters.

Time: Per epoch O(n d) for gradient computation; total O(epochs × n d). Pseudoinverse via Gram inversion costs O(n^2 d + n^3 + n d).
Space: O(n d) to store X, O(d) for parameters, and O(n^2) for the Gram matrix and its inverse.

Implicit max-margin: Logistic regression on separable data grows margin without explicit regularization
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 g;
    normal_distribution<double> nd;
    RNG(unsigned s=1):g(s),nd(0,1){}
    double gauss(){ return nd(g); }
    double uni(){ return uniform_real_distribution<double>(0,1)(g); }
};

using Vec = vector<double>;
using Mat = vector<Vec>;

double dot(const Vec& a, const Vec& b){ double s=0; for(size_t i=0;i<a.size();++i) s+=a[i]*b[i]; return s; }
Vec add(const Vec& a, const Vec& b){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]+b[i]; return c; }
Vec scale(const Vec& a, double s){ Vec c(a.size()); for(size_t i=0;i<a.size();++i) c[i]=a[i]*s; return c; }
double norm2(const Vec& a){ double s=0; for(double v:a) s+=v*v; return sqrt(max(0.0,s)); }

struct Dataset { Mat X; vector<int> y; };

Dataset make_separable(int n_per_class, int d, unsigned seed=7){
    RNG rng(seed);
    Mat X; vector<int> y;
    // Two Gaussians along the first coordinate; remaining coordinates are noise
    for(int i=0;i<n_per_class;i++){
        Vec v(d,0.0);
        v[0] = 3.0 + 0.5*rng.gauss();
        for(int j=1;j<d;++j) v[j] = 0.3*rng.gauss();
        X.push_back(v); y.push_back(+1);
    }
    for(int i=0;i<n_per_class;i++){
        Vec v(d,0.0);
        v[0] = -3.0 + 0.5*rng.gauss();
        for(int j=1;j<d;++j) v[j] = 0.3*rng.gauss();
        X.push_back(v); y.push_back(-1);
    }
    return {X,y};
}

double logistic_loss(const Vec& w, const Mat& X, const vector<int>& y){
    double L=0; int n=X.size();
    for(int i=0;i<n;++i){ double z = y[i]*dot(w, X[i]); L += log(1.0 + exp(-z)); }
    return L/n;
}

Vec logistic_grad(const Vec& w, const Mat& X, const vector<int>& y){
    int n=X.size(), d=X[0].size(); Vec g(d,0.0);
    for(int i=0;i<n;++i){
        double z = y[i]*dot(w, X[i]);
        double s = 1.0/(1.0 + exp(z)); // = sigmoid(-z)
        double coeff = -y[i]*s;        // gradient multiplier
        for(int j=0;j<d;++j) g[j] += coeff * X[i][j];
    }
    for(int j=0;j<d;++j) g[j] /= n;
    return g;
}

double min_normalized_margin(const Vec& w, const Mat& X, const vector<int>& y){
    double nw = norm2(w); if (nw==0) return 0.0;
    double m = 1e9; int n=X.size();
    for(int i=0;i<n;++i){ m = min(m, y[i]*dot(w, X[i])/nw); }
    return m;
}

pair<double,double> accuracy_and_margin(const Vec& w, const Mat& X, const vector<int>& y){
    int n=X.size(); int correct=0;
    for(int i=0;i<n;++i){ double s=dot(w,X[i]); if ((s>=0 && y[i]==+1) || (s<0 && y[i]==-1)) correct++; }
    double acc = (double)correct/n; double m = min_normalized_margin(w, X, y);
    return {acc,m};
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    int n_per_class = 100, d = 50;
    auto train = make_separable(n_per_class, d, 11);
    auto test  = make_separable(n_per_class, d, 12);

    Vec w(d,0.0);
    double lr0 = 1.0; int epochs=2000;
    for(int e=0;e<epochs;++e){
        Vec g = logistic_grad(w, train.X, train.y);
        // simple diminishing step size to stabilize
        double lr = lr0 / sqrt(1.0 + e*0.1);
        for(int j=0;j<d;++j) w[j] -= lr * g[j];
        if ((e+1)%400==0){
            auto [acc_tr, m_tr] = accuracy_and_margin(w, train.X, train.y);
            auto [acc_te, m_te] = accuracy_and_margin(w, test.X, test.y);
            double L = logistic_loss(w, train.X, train.y);
            cerr<<"epoch "<<(e+1)<<" loss="<<L<<" train_acc="<<acc_tr<<" test_acc="<<acc_te
                <<" norm="<<norm2(w)<<" min_norm_margin="<<m_tr<<"\n";
        }
    }

    auto [acc_tr, m_tr] = accuracy_and_margin(w, train.X, train.y);
    auto [acc_te, m_te] = accuracy_and_margin(w, test.X, test.y);
    cout.setf(std::ios::fixed); cout<<setprecision(6);
    cout << "Final train_acc=" << acc_tr << ", test_acc=" << acc_te
         << ", ||w||2=" << norm2(w) << ", min_norm_margin_train=" << m_tr << "\n";
    return 0;
}

We generate linearly separable data and train logistic regression with no explicit regularization from zero initialization. The training loss decreases toward zero, the weight norm grows, and the normalized margin increases—illustrating the implicit max-margin bias of gradient descent on separable data. Despite interpolation (zero loss in the limit), test accuracy is strong because the learned direction approaches the maximum-margin separator.

Time: Each epoch is O(n d) for gradient computation; total O(epochs × n d). Evaluation adds O(n_test d).
Space: O(d) for parameters and O(n d) to store the dataset.

Double descent via random feature regression (lazy training / NTK flavor)
#include <bits/stdc++.h>
using namespace std;

struct RNG {
    mt19937_64 g;
    normal_distribution<double> nd;
    uniform_real_distribution<double> ud;
    RNG(unsigned s=123):g(s),nd(0,1),ud(0,1){}
    double gauss(){ return nd(g); }
    double uni(){ return ud(g); }
};

using Vec = vector<double>;
using Mat = vector<Vec>;

Mat zeros(int r,int c){ return Mat(r, Vec(c,0.0)); }

// Solve linear system A x = b by Gaussian elimination with partial pivoting
bool solve_linear(Mat A, Vec b, Vec &x){
    int n=A.size();
    for(int i=0;i<n;++i){
        // pivot
        int piv=i; for(int r=i+1;r<n;++r) if (fabs(A[r][i])>fabs(A[piv][i])) piv=r;
        if (fabs(A[piv][i]) < 1e-12) return false;
        if (piv!=i){ swap(A[piv],A[i]); swap(b[piv],b[i]); }
        double diag = A[i][i];
        for(int j=i;j<n;++j) A[i][j]/=diag;
        b[i]/=diag;
        for(int r=0;r<n;++r){
            if (r==i) continue;
            double f=A[r][i];
            if (fabs(f)<1e-18) continue;
            for(int j=i;j<n;++j) A[r][j]-=f*A[i][j];
            b[r]-=f*b[i];
        }
    }
    x=b; return true;
}

struct Data { Mat X; Vec y; };

Data make_nonlinear_reg(int n, int d_in, unsigned seed=9){
    RNG rng(seed);
    Mat X(n, Vec(d_in)); Vec y(n);
    for(int i=0;i<n;++i){
        for(int j=0;j<d_in;++j) X[i][j] = rng.gauss();
        // Nonlinear target: y = sin(x0) + 0.5*cos(x1) + 0.2*x2^2 + noise
        y[i] = sin(X[i][0]) + 0.5*cos(X[i][1]) + 0.2*X[i][2]*X[i][2] + 0.1*rng.gauss();
    }
    return {X,y};
}

// Random Fourier Features: z = sqrt(2/m) * cos(W x + b)
Mat rff_features(const Mat& X, int m, RNG &rng){
    int n=X.size(), d_in=X[0].size();
    Mat Z(n, Vec(m,0.0));
    // Sample W ~ N(0, I) and b ~ Uniform(0, 2pi)
    vector<Vec> W(m, Vec(d_in)); Vec b(m);
    for(int k=0;k<m;++k){ for(int j=0;j<d_in;++j) W[k][j]=rng.gauss(); b[k]= 2*M_PI*rng.uni(); }
    double s = sqrt(2.0/m);
    for(int i=0;i<n;++i){
        for(int k=0;k<m;++k){ double t=0; for(int j=0;j<d_in;++j) t += W[k][j]*X[i][j]; Z[i][k] = s * cos(t + b[k]); }
    }
    return Z;
}

Vec ridge_regress(const Mat& Z, const Vec& y, double lambda){
    int n=Z.size(), m=Z[0].size();
    // Compute A = Z^T Z + lambda I (m x m), and rhs = Z^T y
    Mat A(m, Vec(m, 0.0)); Vec rhs(m, 0.0);
    for(int i=0;i<n;++i){
        for(int k=0;k<m;++k){ rhs[k] += Z[i][k]*y[i]; }
        for(int k=0;k<m;++k){ double zik = Z[i][k]; if (fabs(zik)<1e-18) continue; for(int l=0;l<m;++l) A[k][l] += zik * Z[i][l]; }
    }
    for(int k=0;k<m;++k) A[k][k] += lambda;
    Vec w;
    bool ok = solve_linear(A, rhs, w);
    if(!ok){ w.assign(m, 0.0); }
    return w;
}

Vec matvec(const Mat& A, const Vec& x){
    int r=A.size(), c=A[0].size(); Vec y(r,0.0);
    for(int i=0;i<r;++i){ double s=0; for(int j=0;j<c;++j) s+=A[i][j]*x[j]; y[i]=s; }
    return y;
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    int n_train=120, n_test=1000, d_in=10;
    auto train = make_nonlinear_reg(n_train, d_in, 21);
    auto test  = make_nonlinear_reg(n_test, d_in, 22);

    vector<int> Ms = {20, 60, 120, 200, 400, 800}; // vary #features; interpolation near m ~ n
    double lambda = 1e-6; // tiny ridge to stabilize

    cout.setf(std::ios::fixed); cout<<setprecision(6);
    for(int m: Ms){
        // Re-seed before each feature construction so train and test share the same W, b:
        // rff_features advances the generator, so a fresh RNG with the same seed is required.
        RNG rng_tr(100 + m);
        Mat Ztr = rff_features(train.X, m, rng_tr);
        RNG rng_te(100 + m);
        Mat Zte = rff_features(test.X, m, rng_te);
        Vec w = ridge_regress(Ztr, train.y, lambda);
        auto pred = matvec(Zte, w);
        // compute MSE
        auto mse = [&](const Vec& p, const Vec& y){ double s=0; for(size_t i=0;i<p.size();++i){ double e=p[i]-y[i]; s+=e*e; } return s/p.size(); };
        // Also compute train MSE
        auto pred_tr = matvec(Ztr, w);
        double mse_tr = mse(pred_tr, train.y);
        double mse_te = mse(pred, test.y);
        cout << "m="<<setw(4)<<m<<" train_MSE="<<mse_tr<<" test_MSE="<<mse_te<<"\n";
    }
    return 0;
}

We approximate a kernel method (lazy training flavor) with Random Fourier Features and solve ridge regression while varying the number of features m. As m sweeps through the interpolation threshold around m ≈ n, the printed test MSE typically rises then falls (double descent), while train MSE decreases. This mirrors modern observations in overparameterized models and highlights the stabilizing effect of slight ridge regularization.

Time: Feature construction O(n m d_in); forming Z^T Z is O(n m^2); solving the m×m system is O(m^3). Prediction is O(n_test m).
Space: O(n m) for the feature matrix and O(m^2) for the normal equations.