Theory · Intermediate

Double Descent Phenomenon

Key Points

  • Double descent describes how test error first follows the classic U-shape with increasing model complexity, spikes near the interpolation threshold, and then drops again in the highly overparameterized regime.
  • The interpolation threshold occurs when the model has just enough parameters to fit (interpolate) the training data, often when the number of parameters roughly equals the number of training samples.
  • In overparameterized settings, implicit or explicit regularization often selects a minimum-norm interpolating solution that can generalize surprisingly well (benign overfitting).
  • Linear models with random features and tiny ridge regularization often exhibit clear double descent as we increase the number of features through and beyond the sample size.
  • Numerical instability near the interpolation threshold can inflate test error and must be handled with care in implementations (e.g., by ridge or better solvers).
  • Double descent does not mean ‘more parameters is always better’; label noise, data distribution, and optimization bias all matter.
  • You can visualize double descent by sweeping feature count p from well below to well above the sample size n and plotting test MSE versus p.
  • Regularization strength controls the spike: very small ridge highlights the spike; moderate ridge smooths or removes it.

Prerequisites

  • Linear algebra (vectors, matrices, eigenvalues) — Understanding least-squares, normal equations, and conditioning is essential for interpreting interpolation and solver stability.
  • Probability and statistics — Test risk, noise, expectations, and generalization are probabilistic concepts.
  • Bias–variance trade-off — Double descent extends the classic U-shaped error curve; knowing the decomposition provides context.
  • Ordinary least squares and ridge regression — Closed-form solutions, the hat matrix, and regularization are core to reproducing double descent in linear models.
  • Numerical linear algebra — Near the interpolation threshold, matrices are ill-conditioned; stable solvers and regularization matter.
  • Feature engineering and kernels — Random Fourier features approximate kernels and provide controllable complexity by varying p.
  • Cross-validation and model selection — To choose regularization and detect the second descent reliably, you need sound validation practices.
  • Gradient-based optimization (optional) — Explains the implicit bias toward minimum-norm interpolants in overparameterized settings.

Detailed Explanation


01 Overview

Double descent is a modern view of generalization that extends the classic bias–variance story. Traditionally, as model complexity grows (e.g., polynomial degree increases), test error shows a U-shaped curve: it first decreases (lower bias) and then increases (higher variance/overfitting). Double descent observes that this is not the end of the story. When models are pushed past the ‘interpolation threshold’—the point where they can fit the training data nearly perfectly—the test error can drop again as we add even more parameters. This produces a second descent after a peak around the threshold, hence the name ‘double descent’.

The phenomenon has been observed across linear models with random features, kernel methods, and deep neural networks. The key idea is that in the overparameterized regime, learning algorithms often pick special solutions (like minimum-norm interpolants) that can generalize better than expected, especially when the features align with the signal and regularization—either explicit (e.g., ridge) or implicit (from optimization like gradient descent)—guides the solution.

Practical implications include why very large models (like modern deep nets) can generalize well, and how regularization strength shapes the error curve near the interpolation threshold. Understanding double descent helps you choose model sizes and regularization in high-capacity settings and avoid misinterpreting spikes in validation error that come from numerical and statistical instability near the threshold.

02 Intuition & Analogies

Imagine studying for a test with a small set of practice problems. With a very simple strategy (few rules), you might miss important patterns and do poorly—a high-bias situation. As you add more rules, you capture more structure and improve. If you keep adding rules until you can memorize every practice problem exactly, you reach the point where your ‘model’ can interpolate the study set. Right at that point, you’re brittle: a tiny change in a problem can break your answer, and your test performance can spike worse (variance explodes). Surprisingly, if you keep adding even more flexible tools—say, lots of ways to represent solutions—but you also consistently choose the ‘simplest’ explanation that still fits all practice problems (like preferring the shortest or smoothest solution), your performance on new problems can improve again. You have many ways to fit the data, but you pick the most conservative one that fits—this is the minimum-norm solution.

Another analogy: fitting a curve through points with rubber bands. With too few pegs (parameters), the band can’t reach the points (underfit). As you add pegs, the fit improves. Right when you have just enough pegs to touch every point, a small nudge in one peg makes the band wiggle wildly—instability causes a peak in error. Add a lot more pegs and choose the smoothest way to thread the band through all points; the extra freedom lets you distribute tension evenly, reducing wild wiggles and improving stability, so test error falls again. Regularization acts like adding friction so the band doesn’t oscillate too much, especially near the unstable threshold.

03 Formal Definition

Consider data (x_i, y_i), i = 1, …, n, drawn i.i.d. from a distribution P. Let f̂_m be an empirical risk minimizer from a hypothesis class F_m of complexity m (e.g., number of parameters p, VC dimension, or effective degrees of freedom). Define the test risk R(m) = E_{(x,y)∼P}[ℓ(f̂_m(x), y)] for a loss ℓ (e.g., squared loss). Classical theory predicts a U-shaped R(m) due to the bias–variance trade-off.

The interpolation threshold m⋆ is the smallest complexity for which there exists f ∈ F_m that interpolates the training data, i.e., achieves empirical risk R̂ = (1/n) Σ_i ℓ(f(x_i), y_i) = 0 (or near-zero with noise). Double descent is the empirical and theoretical observation that R(m) often exhibits: (i) an initial descent for m < m⋆; (ii) a peak (sometimes very sharp) near m⋆ due to high variance and numerical instability; and (iii) a second descent for m > m⋆, especially when the learning algorithm selects minimum-complexity interpolants (e.g., minimum Euclidean norm in linear models) through implicit or explicit regularization.

In linear regression with design matrix X ∈ R^{n×p}, the interpolation threshold often corresponds to p ≈ n and rank(X) = n; beyond this, the ridgeless solution w† = X†y (Moore–Penrose pseudoinverse) is the minimum-norm interpolant. When features are well aligned with the target and noise is not dominant, R(p) can decrease again for p ≫ n, forming the double descent curve.
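As a supplement (standard linear algebra, not spelled out in the article), the minimum-norm interpolant for p > n with rank(X) = n has a closed form via the SVD X = UΣV⊤:

```latex
% Minimum-norm interpolant in the overparameterized regime (p > n, rank(X) = n):
w^{\dagger} \;=\; X^{\dagger} y \;=\; V \Sigma^{\dagger} U^{\top} y \;=\; X^{\top} (X X^{\top})^{-1} y ,
\qquad X w^{\dagger} = y ,
\qquad \lVert w^{\dagger} \rVert_2 \;\le\; \lVert w \rVert_2 \quad \text{for every } w \text{ with } Xw = y .
```

The last inequality is what makes w† ‘special’ among the infinitely many interpolants: it is the one gradient descent from zero (and the pseudoinverse) selects.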

04 When to Use

  • Very high-capacity models: Deep neural networks, kernel machines, or linear models with many random features. Expect double descent when model size sweeps through the point where training error reaches (near) zero.
  • Feature expansions: Polynomial/radial/random Fourier features where you can vary the number of features p through and beyond the sample size n.
  • Studying regularization effects: To understand how tiny ridge (or implicit bias of gradient descent) influences generalization and stability near the interpolation threshold.
  • Risk diagnostics: When you observe a spike in validation error as you scale parameters, recognize it may be the interpolation peak rather than a monotonic failure of generalization.
  • Algorithm choice: If closed-form solvers are numerically unstable near p ≈ n, consider small ridge regularization, better-conditioned features, or iterative solvers (conjugate gradient) that implicitly bias toward minimum-norm solutions.
  • Model selection: Use cross-validation across both under- and overparameterized regimes; don’t stop your search at the first U-shape bottom—check if a second descent exists for larger models, especially with modern architectures.

⚠️ Common Mistakes

  • Assuming overparameterization always hurts: In many setups, test error drops again for p ≫ n due to implicit bias. Don’t conflate parameter count with poor generalization.
  • Ignoring label noise: With heavy noise, interpolating solutions can generalize poorly; the second descent may be shallow or absent. Always consider noise level when interpreting curves.
  • Misreading training error: Zero training error does not guarantee good or bad generalization; it only indicates you crossed the interpolation threshold.
  • Numerical artifacts mistaken for theory: Spikes can be amplified by ill-conditioned solvers. Use ridge, stable decompositions (QR/SVD/Cholesky with jitter), and consistent feature sampling when plotting curves.
  • Changing data across complexity: If you resample features or data at each complexity, randomness can obscure the shape. Fix a dataset and expand features consistently (prefix features) to reveal true trends.
  • Over-regularizing away the phenomenon: Large ridge can smooth out the spike and mask double descent. That’s useful practically, but know that it changes the qualitative curve.
  • Confusing implicit and explicit regularization: Gradient descent from zero in overparameterized linear models chooses the minimum-norm interpolant even without explicit ridge; this matters for interpreting results.

Key Formulas

Test Risk

R = E_{(x,y)∼P}[(y − f̂(x))²]

Explanation: This is the expected squared error of the predictor on new data. It captures generalization performance across the data distribution.

Empirical Risk (Squared Loss)

R̂ = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))²

Explanation: This is the average training error. When it reaches zero (or near zero), the model interpolates the training set.

Bias–Variance Decomposition

E[(y − f̂(x))²] = Bias² + Variance + σ²

Explanation: Prediction error decomposes into squared bias, variance, and irreducible noise. The classic U-shape arises from decreasing bias and increasing variance as complexity grows.

Ridge Solution

w_λ = (X⊤X + λI)⁻¹ X⊤y

Explanation: This is the closed-form solution for L2-regularized least squares. The λ I term stabilizes inversion and reduces variance, especially near the interpolation threshold.

Minimum-Norm Interpolant

w† = X†y

Explanation: Using the Moore–Penrose pseudoinverse, this gives the minimum Euclidean norm solution that interpolates the data when it is possible to fit exactly.

Hat Matrix (Ridge)

H_λ = X(X⊤X + λI)⁻¹ X⊤

Explanation: This matrix maps training targets y to fitted values ŷ = H_λ y. Its trace df(λ) = tr(H_λ) acts as an effective degree of freedom.

Effective Degrees of Freedom

df(λ) = tr(X(X⊤X + λI)⁻¹ X⊤)

Explanation: This scalar measures model flexibility as a function of λ. As λ increases, df(λ) decreases, smoothing the double descent spike.

Interpolation Threshold (Linear)

p⋆ ≈ n  (when rank(X) = n)

Explanation: For linear models with full row rank, the threshold often occurs when parameter count p meets sample size n, allowing exact fits.

Condition Number

κ(X⊤X) = σ_max²(X) / σ_min²(X)

Explanation: The ratio of the largest to smallest singular value squared indicates numerical instability. It tends to blow up near the interpolation threshold, amplifying variance.

Aspect Ratio

ϕ = lim_{n→∞} p/n

Explanation: In high-dimensional asymptotics, the feature-to-sample ratio ϕ controls spectral behavior and helps explain risk peaks around ϕ≈ 1.

Kernel Interpolant

f̂(x) = Σ_{i=1}^{n} α_i k(x, x_i)

Explanation: Kernel methods often interpolate with many implicit features. The coefficients α result from solving a linear system in the kernel matrix.

Normal Equations (Ridgeless)

X⊤(Xw−y)=0

Explanation: Stationary condition for least squares without explicit regularization. In overparameterized cases, gradient descent from zero converges to the minimum-norm solution that satisfies these equations.

Complexity Analysis

For the C++ implementations based on ridge-regularized linear regression with explicit feature maps (e.g., random Fourier features), training cost is dominated by forming X⊤X and solving a p×p linear system. Given n samples, p features, and dense arithmetic, building X costs O(np), forming X⊤X costs O(np²), and computing X⊤y costs O(np). Solving the linear system (X⊤X + λI)w = X⊤y via Cholesky decomposition costs O(p³). Thus the overall training time is O(np² + p³), and space usage is O(np + p²) to store X and X⊤X. When sweeping p across a range to visualize double descent, the runtime scales superlinearly with p due to the O(p³) solver. Prediction on n_test samples requires forming X_test (O(n_test · p)) and multiplying by w (O(n_test · p)).

Near the interpolation threshold (p ≈ n), X⊤X can be ill-conditioned; numerically, this increases sensitivity to floating-point errors, which may manifest as inflated test-error spikes. Adding a small λ (ridge) ensures positive definiteness and stabilizes Cholesky, but very small λ can still lose accuracy due to large condition numbers. An iterative alternative (e.g., conjugate gradient on the normal equations or gradient descent on the primal) reduces per-iteration cost to O(np) with memory O(np), and often exhibits implicit regularization, but may require many iterations to converge near the threshold.

If you plan many evaluations over p, reusing pre-sampled features (taking prefixes) reduces variance across runs, clarifying the double descent shape without additional computational overhead. In high dimensions, consider batched formation of X⊤X to reduce memory pressure, and prefer numerically stable factorizations (SVD or QR) when available, albeit at similar or higher asymptotic costs for dense problems.

Code Examples

Double descent with random Fourier features and tiny ridge (sweep features p)
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    int r, c;
    vector<double> a; // row-major
    Matrix() : r(0), c(0) {}
    Matrix(int r_, int c_) : r(r_), c(c_), a(r_*c_, 0.0) {}
    double &operator()(int i, int j) { return a[i*c + j]; }
    double operator()(int i, int j) const { return a[i*c + j]; }
};

Matrix transpose(const Matrix &M) {
    Matrix T(M.c, M.r);
    for (int i = 0; i < M.r; ++i)
        for (int j = 0; j < M.c; ++j)
            T(j, i) = M(i, j);
    return T;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    assert(A.c == B.r);
    Matrix C(A.r, B.c);
    for (int i = 0; i < A.r; ++i) {
        for (int k = 0; k < A.c; ++k) {
            double aik = A(i, k);
            if (aik == 0) continue;
            for (int j = 0; j < B.c; ++j) {
                C(i, j) += aik * B(k, j);
            }
        }
    }
    return C;
}

// Solve SPD system A x = b via Cholesky (A is modified locally by copy)
vector<double> choleskySolve(Matrix A, const vector<double> &b) {
    int n = A.r;
    assert(A.r == A.c);
    // Cholesky decomposition A = L L^T
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = A(i, j);
            for (int k = 0; k < j; ++k) sum -= A(i, k) * A(j, k);
            if (i == j) {
                if (sum <= 1e-18) sum = 1e-18; // jitter for numerical stability
                A(i, j) = sqrt(sum);
            } else {
                A(i, j) = sum / A(j, j);
            }
        }
        for (int j = i + 1; j < n; ++j) A(i, j) = 0.0; // store L in lower triangle
    }
    // Forward solve L y = b
    vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        for (int k = 0; k < i; ++k) sum -= A(i, k) * y[k];
        y[i] = sum / A(i, i);
    }
    // Backward solve L^T x = y
    vector<double> x(n, 0.0);
    for (int i = n - 1; i >= 0; --i) {
        double sum = y[i];
        for (int k = i + 1; k < n; ++k) sum -= A(k, i) * x[k];
        x[i] = sum / A(i, i);
    }
    return x;
}

// Build X^T X and X^T y for ridge regression
void normalEquations(const Matrix &X, const vector<double> &y, Matrix &XtX, vector<double> &Xty, double lambda) {
    int n = X.r, p = X.c;
    XtX = Matrix(p, p);
    Xty.assign(p, 0.0);
    // Compute XtX and Xty
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) {
            double xij = X(i, j);
            Xty[j] += xij * y[i];
            for (int k = 0; k <= j; ++k) { // fill lower triangle
                XtX(j, k) += xij * X(i, k);
            }
        }
    }
    // Symmetrize and add ridge to diagonal
    for (int j = 0; j < p; ++j) {
        for (int k = 0; k < j; ++k) XtX(k, j) = XtX(j, k);
        XtX(j, j) += lambda; // ensure SPD
    }
}

// Generate dataset: y = sin(2*pi*x) + noise
void make_dataset(int n, vector<double> &x, vector<double> &y, double noise_std, uint32_t seed) {
    mt19937 rng(seed);
    uniform_real_distribution<double> U(0.0, 1.0);
    normal_distribution<double> N(0.0, noise_std);
    x.resize(n); y.resize(n);
    const double PI = acos(-1.0);
    for (int i = 0; i < n; ++i) {
        x[i] = U(rng);
        y[i] = sin(2.0 * PI * x[i]) + N(rng);
    }
}

// Pre-sample random Fourier feature parameters
struct RFFParams {
    vector<double> w; // frequencies
    vector<double> b; // phases
};

RFFParams presample_rff(int p_max, double sigma_w, uint32_t seed) {
    mt19937 rng(seed);
    normal_distribution<double> N(0.0, sigma_w);
    uniform_real_distribution<double> U(0.0, 2.0 * acos(-1.0));
    RFFParams P; P.w.resize(p_max); P.b.resize(p_max);
    for (int j = 0; j < p_max; ++j) { P.w[j] = N(rng); P.b[j] = U(rng); }
    return P;
}

Matrix build_features(const vector<double> &x, int p, const RFFParams &P) {
    int n = (int)x.size();
    Matrix X(n, p);
    double scale = sqrt(2.0 / max(1, p));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) {
            X(i, j) = scale * cos(P.w[j] * x[i] + P.b[j]);
        }
    }
    return X;
}

double mse(const vector<double> &y, const vector<double> &yhat) {
    double s = 0.0; int n = (int)y.size();
    for (int i = 0; i < n; ++i) { double d = yhat[i] - y[i]; s += d * d; }
    return s / n;
}

vector<double> predict(const Matrix &X, const vector<double> &w) {
    int n = X.r;
    vector<double> yhat(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < (int)w.size(); ++j) s += X(i, j) * w[j];
        yhat[i] = s;
    }
    return yhat;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    const int n_train = 80;
    const int n_test = 1000;
    const int p_max = 200; // sweep up to 200 features (over n)
    const double noise_std = 0.2;
    const double lambda = 1e-8; // tiny ridge (almost interpolating)

    // Generate fixed train/test splits
    vector<double> x_tr, y_tr, x_te, y_te;
    make_dataset(n_train, x_tr, y_tr, noise_std, /*seed=*/123);
    make_dataset(n_test, x_te, y_te, noise_std, /*seed=*/456);

    // Pre-sample RFF parameters so p is expanded by prefix
    RFFParams P = presample_rff(p_max, /*sigma_w=*/4.0, /*seed=*/789);

    cout << "#p train_mse test_mse\n";
    for (int p = 5; p <= p_max; p += 5) { // sweep feature counts
        Matrix Xtr = build_features(x_tr, p, P);
        Matrix Xte = build_features(x_te, p, P);

        Matrix XtX; vector<double> Xty;
        normalEquations(Xtr, y_tr, XtX, Xty, lambda);
        vector<double> w = choleskySolve(XtX, Xty);

        vector<double> yhat_tr = predict(Xtr, w);
        vector<double> yhat_te = predict(Xte, w);
        double tr_mse = mse(y_tr, yhat_tr);
        double te_mse = mse(y_te, yhat_te);
        cout << p << " " << tr_mse << " " << te_mse << "\n";
    }
    return 0;
}

This program generates a 1D noisy sinusoidal dataset, builds random Fourier features (an explicit high-dimensional feature map), and trains ridge regression for different numbers of features p. Using a tiny ridge (lambda=1e-8) approximates the ridgeless, near-interpolating regime. As p sweeps from well below n to well above n, printing train/test MSE typically reveals a U-shaped region, a spike around p≈n (interpolation threshold), and a second descent for larger p—double descent.

Time: O(n p² + p³) per p value (building X⊤X and solving via Cholesky). Space: O(n p + p²).
Effect of regularization on the double descent spike
#include <bits/stdc++.h>
using namespace std;

struct Matrix { int r,c; vector<double> a; Matrix():r(0),c(0){} Matrix(int r_,int c_):r(r_),c(c_),a(r_*c_,0.0){} double &operator()(int i,int j){return a[i*c+j];} double operator()(int i,int j)const{return a[i*c+j];} };

vector<double> choleskySolve(Matrix A, const vector<double> &b) {
    int n = A.r;
    assert(A.r == A.c);
    for (int i = 0; i < n; ++i) {                   // A = L L^T, L in lower triangle
        for (int j = 0; j <= i; ++j) {
            double sum = A(i, j);
            for (int k = 0; k < j; ++k) sum -= A(i, k) * A(j, k);
            if (i == j) { if (sum <= 1e-18) sum = 1e-18; A(i, j) = sqrt(sum); }
            else A(i, j) = sum / A(j, j);
        }
        for (int j = i + 1; j < n; ++j) A(i, j) = 0.0;
    }
    vector<double> y(n, 0.0);                       // forward solve L y = b
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int k = 0; k < i; ++k) s -= A(i, k) * y[k];
        y[i] = s / A(i, i);
    }
    vector<double> x(n, 0.0);                       // backward solve L^T x = y
    for (int i = n - 1; i >= 0; --i) {
        double s = y[i];
        for (int k = i + 1; k < n; ++k) s -= A(k, i) * x[k];
        x[i] = s / A(i, i);
    }
    return x;
}

struct RFFParams { vector<double> w, b; };

RFFParams presample_rff(int p_max, double sigma_w, uint32_t seed){ mt19937 rng(seed); normal_distribution<double> N(0.0,sigma_w); uniform_real_distribution<double> U(0.0,2.0*acos(-1.0)); RFFParams P; P.w.resize(p_max); P.b.resize(p_max); for(int j=0;j<p_max;++j){ P.w[j]=N(rng); P.b[j]=U(rng);} return P; }

Matrix build_features(const vector<double> &x, int p, const RFFParams &P){ int n=(int)x.size(); Matrix X(n,p); double scale=sqrt(2.0/max(1,p)); for(int i=0;i<n;++i) for(int j=0;j<p;++j) X(i,j)=scale*cos(P.w[j]*x[i]+P.b[j]); return X; }

void normalEquations(const Matrix &X, const vector<double> &y, Matrix &XtX, vector<double> &Xty, double lambda){ int n=X.r,p=X.c; XtX=Matrix(p,p); Xty.assign(p,0.0); for(int i=0;i<n;++i) for(int j=0;j<p;++j){ double xij=X(i,j); Xty[j]+=xij*y[i]; for(int k=0;k<=j;++k) XtX(j,k)+=xij*X(i,k); } for(int j=0;j<p;++j){ for(int k=0;k<j;++k) XtX(k,j)=XtX(j,k); XtX(j,j)+=lambda; } }

void make_dataset(int n, vector<double> &x, vector<double> &y, double noise_std, uint32_t seed){ mt19937 rng(seed); uniform_real_distribution<double> U(0.0,1.0); normal_distribution<double> N(0.0,noise_std); x.resize(n); y.resize(n); const double PI=acos(-1.0); for(int i=0;i<n;++i){ x[i]=U(rng); y[i]=sin(2.0*PI*x[i])+N(rng); } }

vector<double> predict(const Matrix &X, const vector<double> &w){ int n=X.r; vector<double> yhat(n,0.0); for(int i=0;i<n;++i){ double s=0.0; for(int j=0;j<(int)w.size();++j) s+=X(i,j)*w[j]; yhat[i]=s; } return yhat; }

double mse(const vector<double> &y, const vector<double> &yhat){ double s=0.0; for(size_t i=0;i<y.size();++i){ double d=yhat[i]-y[i]; s+=d*d; } return s/y.size(); }

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
    const int n_train = 80, n_test = 1000, p_max = 200;
    const double noise_std = 0.2;
    vector<double> x_tr, y_tr, x_te, y_te;
    make_dataset(n_train, x_tr, y_tr, noise_std, 123);
    make_dataset(n_test, x_te, y_te, noise_std, 456);
    RFFParams P = presample_rff(p_max, 4.0, 789);
    cout << "#p test_mse_lambda1e-8 test_mse_lambda1e-2\n";
    for (int p = 5; p <= p_max; p += 5) {
        Matrix Xtr = build_features(x_tr, p, P), Xte = build_features(x_te, p, P);
        // tiny ridge (near-interpolating)
        Matrix A1; vector<double> b1;
        normalEquations(Xtr, y_tr, A1, b1, 1e-8);
        double te1 = mse(y_te, predict(Xte, choleskySolve(A1, b1)));
        // moderate ridge (smoother)
        Matrix A2; vector<double> b2;
        normalEquations(Xtr, y_tr, A2, b2, 1e-2);
        double te2 = mse(y_te, predict(Xte, choleskySolve(A2, b2)));
        cout << p << " " << te1 << " " << te2 << "\n";
    }
    return 0;
}

This program compares test MSE across feature counts p for two ridge strengths: λ=1e-8 (near-interpolating) and λ=1e-2 (moderately regularized). You should observe a pronounced spike near p≈n for tiny λ and a much smoother, often spike-free curve for the larger λ. This shows how explicit regularization shapes and can suppress the double descent peak.

Time: O(n p² + p³) per p value for each λ (two solves per p). Space: O(n p + p²).
#double descent #interpolation threshold #overparameterization #minimum-norm interpolant #ridge regression #bias variance tradeoff #random fourier features #benign overfitting #condition number #hat matrix #degrees of freedom #kernel methods #linear regression #generalization #implicit bias