Double Descent Phenomenon
Key Points
- Double descent describes how test error first follows the classic U-shape with increasing model complexity, spikes near the interpolation threshold, and then drops again in the highly overparameterized regime.
- The interpolation threshold occurs when the model has just enough parameters to fit (interpolate) the training data, often when the number of parameters roughly equals the number of training samples.
- In overparameterized settings, implicit or explicit regularization often selects a minimum-norm interpolating solution that can generalize surprisingly well (benign overfitting).
- Linear models with random features and tiny ridge regularization often exhibit clear double descent as we increase the number of features through and beyond the sample size.
- Numerical instability near the interpolation threshold can inflate test error and must be handled with care in implementations (e.g., by ridge or better solvers).
- Double descent does not mean ‘more parameters is always better’; label noise, data distribution, and optimization bias all matter.
- You can visualize double descent by sweeping feature count p from well below to well above the sample size n and plotting test MSE versus p.
- Regularization strength controls the spike: very small ridge highlights the spike; moderate ridge smooths or removes it.
Prerequisites
- Linear algebra (vectors, matrices, eigenvalues) — Understanding least-squares, normal equations, and conditioning is essential for interpreting interpolation and solver stability.
- Probability and statistics — Test risk, noise, expectations, and generalization are probabilistic concepts.
- Bias–variance trade-off — Double descent extends the classic U-shaped error curve; knowing the decomposition provides context.
- Ordinary least squares and ridge regression — Closed-form solutions, hat matrix, and regularization are core to reproducing double descent in linear models.
- Numerical linear algebra — Near the interpolation threshold, matrices are ill-conditioned; stable solvers and regularization matter.
- Feature engineering and kernels — Random Fourier features approximate kernels and provide controllable complexity by varying p.
- Cross-validation and model selection — To choose regularization and detect the second descent reliably, you need sound validation practices.
- Gradient-based optimization (optional) — Explains implicit bias toward minimum-norm interpolants in overparameterized settings.
Detailed Explanation
Overview
Double descent is a modern view of generalization that extends the classic bias–variance story. Traditionally, as model complexity grows (e.g., polynomial degree increases), test error shows a U-shaped curve: it first decreases (lower bias) and then increases (higher variance/overfitting). Double descent observes that this is not the end of the story. When models are pushed past the ‘interpolation threshold’—the point where they can fit the training data nearly perfectly—the test error can drop again as we add even more parameters. This produces a second descent after a peak around the threshold, hence the name ‘double descent’. The phenomenon has been observed across linear models with random features, kernel methods, and deep neural networks. The key idea is that in the overparameterized regime, learning algorithms often pick special solutions (like minimum-norm interpolants) that can generalize better than expected, especially when the features align with the signal and regularization—either explicit (e.g., ridge) or implicit (from optimization like gradient descent)—guides the solution. Practical implications include why very large models (like modern deep nets) can generalize well, and how regularization strength shapes the error curve near the interpolation threshold. Understanding double descent helps you choose model sizes and regularization in high-capacity settings and avoid misinterpreting spikes in validation error that come from numerical and statistical instability near the threshold.
Intuition & Analogies
Imagine studying for a test with a small set of practice problems. With a very simple strategy (few rules), you might miss important patterns and do poorly—a high-bias situation. As you add more rules, you capture more structure and improve. If you keep adding rules until you can memorize every practice problem exactly, you reach the point where your ‘model’ can interpolate the study set. Right at that point, you’re brittle: a tiny change in a problem can break your answer, and your test performance can spike worse (variance explodes). Surprisingly, if you keep adding even more flexible tools—say, lots of ways to represent solutions—but you also consistently choose the ‘simplest’ explanation that still fits all practice problems (like preferring the shortest or smoothest solution), your performance on new problems can improve again. You have many ways to fit the data, but you pick the most conservative one that fits—this is the minimum-norm solution.
Another analogy: fitting a curve through points with rubber bands. With too few pegs (parameters), the band can’t reach the points (underfit). As you add pegs, the fit improves. Right when you have just enough pegs to touch every point, a small nudge in one peg makes the band wiggle wildly—instability causes a peak in error. Add a lot more pegs and choose the smoothest way to thread the band through all points; the extra freedom lets you distribute tension evenly, reducing wild wiggles and improving stability, so test error falls again. Regularization acts like adding friction so the band doesn’t oscillate too much, especially near the unstable threshold.
Formal Definition
Let n be the number of training samples and p a measure of model capacity (e.g., the number of features). Write $\hat f_p$ for the model fitted at capacity p and $R(p) = \mathbb{E}_{(x,y)}\big[(y - \hat f_p(x))^2\big]$ for its test risk. Classical bias–variance analysis predicts a U-shaped $R(p)$ for $p < n$. Double descent is the observation that $R(p)$ typically rises to a peak near the interpolation threshold — the smallest capacity at which the empirical risk reaches (near) zero, often $p \approx n$ for linear models — and then decreases again as p grows past it, frequently reaching a level comparable to or below the first minimum.
When to Use
- Very high-capacity models: Deep neural networks, kernel machines, or linear models with many random features. Expect double descent when model size sweeps through the point where training error reaches (near) zero.
- Feature expansions: Polynomial/radial/random Fourier features where you can vary the number of features p through and beyond the sample size n.
- Studying regularization effects: To understand how tiny ridge (or implicit bias of gradient descent) influences generalization and stability near the interpolation threshold.
- Risk diagnostics: When you observe a spike in validation error as you scale parameters, recognize it may be the interpolation peak rather than a monotonic failure of generalization.
- Algorithm choice: If closed-form solvers are numerically unstable near p ≈ n, consider small ridge regularization, better-conditioned features, or iterative solvers (conjugate gradient) that implicitly bias toward minimum-norm solutions.
- Model selection: Use cross-validation across both under- and overparameterized regimes; don’t stop your search at the first U-shape bottom—check if a second descent exists for larger models, especially with modern architectures.
⚠️ Common Mistakes
- Assuming overparameterization always hurts: In many setups, test error drops again for p ≫ n due to implicit bias. Don’t conflate parameter count with poor generalization.
- Ignoring label noise: With heavy noise, interpolating solutions can generalize poorly; the second descent may be shallow or absent. Always consider noise level when interpreting curves.
- Misreading training error: Zero training error does not guarantee good or bad generalization; it only indicates you crossed the interpolation threshold.
- Numerical artifacts mistaken for theory: Spikes can be amplified by ill-conditioned solvers. Use ridge, stable decompositions (QR/SVD/Cholesky with jitter), and consistent feature sampling when plotting curves.
- Changing data across complexity: If you resample features or data at each complexity, randomness can obscure the shape. Fix a dataset and expand features consistently (prefix features) to reveal true trends.
- Over-regularizing away the phenomenon: Large ridge can smooth out the spike and mask double descent. That’s useful practically, but know that it changes the qualitative curve.
- Confusing implicit and explicit regularization: Gradient descent from zero in overparameterized linear models chooses the minimum-norm interpolant even without explicit ridge; this matters for interpreting results.
Key Formulas
Test Risk
$R(\hat f) = \mathbb{E}_{(x,y)}\big[(y - \hat f(x))^2\big]$
Explanation: This is the expected squared error of the predictor on new data. It captures generalization performance across the data distribution.
Empirical Risk (Squared Loss)
$\hat R(\hat f) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat f(x_i)\big)^2$
Explanation: This is the average training error. When it reaches zero (or near zero), the model interpolates the training set.
Bias–Variance Decomposition
$\mathbb{E}\big[(y - \hat f(x))^2\big] = \mathrm{Bias}\big[\hat f(x)\big]^2 + \mathrm{Var}\big[\hat f(x)\big] + \sigma^2$
Explanation: Prediction error decomposes into squared bias, variance, and irreducible noise. The classic U-shape arises from decreasing bias and increasing variance as complexity grows.
Ridge Solution
$\hat w_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$
Explanation: This is the closed-form solution for L2-regularized least squares. The $\lambda I$ term stabilizes the inversion and reduces variance, especially near the interpolation threshold.
Minimum-Norm Interpolant
$\hat w = X^{+} y = X^\top (X X^\top)^{-1} y \quad (p \ge n,\ X \text{ full row rank})$
Explanation: Using the Moore–Penrose pseudoinverse, this gives the minimum Euclidean norm solution that interpolates the data when it is possible to fit exactly.
Hat Matrix (Ridge)
$H_\lambda = X (X^\top X + \lambda I)^{-1} X^\top$
Explanation: This matrix maps training targets $y$ to fitted values $\hat y = H_\lambda y$. Its trace $\mathrm{df}(\lambda) = \mathrm{tr}(H_\lambda)$ acts as an effective degrees of freedom.
Effective Degrees of Freedom
$\mathrm{df}(\lambda) = \sum_{j} \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$
Explanation: This scalar measures model flexibility as a function of $\lambda$, where $\sigma_j$ are the singular values of $X$. As $\lambda$ increases, $\mathrm{df}(\lambda)$ decreases, smoothing the double descent spike.
Interpolation Threshold (Linear)
$p \approx n$
Explanation: For linear models whose feature matrix has full row rank, the threshold often occurs when the parameter count p meets the sample size n, allowing exact fits.
Condition Number
$\kappa(X^\top X) = \sigma_{\max}^2 / \sigma_{\min}^2$
Explanation: The ratio of the largest to smallest squared singular value indicates numerical instability. It tends to blow up near the interpolation threshold, amplifying variance.
Aspect Ratio
$\gamma = p / n$
Explanation: In high-dimensional asymptotics, the feature-to-sample ratio $\gamma$ controls spectral behavior and helps explain risk peaks around $\gamma = 1$.
Kernel Interpolant
$\hat f(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i), \qquad \alpha = K^{-1} y$
Explanation: Kernel methods often interpolate with many implicit features. The coefficients $\alpha$ result from solving a linear system in the kernel matrix $K$.
Normal Equations (Ridgeless)
$X^\top X\, w = X^\top y$
Explanation: Stationarity condition for least squares without explicit regularization. In overparameterized cases, gradient descent from zero converges to the minimum-norm solution that satisfies these equations.
Complexity Analysis
For the linear-model experiments below with n samples and p features: building the n×p feature matrix costs O(np); forming $X^\top X$ and $X^\top y$ costs O(np²); the Cholesky factorization and triangular solves cost O(p³). Sweeping p from small values up to p_max is therefore dominated by the largest p, roughly O(n·p_max² + p_max³) per step. Memory is O(np + p²).
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    int r, c;
    vector<double> a; // row-major
    Matrix() : r(0), c(0) {}
    Matrix(int r_, int c_) : r(r_), c(c_), a(r_*c_, 0.0) {}
    double &operator()(int i, int j) { return a[i*c + j]; }
    double operator()(int i, int j) const { return a[i*c + j]; }
};

Matrix transpose(const Matrix &M) {
    Matrix T(M.c, M.r);
    for (int i = 0; i < M.r; ++i)
        for (int j = 0; j < M.c; ++j)
            T(j, i) = M(i, j);
    return T;
}

Matrix matmul(const Matrix &A, const Matrix &B) {
    assert(A.c == B.r);
    Matrix C(A.r, B.c);
    for (int i = 0; i < A.r; ++i) {
        for (int k = 0; k < A.c; ++k) {
            double aik = A(i, k);
            if (aik == 0) continue;
            for (int j = 0; j < B.c; ++j) {
                C(i, j) += aik * B(k, j);
            }
        }
    }
    return C;
}

// Solve SPD system A x = b via Cholesky (A is modified locally by copy)
vector<double> choleskySolve(Matrix A, const vector<double> &b) {
    int n = A.r;
    assert(A.r == A.c);
    // Cholesky decomposition A = L L^T
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = A(i, j);
            for (int k = 0; k < j; ++k) sum -= A(i, k) * A(j, k);
            if (i == j) {
                if (sum <= 1e-18) sum = 1e-18; // jitter for numerical stability
                A(i, j) = sqrt(sum);
            } else {
                A(i, j) = sum / A(j, j);
            }
        }
        for (int j = i + 1; j < n; ++j) A(i, j) = 0.0; // store L in lower triangle
    }
    // Forward solve L y = b
    vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        for (int k = 0; k < i; ++k) sum -= A(i, k) * y[k];
        y[i] = sum / A(i, i);
    }
    // Backward solve L^T x = y
    vector<double> x(n, 0.0);
    for (int i = n - 1; i >= 0; --i) {
        double sum = y[i];
        for (int k = i + 1; k < n; ++k) sum -= A(k, i) * x[k];
        x[i] = sum / A(i, i);
    }
    return x;
}

// Build X^T X and X^T y for ridge regression
void normalEquations(const Matrix &X, const vector<double> &y, Matrix &XtX,
                     vector<double> &Xty, double lambda) {
    int n = X.r, p = X.c;
    XtX = Matrix(p, p);
    Xty.assign(p, 0.0);
    // Compute XtX and Xty
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) {
            double xij = X(i, j);
            Xty[j] += xij * y[i];
            for (int k = 0; k <= j; ++k) { // fill lower triangle
                XtX(j, k) += xij * X(i, k);
            }
        }
    }
    // Symmetrize and add ridge to diagonal
    for (int j = 0; j < p; ++j) {
        for (int k = 0; k < j; ++k) XtX(k, j) = XtX(j, k);
        XtX(j, j) += lambda; // ensure SPD
    }
}

// Generate dataset: y = sin(2*pi*x) + noise
void make_dataset(int n, vector<double> &x, vector<double> &y, double noise_std, uint32_t seed) {
    mt19937 rng(seed);
    uniform_real_distribution<double> U(0.0, 1.0);
    normal_distribution<double> N(0.0, noise_std);
    x.resize(n); y.resize(n);
    const double PI = acos(-1.0);
    for (int i = 0; i < n; ++i) {
        x[i] = U(rng);
        y[i] = sin(2.0 * PI * x[i]) + N(rng);
    }
}

// Pre-sample random Fourier feature parameters
struct RFFParams {
    vector<double> w; // frequencies
    vector<double> b; // phases
};

RFFParams presample_rff(int p_max, double sigma_w, uint32_t seed) {
    mt19937 rng(seed);
    normal_distribution<double> N(0.0, sigma_w);
    uniform_real_distribution<double> U(0.0, 2.0 * acos(-1.0));
    RFFParams P; P.w.resize(p_max); P.b.resize(p_max);
    for (int j = 0; j < p_max; ++j) { P.w[j] = N(rng); P.b[j] = U(rng); }
    return P;
}

Matrix build_features(const vector<double> &x, int p, const RFFParams &P) {
    int n = (int)x.size();
    Matrix X(n, p);
    double scale = sqrt(2.0 / max(1, p));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) {
            X(i, j) = scale * cos(P.w[j] * x[i] + P.b[j]);
        }
    }
    return X;
}

double mse(const vector<double> &y, const vector<double> &yhat) {
    double s = 0.0; int n = (int)y.size();
    for (int i = 0; i < n; ++i) { double d = yhat[i] - y[i]; s += d * d; }
    return s / n;
}

vector<double> predict(const Matrix &X, const vector<double> &w) {
    int n = X.r, p = X.c; (void)p;
    vector<double> yhat(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < (int)w.size(); ++j) s += X(i, j) * w[j];
        yhat[i] = s;
    }
    return yhat;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    const int n_train = 80;
    const int n_test = 1000;
    const int p_max = 200;        // sweep up to 200 features (over n)
    const double noise_std = 0.2;
    const double lambda = 1e-8;   // tiny ridge (almost interpolating)

    // Generate fixed train/test splits
    vector<double> x_tr, y_tr, x_te, y_te;
    make_dataset(n_train, x_tr, y_tr, noise_std, /*seed=*/123);
    make_dataset(n_test, x_te, y_te, noise_std, /*seed=*/456);

    // Pre-sample RFF parameters so p is expanded by prefix
    RFFParams P = presample_rff(p_max, /*sigma_w=*/4.0, /*seed=*/789);

    cout << "#p train_mse test_mse\n";
    for (int p = 5; p <= p_max; p += 5) { // sweep feature counts
        Matrix Xtr = build_features(x_tr, p, P);
        Matrix Xte = build_features(x_te, p, P);

        Matrix XtX; vector<double> Xty;
        normalEquations(Xtr, y_tr, XtX, Xty, lambda);
        vector<double> w = choleskySolve(XtX, Xty);

        vector<double> yhat_tr = predict(Xtr, w);
        vector<double> yhat_te = predict(Xte, w);
        double tr_mse = mse(y_tr, yhat_tr);
        double te_mse = mse(y_te, yhat_te);
        cout << p << " " << tr_mse << " " << te_mse << "\n";
    }
    return 0;
}
```
This program generates a 1D noisy sinusoidal dataset, builds random Fourier features (an explicit high-dimensional feature map), and trains ridge regression for different numbers of features p. Using a tiny ridge (lambda=1e-8) approximates the ridgeless, near-interpolating regime. As p sweeps from well below n to well above n, printing train/test MSE typically reveals a U-shaped region, a spike around p≈n (interpolation threshold), and a second descent for larger p—double descent.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    int r, c; vector<double> a;
    Matrix() : r(0), c(0) {}
    Matrix(int r_, int c_) : r(r_), c(c_), a(r_*c_, 0.0) {}
    double &operator()(int i, int j) { return a[i*c + j]; }
    double operator()(int i, int j) const { return a[i*c + j]; }
};

Matrix transpose(const Matrix &M) {
    Matrix T(M.c, M.r);
    for (int i = 0; i < M.r; ++i)
        for (int j = 0; j < M.c; ++j) T(j, i) = M(i, j);
    return T;
}

vector<double> choleskySolve(Matrix A, const vector<double> &b) {
    int n = A.r; assert(A.r == A.c);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = A(i, j);
            for (int k = 0; k < j; ++k) sum -= A(i, k) * A(j, k);
            if (i == j) {
                if (sum <= 1e-18) sum = 1e-18; // jitter for numerical stability
                A(i, j) = sqrt(sum);
            } else {
                A(i, j) = sum / A(j, j);
            }
        }
        for (int j = i + 1; j < n; ++j) A(i, j) = 0.0;
    }
    vector<double> y(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double s = b[i];
        for (int k = 0; k < i; ++k) s -= A(i, k) * y[k];
        y[i] = s / A(i, i);
    }
    vector<double> x(n, 0.0);
    for (int i = n - 1; i >= 0; --i) {
        double s = y[i];
        for (int k = i + 1; k < n; ++k) s -= A(k, i) * x[k];
        x[i] = s / A(i, i);
    }
    return x;
}

struct RFFParams { vector<double> w, b; };

RFFParams presample_rff(int p_max, double sigma_w, uint32_t seed) {
    mt19937 rng(seed);
    normal_distribution<double> N(0.0, sigma_w);
    uniform_real_distribution<double> U(0.0, 2.0 * acos(-1.0));
    RFFParams P; P.w.resize(p_max); P.b.resize(p_max);
    for (int j = 0; j < p_max; ++j) { P.w[j] = N(rng); P.b[j] = U(rng); }
    return P;
}

Matrix build_features(const vector<double> &x, int p, const RFFParams &P) {
    int n = (int)x.size();
    Matrix X(n, p);
    double scale = sqrt(2.0 / max(1, p));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < p; ++j) X(i, j) = scale * cos(P.w[j] * x[i] + P.b[j]);
    return X;
}

void normalEquations(const Matrix &X, const vector<double> &y, Matrix &XtX,
                     vector<double> &Xty, double lambda) {
    int n = X.r, p = X.c;
    XtX = Matrix(p, p); Xty.assign(p, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < p; ++j) {
            double xij = X(i, j);
            Xty[j] += xij * y[i];
            for (int k = 0; k <= j; ++k) XtX(j, k) += xij * X(i, k);
        }
    }
    for (int j = 0; j < p; ++j) {
        for (int k = 0; k < j; ++k) XtX(k, j) = XtX(j, k);
        XtX(j, j) += lambda;
    }
}

void make_dataset(int n, vector<double> &x, vector<double> &y, double noise_std, uint32_t seed) {
    mt19937 rng(seed);
    uniform_real_distribution<double> U(0.0, 1.0);
    normal_distribution<double> N(0.0, noise_std);
    x.resize(n); y.resize(n);
    const double PI = acos(-1.0);
    for (int i = 0; i < n; ++i) { x[i] = U(rng); y[i] = sin(2.0 * PI * x[i]) + N(rng); }
}

vector<double> predict(const Matrix &X, const vector<double> &w) {
    int n = X.r;
    vector<double> yhat(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < (int)w.size(); ++j) s += X(i, j) * w[j];
        yhat[i] = s;
    }
    return yhat;
}

double mse(const vector<double> &y, const vector<double> &yhat) {
    double s = 0.0;
    for (size_t i = 0; i < y.size(); ++i) { double d = yhat[i] - y[i]; s += d * d; }
    return s / y.size();
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
    const int n_train = 80, n_test = 1000, p_max = 200;
    const double noise_std = 0.2;
    vector<double> x_tr, y_tr, x_te, y_te;
    make_dataset(n_train, x_tr, y_tr, noise_std, 123);
    make_dataset(n_test, x_te, y_te, noise_std, 456);
    RFFParams P = presample_rff(p_max, 4.0, 789);
    cout << "#p test_mse_lambda1e-8 test_mse_lambda1e-2\n";
    for (int p = 5; p <= p_max; p += 5) {
        Matrix Xtr = build_features(x_tr, p, P), Xte = build_features(x_te, p, P);
        // tiny ridge (near-interpolating)
        Matrix A1; vector<double> b1;
        normalEquations(Xtr, y_tr, A1, b1, 1e-8);
        vector<double> w1 = choleskySolve(A1, b1);
        double te1 = mse(y_te, predict(Xte, w1));
        // moderate ridge (smoother)
        Matrix A2; vector<double> b2;
        normalEquations(Xtr, y_tr, A2, b2, 1e-2);
        vector<double> w2 = choleskySolve(A2, b2);
        double te2 = mse(y_te, predict(Xte, w2));
        cout << p << " " << te1 << " " << te2 << "\n";
    }
    return 0;
}
```
This program compares test MSE across feature counts p for two ridge strengths: λ=1e-8 (near-interpolating) and λ=1e-2 (moderately regularized). You should observe a pronounced spike near p≈n for tiny λ and a much smoother, often spike-free curve for the larger λ. This shows how explicit regularization shapes and can suppress the double descent peak.