šŸ“š Theory · Intermediate

Bias-Variance Tradeoff

Key Points

  • The bias–variance tradeoff explains how prediction error splits into bias squared, variance, and irreducible noise for squared loss.
  • Bias measures how far the average model is from the truth, while variance measures how much the model fluctuates across different training sets.
  • As model complexity increases, bias typically decreases but variance increases, creating a U-shaped test error curve in classical settings.
  • Regularization techniques like ridge regression reduce variance by shrinking parameters, often at the cost of increased bias.
  • Cross-validation helps choose the model complexity or regularization strength that minimizes expected test error.
  • Deep learning sometimes exhibits double descent, where test error falls again beyond the interpolation threshold, complicating the classical picture.
  • Estimating bias and variance empirically requires repeated resampling and evaluation at fixed inputs.
  • Good practice balances model capacity, data size, and regularization to minimize overall mean squared error.

Prerequisites

  • Probability basics (expectation and variance) — Bias and variance are defined using expectations over data distributions.
  • Linear regression and least squares — Polynomial regression and normal equations underpin the examples and the variance behavior.
  • Regularization (ridge/L2) — Understanding how L2 penalties affect parameter estimates explains the variance reduction mechanism.
  • Cross-validation — Model selection to balance bias and variance relies on validation error estimation.
  • Numerical linear algebra (conditioning, Gaussian elimination) — Stability issues in normal equations affect variance and motivate ridge.

Detailed Explanation

01 Overview

Imagine trying to hit a target with darts while wearing slightly foggy glasses in a breezy room. If you always hit near the same wrong spot, that’s bias. If your hits scatter widely from throw to throw, that’s variance. In machine learning, the bias–variance tradeoff formalizes this intuition for prediction problems, especially with squared loss. The total expected squared error at a point can be decomposed into three parts: bias squared, variance, and irreducible noise. Bias comes from systematic mismatch between the model’s assumptions and the true data-generating process. Variance comes from sensitivity to the specific training sample you happened to observe. The irreducible noise comes from randomness in the world you can’t eliminate even with a perfect model. Classically, as you make a model more flexible (higher-degree polynomials, deeper trees), bias decreases because the model can represent more functions. However, flexibility also tends to increase variance because small changes in training data can swing the learned parameters more. This tug-of-war often yields a U-shaped test error as capacity grows, meaning there is an optimal middle ground. Regularization (like ridge or L2) counters variance by discouraging large parameter values, nudging the model toward simpler solutions. In modern deep learning, the picture is subtler: beyond the point where the model can interpolate the training data, test error can decrease again (double descent), but the bias–variance language still helps reason about stability, data needs, and regularization.
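
As a purely illustrative, made-up numeric example of the decomposition: if, at some input, the average prediction across training sets misses the truth by 0.3, the predictions have standard deviation 0.4 across training sets, and the noise standard deviation is 0.2, then the expected squared error is \(0.3^2 + 0.4^2 + 0.2^2 = 0.09 + 0.16 + 0.04 = 0.29\), and no amount of modeling can remove the final 0.04.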

02 Intuition & Analogies

Think of modeling like fitting a flexible metal wire to a bumpy road. If your wire is stiff, it can’t bend enough to follow the road’s twists—this is high bias: you consistently miss certain shapes. If your wire is ultra-flexible, it can snake through every pebble and groove—even the accidental bumps that are just noise—this is high variance: your fit depends heavily on small quirks of the particular road segment you saw. The ideal wire is flexible enough to follow real curves but not so flexible that it chases random bumps. Another analogy: studying for an exam. Memorizing a few formulas without understanding yields high bias—your answers are consistently off because your mental model is too simple. Cramming every past paper question and overfitting to their wording yields high variance—you perform well on those exact questions but falter on new ones. Balanced study—understanding principles and practicing varied questions—keeps bias and variance both in check. In software terms, bias is like a bug in the algorithm’s logic that affects every run the same way, while variance is like reading uninitialized memory: outcomes jump around unpredictably depending on incidental states. Regularization is a guardrail that limits how far the program’s internal states can drift, reducing this randomness at the cost of some rigidity. Cross-validation is your feedback loop: you try configurations and pick the one that generalizes best. Even if deep nets sometimes break the classic U-shape via double descent, the practical knobs—capacity, data size, regularization, and averaging—still control the balance.

03 Formal Definition

Consider a target function f(x) and observations \(y = f(x) + \varepsilon\) with noise \(\varepsilon\) of zero mean and variance \(\sigma^2\). Let \(\hat f(x; D)\) be a learned predictor from training data D. For squared loss, the pointwise expected prediction error decomposes as: \(\mathbb{E}_{D,\varepsilon}[(\hat f(x) - y)^2] = (\mathbb{E}_D[\hat f(x)] - f(x))^2 + \mathrm{Var}_D[\hat f(x)] + \sigma^2.\) Here, bias at x is \(\mathrm{Bias}[\hat f(x)] = \mathbb{E}_D[\hat f(x)] - f(x)\), the systematic error from the model class and learning algorithm. Variance at x is \(\mathrm{Var}[\hat f(x)] = \mathbb{E}_D[(\hat f(x) - \mathbb{E}_D[\hat f(x)])^2]\), the sensitivity to the sampled dataset. The final term \(\sigma^2\) is irreducible noise from the data-generating process. Averaging over x with respect to the input distribution p(x) gives the expected generalization error under squared loss: \(\mathbb{E}_x \mathbb{E}_{D,\varepsilon}[(\hat f(x) - y)^2] = \mathbb{E}_x[(\mathrm{Bias}[\hat f(x)])^2] + \mathbb{E}_x[\mathrm{Var}[\hat f(x)]] + \sigma^2.\) In classical settings, increasing hypothesis class capacity reduces the bias term but increases the variance term; regularization parameters (e.g., the ridge penalty \(\lambda\)) and the sample size n modulate the magnitude of variance, which often scales roughly like (number of parameters)/n.
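
For completeness, here is a standard derivation sketch of the pointwise identity in the notation above (assuming the noise \(\varepsilon\) is independent of the training data \(D\) and has mean zero):

\[
\begin{aligned}
\mathbb{E}_{D,\varepsilon}\big[(\hat f(x) - y)^2\big]
  &= \mathbb{E}_{D,\varepsilon}\big[(\hat f(x) - f(x) - \varepsilon)^2\big]
   = \mathbb{E}_{D}\big[(\hat f(x) - f(x))^2\big] + \sigma^2 \\
  &= \mathbb{E}_{D}\Big[\big(\hat f(x) - \mathbb{E}_D[\hat f(x)] + \mathbb{E}_D[\hat f(x)] - f(x)\big)^2\Big] + \sigma^2 \\
  &= \underbrace{\big(\mathbb{E}_D[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}_D\big[(\hat f(x) - \mathbb{E}_D[\hat f(x)])^2\big]}_{\text{Variance}}
   + \sigma^2 .
\end{aligned}
\]

The cross terms vanish: in the first step because \(\mathbb{E}[\varepsilon] = 0\) and \(\varepsilon\) is independent of \(\hat f(x)\), and in the last step because \(\mathbb{E}_D\big[\hat f(x) - \mathbb{E}_D[\hat f(x)]\big] = 0\) while \(\mathbb{E}_D[\hat f(x)] - f(x)\) is a constant with respect to \(D\).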

04 When to Use

Use the bias–variance framework whenever you evaluate regression models under squared loss or when you want to reason about generalization behavior qualitatively. It is especially useful for comparing model classes (e.g., linear vs. polynomial vs. tree-based), deciding how much regularization to apply, or estimating how much additional data could help. For example, if a model underfits across many datasets and hyperparameters, you likely face high bias; techniques like increasing model capacity, adding relevant features, or reducing regularization can help. If a model performs well on training data but poorly on validation sets, you likely face high variance; techniques include stronger regularization, ensembling (bagging/averaging), simplifying the model, or getting more data. In pipeline design, this tradeoff guides choices such as: selecting polynomial degree, choosing k in k-NN (small k = low bias/high variance; large k = high bias/low variance), tuning tree depth, or setting ridge/Lasso penalties. Cross-validation and learning curves operationalize this theory: cross-validation chooses hyperparameters that minimize estimated test error; learning curves reveal whether error is data-limited (variance dominated) or capacity-limited (bias dominated). Even in deep learning with double descent, you still apply regularization (weight decay, early stopping, data augmentation) to tame variance and rely on validation error for selection.
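
To make the selection step concrete, the following minimal sketch (not one of the reference programs below) picks a polynomial degree by K-fold cross-validation on the same synthetic \(y = \sin(2\pi x) + \varepsilon\) setup used in the code examples. The helper names (fit_poly, predict) and all parameter values are illustrative choices, and a production version would use a library solver rather than naive normal equations.

#include <bits/stdc++.h>
using namespace std;

// Fit a degree-d polynomial by solving the normal equations (A^T A) w = A^T y
// with Gauss-Jordan elimination and partial pivoting. Returns false if the
// system is numerically singular.
bool fit_poly(const vector<double>& xs, const vector<double>& ys, int d, vector<double>& w) {
    int m = d + 1, n = (int)xs.size();
    vector<vector<double>> A(m, vector<double>(m + 1, 0.0)); // augmented [A^T A | A^T y]
    for (int i = 0; i < n; ++i) {
        vector<double> f(m);
        double p = 1.0;
        for (int j = 0; j < m; ++j) { f[j] = p; p *= xs[i]; }
        for (int j = 0; j < m; ++j) {
            A[j][m] += f[j] * ys[i];
            for (int k = 0; k < m; ++k) A[j][k] += f[j] * f[k];
        }
    }
    for (int c = 0; c < m; ++c) {
        int piv = c;
        for (int r = c + 1; r < m; ++r) if (fabs(A[r][c]) > fabs(A[piv][c])) piv = r;
        if (fabs(A[piv][c]) < 1e-12) return false;
        swap(A[piv], A[c]);
        for (int r = 0; r < m; ++r) {
            if (r == c) continue;
            double fac = A[r][c] / A[c][c];
            for (int k = c; k <= m; ++k) A[r][k] -= fac * A[c][k];
        }
    }
    w.resize(m);
    for (int j = 0; j < m; ++j) w[j] = A[j][m] / A[j][j];
    return true;
}

double predict(const vector<double>& w, double x) {
    double r = 0.0, p = 1.0;
    for (double c : w) { r += c * p; p *= x; }
    return r;
}

int main() {
    // Synthetic dataset: y = sin(2*pi*x) + Gaussian noise (illustrative values).
    mt19937 rng(7);
    uniform_real_distribution<double> U(0.0, 1.0);
    normal_distribution<double> noise(0.0, 0.2);
    int n = 60, K = 5;
    vector<double> xs(n), ys(n);
    for (int i = 0; i < n; ++i) { xs[i] = U(rng); ys[i] = sin(2.0 * M_PI * xs[i]) + noise(rng); }

    cout.setf(ios::fixed); cout << setprecision(6);
    // Score each candidate degree by its average validation MSE over K folds.
    for (int d = 1; d <= 8; ++d) {
        double cv = 0.0;
        for (int k = 0; k < K; ++k) {
            vector<double> trx, trY, vx, vy;
            for (int i = 0; i < n; ++i) {
                if (i % K == k) { vx.push_back(xs[i]); vy.push_back(ys[i]); }
                else            { trx.push_back(xs[i]); trY.push_back(ys[i]); }
            }
            vector<double> w;
            if (!fit_poly(trx, trY, d, w)) { cv = INFINITY; break; } // skip unstable fits
            double mse = 0.0;
            for (size_t i = 0; i < vx.size(); ++i) {
                double e = predict(w, vx[i]) - vy[i];
                mse += e * e;
            }
            cv += mse / vx.size();
        }
        cout << "degree " << d << "  CV MSE " << cv / K << "\n";
    }
    return 0;
}

The degree with the smallest average validation MSE is the cross-validation choice; degrees that underfit show high CV error driven by bias, while degrees that overfit show high CV error driven by variance.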

āš ļøCommon Mistakes

Common mistakes include:

  1. Confusing training error with generalization error. Low training MSE can coexist with high test MSE when variance is high; always evaluate via validation or cross-validation.
  2. Ignoring the irreducible noise term \(\sigma^2\). Even a perfect model cannot beat this lower bound; chasing it can lead to overfitting.
  3. Estimating bias/variance without fixing x. The decomposition is pointwise; averaging across x should come after computing pointwise quantities.
  4. Using the MSE decomposition with non-squared losses. The clean bias–variance equality holds specifically for squared loss; analogues for other losses exist but differ.
  5. Over-regularizing to kill variance, thereby inducing large bias and systematic underfitting; monitor both training and validation errors.
  6. Forgetting data leakage. Leakage shrinks apparent variance and bias estimates, giving misleadingly optimistic results.
  7. Neglecting numerical stability: solving normal equations without care (poor conditioning) can inflate variance; prefer well-conditioned formulations, standardization, or regularization.
  8. Misreading double descent as a reason to ignore validation. Even with overparameterization, validation-guided selection remains critical.

To avoid these pitfalls: always separate train/validation/test, use repeated resampling when estimating bias/variance empirically, standardize features (see the sketch below), consider ridge regularization for stability, and use appropriate loss-aware diagnostics.
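
As a small illustration of the stability advice above, this sketch z-scores each feature column before fitting; the tiny hard-coded matrix is hypothetical and only demonstrates the transform. To respect the leakage pitfall (item 6), the means and standard deviations should be computed on the training split and then reused unchanged on validation and test data.

#include <bits/stdc++.h>
using namespace std;

int main() {
    // Toy design matrix: two raw features on very different scales (hypothetical values).
    vector<vector<double>> X = {
        {1.0, 1000.0}, {2.0, 1500.0}, {3.0, 900.0}, {4.0, 2000.0}, {5.0, 1700.0}
    };
    size_t n = X.size(), m = X[0].size();

    // Per-column mean and standard deviation (computed on the training split only).
    vector<double> mean(m, 0.0), sd(m, 0.0);
    for (auto& row : X) for (size_t j = 0; j < m; ++j) mean[j] += row[j];
    for (size_t j = 0; j < m; ++j) mean[j] /= n;
    for (auto& row : X) for (size_t j = 0; j < m; ++j) {
        double d = row[j] - mean[j];
        sd[j] += d * d;
    }
    for (size_t j = 0; j < m; ++j) sd[j] = sqrt(sd[j] / n);

    // Standardize in place: each column now has mean 0 and unit variance, which
    // improves the conditioning of A^T A and makes a single ridge penalty lambda
    // act comparably on every coefficient.
    for (auto& row : X) for (size_t j = 0; j < m; ++j)
        row[j] = (sd[j] > 0.0) ? (row[j] - mean[j]) / sd[j] : 0.0;

    for (auto& row : X) {
        for (size_t j = 0; j < m; ++j) cout << row[j] << (j + 1 < m ? ' ' : '\n');
    }
    return 0;
}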

Key Formulas

Bias–Variance Decomposition (Pointwise)

\(\mathbb{E}_{D,\varepsilon}\big[(\hat f(x) - y)^2\big] = \big(\mathbb{E}_D[\hat f(x)] - f(x)\big)^2 + \mathrm{Var}_D[\hat f(x)] + \sigma^2\)

Explanation: At any fixed x, the expected squared prediction error equals bias squared plus variance plus irreducible noise. This identity holds exactly for squared loss.

Bias Definition

\(\mathrm{Bias}[\hat f(x)] = \mathbb{E}_D[\hat f(x)] - f(x)\)

Explanation: Bias is the systematic difference between the average prediction across datasets and the true function value at x. Large magnitude indicates underfitting or misspecification.

Variance Definition

\(\mathrm{Var}[\hat f(x)] = \mathbb{E}_D\big[(\hat f(x) - \mathbb{E}_D[\hat f(x)])^2\big]\)

Explanation: Variance captures how much predictions fluctuate when retraining on new samples. High variance indicates sensitivity to data noise or limited data.

Expected Generalization Error

\(\mathbb{E}_x \mathbb{E}_{D,\varepsilon}\big[(\hat f(x) - y)^2\big] = \mathbb{E}_x\big[\mathrm{Bias}[\hat f(x)]^2\big] + \mathbb{E}_x\big[\mathrm{Var}[\hat f(x)]\big] + \sigma^2\)

Explanation: Averaging over the input distribution yields the expected test MSE. This is the objective minimized by selecting appropriate capacity or regularization.

Ridge Regression Closed Form

\(\hat{w}_{\text{ridge}} = (A^\top A + \lambda I)^{-1} A^\top y\)

Explanation: The L2-regularized least squares solution adds \(\lambda I\) to stabilize inversion and reduce variance. Larger \(\lambda\) increases bias but decreases variance.

K-Fold Cross-Validation Estimate

\(\mathrm{CV}_K = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|V_k|} \sum_{i \in V_k} \big(y_i - \hat f^{(-k)}(x_i)\big)^2\), where \(V_k\) is the k-th validation fold and \(\hat f^{(-k)}\) is trained on the remaining folds.

Explanation: This averages validation MSEs across K folds, each trained without the validation fold. It estimates test error to guide complexity/regularization choices.

Vandermonde Features

\(A = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{pmatrix}\)

Explanation: Polynomial regression uses powers of x as features. The matrix A collects these features to fit coefficients via least squares or ridge.

Law of Total Variance

\(\mathrm{Var}(Y) = \mathbb{E}\big[\mathrm{Var}(Y \mid X)\big] + \mathrm{Var}\big(\mathbb{E}[Y \mid X]\big)\)

Explanation: This variance identity underlies the bias–variance decomposition proof. It splits total variance into within-conditional and between-conditional components.

Complexity Analysis

For polynomial regression of degree d with n training samples and m = d + 1 parameters, building the Vandermonde design matrix A costs O(n m) time and O(n m) space. Forming the normal equations S = Aįµ€A (an mƗm matrix) and b = Aįµ€y costs O(n m²) time and O(m²) space because each row of A contributes an O(m²) outer product. Solving the linear system S w = b via Gaussian elimination with partial pivoting is O(m³) time and O(m²) space. Therefore, per fit the dominant cost is O(n m² + m³). When n ≫ m, forming S dominates; when m is moderate but d is large, the O(m³) solve can dominate. In the bias–variance simulations with T resamples and M test points, the total time is O(T (n m² + m³) + T M m), where the final term evaluates the polynomial at M points per trial. The space to store predictions for aggregation is O(T M), plus O(n m) if the design matrix is held explicitly per fit (we can stream features to reduce memory). In the ridge variant, replacing S by S + λI does not change the asymptotic complexity; numerical stability improves but costs only an additional O(m) to add the diagonal. For the regularization sweep across L values of λ, the total time is O(L T (n m² + m³) + L T M m). If λ is large, the effective condition number improves, reducing floating-point error though not big-O complexity. Overall, for small d (e.g., ≤ 10), these simulations are fast; for large d, consider orthogonal bases or QR decomposition (O(n m²)) for better numerical stability with similar asymptotics.
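
To put rough numbers on this (a back-of-the-envelope estimate using the first example's parameters n = 25, d = 9, hence m = 10, T = 200, M = 50, ignoring constant factors): forming S costs about n m² = 2,500 multiply-adds per fit, the elimination about m³ = 1,000, and evaluating the fitted polynomial on the grid about M m = 500, so a single trial is a few thousand floating-point operations and all T = 200 trials amount to roughly 10⁶ operations, which runs in well under a second on current hardware.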

Code Examples

Estimating Bias and Variance for Polynomial Regression via Resampling
#include <bits/stdc++.h>
using namespace std;

// Generate noisy observations: y = f(x) + epsilon, with f(x) = sin(2*pi*x)
static inline double true_function(double x) {
    return sin(2.0 * M_PI * x);
}

// Build polynomial features up to degree d for a scalar x: [1, x, x^2, ...]
static inline void poly_features(double x, int d, vector<double>& feat) {
    feat.resize(d + 1);
    double p = 1.0;
    for (int j = 0; j <= d; ++j) {
        feat[j] = p;
        p *= x;
    }
}

// Solve linear system S w = b using Gaussian elimination with partial pivoting
bool solve_linear_system(vector<vector<double>>& S, vector<double>& b, vector<double>& w) {
    int m = (int)S.size();
    w.assign(m, 0.0);
    // Build augmented matrix [S | b]
    vector<vector<double>> A(m, vector<double>(m + 1));
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < m; ++j) A[i][j] = S[i][j];
        A[i][m] = b[i];
    }
    // Forward elimination
    for (int col = 0; col < m; ++col) {
        // Pivot selection
        int pivot = col;
        double best = fabs(A[col][col]);
        for (int r = col + 1; r < m; ++r) {
            double v = fabs(A[r][col]);
            if (v > best) { best = v; pivot = r; }
        }
        if (best < 1e-12) return false; // singular/ill-conditioned
        if (pivot != col) swap(A[pivot], A[col]);
        // Normalize pivot row
        double piv = A[col][col];
        for (int c = col; c <= m; ++c) A[col][c] /= piv;
        // Eliminate below
        for (int r = col + 1; r < m; ++r) {
            double factor = A[r][col];
            if (fabs(factor) < 1e-18) continue;
            for (int c = col; c <= m; ++c) {
                A[r][c] -= factor * A[col][c];
            }
        }
    }
    // Back substitution
    for (int i = m - 1; i >= 0; --i) {
        double sum = A[i][m];
        for (int c = i + 1; c < m; ++c) sum -= A[i][c] * w[c];
        w[i] = sum; // pivot entries are 1 after normalization
    }
    return true;
}

// Fit polynomial regression with optional ridge regularization (lambda)
bool fit_polynomial(const vector<double>& xs, const vector<double>& ys, int degree, double lambda, vector<double>& w) {
    int n = (int)xs.size();
    int m = degree + 1;
    vector<vector<double>> S(m, vector<double>(m, 0.0)); // A^T A (or with ridge)
    vector<double> b(m, 0.0); // A^T y
    vector<double> feat;
    // Accumulate normal equations
    for (int i = 0; i < n; ++i) {
        poly_features(xs[i], degree, feat); // size m
        for (int j = 0; j < m; ++j) {
            b[j] += feat[j] * ys[i];
            for (int k = 0; k < m; ++k) {
                S[j][k] += feat[j] * feat[k];
            }
        }
    }
    // Ridge: add lambda * I
    if (lambda > 0) {
        for (int j = 0; j < m; ++j) S[j][j] += lambda;
    }
    return solve_linear_system(S, b, w);
}

static inline double predict_poly(const vector<double>& w, double x) {
    double res = 0.0, p = 1.0;
    for (double cj : w) { res += cj * p; p *= x; }
    return res;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Simulation parameters
    int n = 25;          // training samples per dataset
    int degree = 9;      // polynomial degree (try 1, 3, 9)
    int T = 200;         // number of resamples (datasets)
    int M = 50;          // number of test x points
    double sigma = 0.15; // noise standard deviation
    unsigned seed = 42u; // RNG seed for reproducibility

    mt19937 rng(seed);
    uniform_real_distribution<double> unif01(0.0, 1.0);
    normal_distribution<double> noise(0.0, sigma);

    // Prepare test grid
    vector<double> xgrid(M);
    for (int j = 0; j < M; ++j) xgrid[j] = (j + 0.5) / M; // midpoints in (0,1)

    // Store predictions for each test x across T datasets
    vector<vector<double>> preds(M, vector<double>(T, 0.0));

    // Resampling loop
    for (int t = 0; t < T; ++t) {
        vector<double> xs(n), ys(n);
        for (int i = 0; i < n; ++i) {
            xs[i] = unif01(rng);
            ys[i] = true_function(xs[i]) + noise(rng);
        }
        vector<double> w;
        bool ok = fit_polynomial(xs, ys, degree, /*lambda=*/0.0, w);
        if (!ok) {
            cerr << "Warning: singular system at trial " << t << "\n";
            // Predictions for this trial stay at zero; rare failures perturb the
            // averages only slightly, but frequent failures would bias them.
        } else {
            for (int j = 0; j < M; ++j) preds[j][t] = predict_poly(w, xgrid[j]);
        }
    }

    // Estimate bias, variance, and MSE decomposition averaged over x
    double avg_bias2 = 0.0, avg_var = 0.0, avg_mse = 0.0;
    for (int j = 0; j < M; ++j) {
        double fx = true_function(xgrid[j]);
        double mean_pred = 0.0;
        for (int t = 0; t < T; ++t) mean_pred += preds[j][t];
        mean_pred /= T;
        double bias = mean_pred - fx;
        double var = 0.0;
        double mse = 0.0;
        for (int t = 0; t < T; ++t) {
            double e = preds[j][t] - fx;  // prediction error without noise
            mse += e * e + sigma * sigma; // expected total error includes irreducible noise
            double diff = preds[j][t] - mean_pred;
            var += diff * diff;
        }
        var /= T;
        mse /= T;
        avg_bias2 += bias * bias;
        avg_var += var;
        avg_mse += mse;
    }
    avg_bias2 /= M;
    avg_var /= M;
    avg_mse /= M;

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Polynomial degree: " << degree << "\n";
    cout << "Estimated average Bias^2: " << avg_bias2 << "\n";
    cout << "Estimated average Variance: " << avg_var << "\n";
    cout << "Irreducible noise (sigma^2): " << sigma * sigma << "\n";
    cout << "Bias^2 + Variance + sigma^2: " << (avg_bias2 + avg_var + sigma * sigma) << "\n";
    cout << "Estimated average total MSE (approx): " << avg_mse << "\n";

    return 0;
}

This program empirically estimates the bias–variance decomposition for polynomial regression. It repeatedly samples noisy training sets from y = sin(2Ļ€x) + ε, fits a degree-d polynomial via normal equations, predicts at fixed test points, and then aggregates predictions to compute bias squared and variance at each x. Finally, it averages across x to report Bias^2, Variance, and their sum plus the known noise variance σ^2. You can vary the degree to observe the classical tradeoff: low degree yields higher Bias^2 and low Variance; high degree yields lower Bias^2 and higher Variance.

Time: O(T (n (d+1)^2 + (d+1)^3) + T M (d+1))
Space: O(M T + (d+1)^2)
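
Both examples use only the C++ standard library, so any C++17 compiler should build them; assuming g++, something like g++ -O2 -std=c++17 bias_variance.cpp -o bias_variance (the file name is just a placeholder) compiles the first program, and the same command works for the ridge sweep below.
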
Effect of Ridge Regularization on Variance (Fixed High-Degree Polynomial)
#include <bits/stdc++.h>
using namespace std;

static inline double ftrue(double x) { return sin(2.0 * M_PI * x); }

static inline void poly_features(double x, int d, vector<double>& feat) {
    feat.resize(d + 1);
    double p = 1.0;
    for (int j = 0; j <= d; ++j) { feat[j] = p; p *= x; }
}

bool solve_linear_system(vector<vector<double>>& S, vector<double>& b, vector<double>& w) {
    int m = (int)S.size();
    w.assign(m, 0.0);
    vector<vector<double>> A(m, vector<double>(m + 1));
    for (int i = 0; i < m; ++i) { for (int j = 0; j < m; ++j) A[i][j] = S[i][j]; A[i][m] = b[i]; }
    for (int col = 0; col < m; ++col) {
        int piv = col; double best = fabs(A[col][col]);
        for (int r = col + 1; r < m; ++r) { double v = fabs(A[r][col]); if (v > best) { best = v; piv = r; } }
        if (best < 1e-12) return false;
        if (piv != col) swap(A[piv], A[col]);
        double pv = A[col][col];
        for (int c = col; c <= m; ++c) A[col][c] /= pv;
        for (int r = col + 1; r < m; ++r) {
            double fac = A[r][col]; if (fabs(fac) < 1e-18) continue;
            for (int c = col; c <= m; ++c) A[r][c] -= fac * A[col][c];
        }
    }
    for (int i = m - 1; i >= 0; --i) {
        double sum = A[i][m];
        for (int c = i + 1; c < m; ++c) sum -= A[i][c] * w[c];
        w[i] = sum;
    }
    return true;
}

bool fit_poly_ridge(const vector<double>& xs, const vector<double>& ys, int d, double lambda, vector<double>& w) {
    int n = (int)xs.size(); int m = d + 1;
    vector<vector<double>> S(m, vector<double>(m, 0.0));
    vector<double> b(m, 0.0), feat;
    for (int i = 0; i < n; ++i) {
        poly_features(xs[i], d, feat);
        for (int j = 0; j < m; ++j) {
            b[j] += feat[j] * ys[i];
            for (int k = 0; k < m; ++k) S[j][k] += feat[j] * feat[k];
        }
    }
    for (int j = 0; j < m; ++j) S[j][j] += lambda; // ridge
    return solve_linear_system(S, b, w);
}

static inline double predict(const vector<double>& w, double x) {
    double r = 0.0, p = 1.0; for (double c : w) { r += c * p; p *= x; } return r;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 25;         // samples per dataset
    int d = 12;         // deliberately high degree to expose variance
    int T = 150;        // resamples
    int M = 60;         // test grid points
    double sigma = 0.2; // noise std

    mt19937 rng(123);
    uniform_real_distribution<double> U(0.0, 1.0);
    normal_distribution<double> N(0.0, sigma);

    vector<double> xgrid(M); for (int j = 0; j < M; ++j) xgrid[j] = (j + 0.5) / M;

    vector<double> lambdas = {0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0};

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "d = " << d << ", n = " << n << ", sigma = " << sigma << "\n";
    cout << "lambda, avg_Bias2, avg_Var, sum(Bias2+Var+sigma^2)\n";

    for (double lambda : lambdas) {
        vector<vector<double>> preds(M, vector<double>(T, 0.0));
        for (int t = 0; t < T; ++t) {
            vector<double> xs(n), ys(n);
            for (int i = 0; i < n; ++i) { xs[i] = U(rng); ys[i] = ftrue(xs[i]) + N(rng); }
            vector<double> w; bool ok = fit_poly_ridge(xs, ys, d, lambda, w);
            if (!ok) continue;
            for (int j = 0; j < M; ++j) preds[j][t] = predict(w, xgrid[j]);
        }
        double avgB2 = 0.0, avgV = 0.0;
        for (int j = 0; j < M; ++j) {
            double fx = ftrue(xgrid[j]);
            double mu = 0.0; for (int t = 0; t < T; ++t) mu += preds[j][t]; mu /= T;
            double b = mu - fx;
            double v = 0.0; for (int t = 0; t < T; ++t) { double dlt = preds[j][t] - mu; v += dlt * dlt; } v /= T;
            avgB2 += b * b; avgV += v;
        }
        avgB2 /= M; avgV /= M;
        cout << lambda << ", " << avgB2 << ", " << avgV << ", " << (avgB2 + avgV + sigma * sigma) << "\n";
    }
    return 0;
}

This program fixes a high-degree polynomial model and sweeps ridge regularization strengths. For each λ, it repeatedly resamples data, fits ridge, and aggregates predictions to estimate average Bias^2 and Variance over a test grid. You will observe that as λ increases from 0, variance typically drops sharply while bias grows, and their sum plus σ^2 approximates the expected MSE.

Time: O(|λ| · T · (n (d+1)^2 + (d+1)^3) + |λ| · T · M (d+1))
Space: O(M T + (d+1)^2)