📚 Theory · Advanced

Statistical Learning Theory

Key Points

  • Statistical learning theory explains why a model that fits training data can still predict well on unseen data by relating true risk to empirical risk plus a complexity term.
  • Empirical Risk Minimization (ERM) picks the model that minimizes average loss on the sample, while Structural Risk Minimization (SRM) balances fit and complexity (e.g., via regularization).
  • Generalization bounds state that with high probability R(f) ≤ R̂(f) + complexity term, where the term depends on VC dimension, Rademacher complexity, or covering numbers.
  • Sample complexity scales roughly like n = O(d/ε²) to achieve an ε-accurate classifier for VC dimension d, ignoring log factors.
  • Bias-variance decomposition says expected squared error splits into Bias² + Variance + Noise, explaining underfitting vs overfitting.
  • Rademacher complexity measures how well a model class can correlate with random noise; lower values imply better generalization.
  • Cross-validation is a practical SRM tool that approximates the complexity penalty by holding out data rather than relying on theory constants.
  • C++ implementations can illustrate ERM, SRM, bias-variance, and empirical Rademacher complexity with synthetic data and simple optimizers.

Prerequisites

  • Probability and Expectation: Risk, empirical risk, and generalization bounds are expressed as expectations and probabilities.
  • Concentration Inequalities: VC and Rademacher bounds rely on inequalities like Hoeffding and McDiarmid.
  • Linear Algebra: Many models (linear predictors, norms) and proofs (norm bounds, margins) use vector and matrix concepts.
  • Optimization (Gradient Descent): Practical ERM uses convex surrogates optimized by gradient-based methods.
  • Calculus and Lipschitz Continuity: Rademacher-based bounds require Lipschitz properties to move from function classes to loss classes.
  • Machine Learning Basics: Understanding losses, training/test splits, and cross-validation grounds the theory in practice.
  • Programming in C++: Implementing ERM, SRM, and experiments requires comfort with vectors, loops, and numerics.
  • Matrix Computations: Closed-form solutions and efficient implementations often use linear algebra routines.

Detailed Explanation


01 Overview

Hook: Why does a model that performs brilliantly on your homework sometimes bomb the exam? Statistical learning theory gives a principled answer: it quantifies how training performance translates to test performance.

Concept: At its core, the theory studies the gap between true risk (expected loss on new data) and empirical risk (average loss on the sample you trained on). This gap is controlled by the richness of your hypothesis class, the set of models you allow yourself to choose from. Richer classes can fit more patterns, including noise, so they need more data to generalize. The theory introduces complexity measures such as VC dimension, Rademacher complexity, and covering numbers to capture this richness and derive probability guarantees.

Example: If you let a classifier draw arbitrarily wiggly boundaries, it can perfectly separate almost any finite training set (zero empirical risk) but may fail on new points. A linear classifier, being simpler, may not fit training data perfectly but often generalizes better with limited samples. Statistical learning theory not only formalizes this intuition but also gives sample complexity estimates like n = O(d/ε²), shows how to control complexity (regularization, margins), and explains why practical procedures like cross-validation work as proxies for theoretical trade-offs.
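To make the risk gap concrete, a standard argument shows how a uniform bound on the deviation |R − R̂| controls the excess risk of ERM. This is a sketch under the usual assumptions; here f* denotes the best predictor in the class and ε a uniform deviation bound, neither of which is specific to any one complexity measure.

% Assume sup_{f in F} |R(f) - \hat{R}(f)| <= \varepsilon (uniform convergence).
\begin{align*}
R(\hat{f}) &\le \hat{R}(\hat{f}) + \varepsilon && \text{(uniform bound applied to } \hat{f}\text{)} \\
           &\le \hat{R}(f^*) + \varepsilon     && \text{(ERM: } \hat{f} \text{ has the smallest empirical risk)} \\
           &\le R(f^*) + 2\varepsilon          && \text{(uniform bound applied to } f^*\text{)}.
\end{align*}

So whenever the whole class concentrates (ε small), the ERM solution is nearly as good as the best model available in the class.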

02 Intuition & Analogies

Imagine two students preparing for a math contest. Student A memorizes solutions to past problems (high-capacity strategy), while Student B learns the underlying theorems and problem-solving patterns (moderate capacity). On the practice set (training data), A might outperform B, but on the actual contest (test data), B’s understanding tends to transfer better. Statistical learning theory captures this difference by asking: how much of what you learned is real pattern versus memorized noise? The more ways you can explain the training data (the larger the hypothesis class), the easier it is to memorize quirks, like learning every past problem’s answer key.

Another analogy: fitting a curve through points is like choosing the right tool to trace a road. A rigid ruler (low capacity) can’t follow tight bends (high bias), while a super-flexible spline (high capacity) can follow every bump, even potholes (high variance). The sweet spot is a tool that is flexible enough to capture the road’s general shape but not so flexible that it hugs every imperfection. SRM is like carrying a toolkit of rulers with increasing flexibility and selecting the one that balances road-fit with smoothness using a validation check.

Random labels thought experiment: if your class can perfectly fit random labels, it has high capacity, akin to a parrot repeating any phrase it hears. Rademacher complexity formalizes this by measuring how well your model class can correlate with random coin flips assigned to your data. Low correlation with randomness is a sign of true learning. Bias-variance decomposition completes the picture by explaining that errors come from systematic mismatch (bias), instability across datasets (variance), and unavoidable measurement noise.

03 Formal Definition

Let (X, Y) ∼ P be an unknown distribution on inputs X and outputs Y. A learning algorithm selects f from a hypothesis class F to minimize the expected loss (risk) R(f) = E[L(f(X), Y)], which is unobservable. We use the empirical risk on a sample S = {(x_i, y_i)}_{i=1}^n: R̂(f) = (1/n) Σ_{i=1}^n L(f(x_i), y_i). Empirical Risk Minimization (ERM) chooses f̂ ∈ argmin_{f ∈ F} R̂(f). Generalization focuses on bounding R(f̂) in terms of R̂(f̂) plus a complexity term depending on F, n, and confidence δ. Complexity measures include: (1) VC dimension d_VC(F) for binary classification, yielding uniform convergence bounds of the form R(f) ≤ R̂(f) + O(√((d_VC log(n/d_VC) + log(1/δ))/n)). (2) Rademacher complexity R̂_S(F) = E_σ[sup_{f ∈ F} (1/n) Σ_i σ_i f(x_i)], which adapts to the data and applies to real-valued losses via Lipschitz contraction. (3) Covering numbers N(ε, F, ‖·‖) that control how finely F must be approximated. Structural Risk Minimization constructs nested classes F_1 ⊂ F_2 ⊂ ⋯ and selects k to minimize a bound or a validation estimate of R̂(f̂_k) + pen(k). For squared loss, the expected error decomposes as E[(f̂(X) − Y)²] = Bias² + Variance + Noise, clarifying under- and over-fitting.
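For a sense of scale, here is an illustrative plug-in to one common form of the VC bound (the same form listed under Key Formulas below). The specific numbers d = 10, n = 10000, δ = 0.05 are chosen only for illustration, and constants differ across derivations, so treat this as an order-of-magnitude check rather than a prediction.

\sqrt{\frac{d\left(\ln\tfrac{2n}{d} + 1\right) + \ln\tfrac{4}{\delta}}{n}}
= \sqrt{\frac{10(\ln 2000 + 1) + \ln 80}{10000}}
\approx \sqrt{\frac{86.0 + 4.4}{10000}} \approx 0.095

So with ten thousand samples, every classifier in this class has true risk within roughly a tenth of its empirical risk, with 95% confidence.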

04 When to Use

  • Model selection and regularization: When choosing among linear models of different feature expansions, tree depths, or neural network widths, SRM guides you to prefer models that reduce empirical risk without exploding complexity. Practically, this is cross-validation, AIC/BIC, or explicit penalties (L2/L1).
  • Deciding if you have enough data: Use VC or Rademacher-based estimates to reason about how error should shrink with more samples, or to justify data collection needs: roughly n ∝ d/ε² for classification.
  • Auditing generalization: If a model fits random labels or shows near-zero training error with unstable validation error, theory flags high capacity and potential overfitting. Estimating empirical Rademacher complexity (or running the random-labels test, sketched after this list) is an actionable diagnostic.
  • Designing algorithms: Margin-based bounds motivate large-margin classifiers (SVM), norm constraints motivate weight decay, and Lipschitz arguments motivate gradient clipping. For linear predictors, bounds in terms of ‖w‖ and ‖x‖ explain why feature scaling and norm constraints help.
  • Communicating guarantees: In safety-critical or regulated settings, high-probability bounds provide interpretable statements like “with probability at least 1−δ, test error is within ε of training error,” clarifying risks to stakeholders. Use these to set conservative thresholds and data requirements.
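A minimal sketch of the random-labels diagnostic mentioned above, assuming a linear classifier trained with logistic loss on pure-noise labels; the dimensions, learning rate, and seed are arbitrary choices for illustration. With more features than samples the classifier typically reaches near-perfect training accuracy on labels that carry no signal, which is exactly the capacity warning the theory describes.

#include <bits/stdc++.h>
using namespace std;

// Random-labels capacity test (illustrative sketch): fit a linear classifier to
// pure-noise labels; training accuracy far above 50% signals that the class
// (here, d features vs. n samples) is rich enough to memorize noise.
int main(){
    int n = 60, d = 100;              // deliberately d > n: high capacity
    mt19937 rng(1);
    normal_distribution<double> nd(0.0, 1.0);
    uniform_int_distribution<int> coin(0, 1);

    vector<vector<double>> X(n, vector<double>(d));
    vector<double> y(n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < d; ++j) X[i][j] = nd(rng);
        y[i] = coin(rng) ? 1.0 : -1.0;           // labels are pure noise
    }

    // Minimize logistic loss by gradient descent (convex surrogate for 0-1 loss)
    vector<double> w(d, 0.0);
    double lr = 0.1;
    for (int t = 0; t < 2000; ++t) {
        vector<double> grad(d, 0.0);
        for (int i = 0; i < n; ++i) {
            double margin = 0.0;
            for (int j = 0; j < d; ++j) margin += w[j] * X[i][j];
            double g = -y[i] / (1.0 + exp(y[i] * margin));   // d/dm of log(1 + e^{-y m})
            for (int j = 0; j < d; ++j) grad[j] += g * X[i][j] / n;
        }
        for (int j = 0; j < d; ++j) w[j] -= lr * grad[j];
    }

    int correct = 0;
    for (int i = 0; i < n; ++i) {
        double margin = 0.0;
        for (int j = 0; j < d; ++j) margin += w[j] * X[i][j];
        if ((margin >= 0 ? 1.0 : -1.0) == y[i]) ++correct;
    }
    cout << "Training accuracy on random labels: " << (double)correct / n << "\n";
    return 0;
}

Rerunning with d much smaller than n should pull the training accuracy back toward 0.5, since a low-capacity class cannot memorize arbitrary sign patterns.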

⚠️ Common Mistakes

  • Equating low training loss with good generalization: Without accounting for complexity, ERM can overfit. Always report held-out or cross-validation performance.
  • Misreading bounds as tight predictions: Generalization bounds often have loose constants and are worst-case. Use them for trends and design, not exact forecasts of test error.
  • Ignoring assumptions (i.i.d., bounded loss): Many results assume independent, identically distributed samples and bounded or sub-Gaussian losses. Distribution shift or heavy tails can invalidate guarantees; reconsider the setup or use robust methods.
  • Conflating sample and computational complexity: A class may have good sample complexity but be computationally hard to optimize (e.g., ERM for 0–1 loss). Choose surrogate losses and tractable algorithms.
  • Data leakage during SRM: Using test data to pick hyperparameters invalidates guarantees and inflates performance. Use proper k-fold CV or a separate validation set.
  • Overfitting the validation set: Excessive hyperparameter search can overfit validation; mitigate with nested CV (a minimal sketch follows this list) or a final untouched test set.
  • Misusing VC dimension: VC applies directly to binary classification with 0–1 loss; for regression or margin-based losses, prefer Rademacher, covering numbers, or margin bounds.
  • Forgetting scaling and norms: Norm-based bounds motivate feature scaling and weight decay; unscaled features can inflate effective capacity and harm both theory and practice.
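To make the data-leakage and validation-overfitting points concrete, here is a minimal nested cross-validation sketch. The "model" is a deliberately trivial shrunk-mean predictor (a made-up stand-in, not anything from the examples below), so the splitting discipline, not the learner, is the focus.

#include <bits/stdc++.h>
using namespace std;

// Toy "model": predict mean(y_train) / (1 + lambda); return MSE on held-out labels.
double fit_predict_mse(const vector<double>& ytr, const vector<double>& yte, double lambda) {
    double mean = accumulate(ytr.begin(), ytr.end(), 0.0) / ytr.size();
    double pred = mean / (1.0 + lambda);
    double s = 0.0;
    for (double y : yte) { double e = pred - y; s += e * e; }
    return s / yte.size();
}

int main(){
    mt19937 rng(0);
    normal_distribution<double> noise(0.0, 1.0);
    int n = 200;
    vector<double> y(n);
    for (int i = 0; i < n; ++i) y[i] = 3.0 + noise(rng);   // true mean = 3

    vector<double> lambdas = {0.0, 0.1, 1.0};
    int K_outer = 5, K_inner = 4;
    vector<int> idx(n); iota(idx.begin(), idx.end(), 0);
    shuffle(idx.begin(), idx.end(), rng);

    double outer_err = 0.0;
    for (int f = 0; f < K_outer; ++f) {
        vector<double> ytr, yte;
        for (int i = 0; i < n; ++i) (i % K_outer == f ? yte : ytr).push_back(y[idx[i]]);

        // Inner CV on the outer-training data only: lambda is chosen without touching yte
        double best_lambda = lambdas[0], best_inner = numeric_limits<double>::infinity();
        for (double lam : lambdas) {
            double inner_err = 0.0;
            for (int g = 0; g < K_inner; ++g) {
                vector<double> yin, yval;
                for (size_t i = 0; i < ytr.size(); ++i)
                    (i % K_inner == (size_t)g ? yval : yin).push_back(ytr[i]);
                inner_err += fit_predict_mse(yin, yval, lam);
            }
            inner_err /= K_inner;
            if (inner_err < best_inner) { best_inner = inner_err; best_lambda = lam; }
        }
        // Evaluate the selected lambda exactly once on the untouched outer fold
        outer_err += fit_predict_mse(ytr, yte, best_lambda);
    }
    cout << "Nested-CV estimate of generalization MSE: " << outer_err / K_outer << "\n";
    return 0;
}

Because the outer folds never influence the choice of λ, the reported error estimates the whole selection-plus-training pipeline rather than a single lucky configuration.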

Key Formulas

True Risk

R(f) = E_{(X,Y)∼P}[ L(f(X), Y) ]

Explanation: The expected loss over the unknown data distribution. It measures real-world performance but is unobservable directly.

Empirical Risk

R̂(f) = (1/n) Σ_{i=1}^{n} L(f(x_i), y_i)

Explanation: The average loss on the training sample. It is computable and serves as a proxy for true risk.

ERM Rule

f̂ = argmin_{f ∈ F} R̂(f)

Explanation: Empirical Risk Minimization selects the function with the lowest empirical risk within the hypothesis class.

Hoeffding (finite class, via union bound)

P( max_{f ∈ F} |R(f) − R̂(f)| > ε ) ≤ 2|F| exp(−2nε²)   (loss bounded in [0, 1])

Explanation: For bounded losses and finite classes, deviations between true and empirical risk are exponentially unlikely in n. The union bound extends this across the class.
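A worked example with illustrative numbers (a finite class of 1000 candidate classifiers, 5000 samples, tolerance ε = 0.05):

2|F| e^{-2 n \varepsilon^2} = 2000 \cdot e^{-2 \cdot 5000 \cdot 0.0025} = 2000 \cdot e^{-25} \approx 2.8 \times 10^{-8}

So the chance that any of the 1000 classifiers has a train/true risk gap larger than 0.05 is negligible at this sample size.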

VC Generalization Bound (0–1 loss)

R(f) ≤ R̂(f) + √( (d (ln(2n/d) + 1) + ln(4/δ)) / n )   for all f ∈ F (one common form; constants vary across texts)

Explanation: With probability at least 1−δ, every classifier in a class of VC dimension d has true risk within this margin of empirical risk.

Sample Complexity (VC)

n = O( (d + ln(1/δ)) / ε² )

Explanation: To achieve error at most ε with confidence 1−δ, the required sample size grows linearly with d and as 1/ε² (up to logs).
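Ignoring constants and log factors, a rough plug-in with illustrative numbers (d = 10, ε = 0.1, δ = 0.05):

n \approx \frac{d + \ln(1/\delta)}{\varepsilon^{2}} = \frac{10 + \ln 20}{0.01} \approx \frac{13.0}{0.01} = 1300

So a few thousand samples is a sensible order of magnitude once the hidden constants are accounted for.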

Empirical Rademacher Complexity

R̂_S(F) = E_σ[ sup_{f ∈ F} (1/n) Σ_{i=1}^{n} σ_i f(x_i) ],   σ_i uniform on {−1, +1}

Explanation: Measures how well the class can fit random ±1 signs on the given sample, adapting to data geometry.

Rademacher Bound (L-Lipschitz loss)

R(f) ≤ R̂(f) + 2L · R̂_S(F) + 3√( ln(2/δ) / (2n) )   (loss bounded in [0, 1]; constants vary across texts)

Explanation: If the loss is L-Lipschitz and bounded, true risk is close to empirical risk plus a term scaling with Rademacher complexity and confidence.

Linear Predictors (Norm-Bounded)

R̂_S({x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ B}) = (B/n) E_σ‖ Σ_{i=1}^{n} σ_i x_i ‖₂ ≤ B·R / √n,   where R = max_i ‖x_i‖₂

Explanation: For linear functions with ‖w‖₂ ≤ B and ‖x_i‖₂ ≤ R, Rademacher complexity decays as 1/√n. This motivates L2 regularization and feature scaling.
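A rough plug-in matching the settings of the Rademacher estimator example below (standard normal features in d = 50 dimensions, so ‖x_i‖₂ is around √50; the exact maximum is somewhat larger):

% B = 1, R ~ sqrt(50) ~ 7.1, n = 500:
\frac{B R}{\sqrt{n}} \approx \frac{1 \cdot 7.1}{\sqrt{500}} \approx \frac{7.1}{22.4} \approx 0.32

Quadrupling the sample size to n = 2000 would halve this capacity term.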

Bias–Variance Decomposition

E[(f̂(X) − Y)²] = (E[f̂(X)] − f*(X))² + Var(f̂(X)) + σ²,   where Y = f*(X) + noise with variance σ²

Explanation: For squared loss, expected error splits into irreducible noise, variance across training sets, and squared bias from model misspecification.
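The decomposition follows from adding and subtracting E[f̂(x)] and using the independence of the training set and the test noise; a standard sketch, conditioning on a fixed test input x:

% Y = f^*(x) + \eta with E[\eta] = 0, Var(\eta) = \sigma^2, and \hat{f} independent of \eta:
\begin{align*}
E\big[(\hat{f}(x) - Y)^2\big]
&= E\big[(\hat{f}(x) - f^*(x))^2\big] + \sigma^2 \\
&= \big(E[\hat{f}(x)] - f^*(x)\big)^2 + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big] + \sigma^2 \\
&= \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Noise}.
\end{align*}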

Sauer–Shelah Lemma

Π_F(n) ≤ Σ_{i=0}^{d} C(n, i) ≤ (en/d)^d   for n ≥ d

Explanation: The number of distinct labelings realizable on n points by a class of VC dimension d is polynomially bounded, enabling VC generalization bounds.
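For example, with VC dimension d = 3 and n = 10 points:

\Pi_F(10) \le \binom{10}{0} + \binom{10}{1} + \binom{10}{2} + \binom{10}{3}
= 1 + 10 + 45 + 120 = 176 \ll 2^{10} = 1024

Far fewer than all 2^n labelings are realizable once n exceeds the VC dimension, which is what makes the union-bound argument work for infinite classes.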

SRM Objective

f̂ = argmin_{k, f ∈ F_k} [ R̂(f) + pen(F_k, n) ]

Explanation: Augments empirical risk with a penalty that increases with model complexity (e.g., norm or number of parameters) to find a good trade-off.

Complexity Analysis

Statistical learning theory separates sample complexity from computational complexity. Sample complexity describes how many examples n are needed to achieve accuracy ε with confidence 1−δ: for binary classification with VC dimension d, n typically scales like O((d log(1/ε) + log(1/δ))/ε²). Rademacher-based bounds replace d with a data-dependent quantity that often scales as O(BR/√n) for linear predictors with weight norm bound B and feature radius R. Thus, doubling the data roughly shrinks the estimation error by a factor of √2 when other factors are fixed.

Computationally, ERM over complex classes may be intractable (e.g., minimizing 0–1 loss is NP-hard). Practical learners minimize convex surrogates (hinge, logistic, squared loss) with algorithms like gradient descent or stochastic gradient descent (SGD). For a dataset of size n and d features, one pass of batch gradient descent costs O(nd) time and O(d) space; T iterations cost O(Tnd). Cross-validation for SRM multiplies this cost by k (folds) and by the number of hyperparameters evaluated. For example, k-fold CV over L choices of λ for ridge regression costs O(kLTnd). Estimating empirical Rademacher complexity for linear predictors via Monte Carlo requires drawing M Rademacher vectors and computing M norms of Σ_i σ_i x_i: time O(Mnd) and space O(d). Bias–variance experiments with R repetitions over m test points cost O(RTnd + Rmd). Overall, while theoretical guarantees guide design, practical implementations must balance optimization time, memory, and the overhead of validation or complexity estimation.
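A tiny back-of-the-envelope calculator for the operation counts above, using parameter values drawn from the examples in this section; the counts are rough multiply-add tallies, not timings, and the constants are illustrative.

#include <bits/stdc++.h>
using namespace std;

// Rough operation counts (multiply-adds) for the experiments in this section.
// Real running time also depends on constants, memory traffic, and the compiler.
int main(){
    long long n_train = 960, d = 20;            // training split used for CV in the first example
    long long k = 5, L = 5, T = 300;            // folds, lambda-grid size, GD iterations
    long long M = 200, n_rad = 500, d_rad = 50; // Rademacher estimator settings

    long long cv_ops    = k * L * T * n_train * d;  // O(k L T n d) for k-fold CV
    long long final_ops = 500LL * n_train * d;      // final fit uses 500 iterations
    long long rad_ops   = M * n_rad * d_rad;        // O(M n d) Monte Carlo estimate

    cout << "CV cost         ~ " << cv_ops    << " ops\n";
    cout << "Final fit cost  ~ " << final_ops << " ops\n";
    cout << "Rademacher cost ~ " << rad_ops   << " ops\n";
    return 0;
}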

Code Examples

ERM and SRM with Ridge Regression and k-fold Cross-Validation
#include <bits/stdc++.h>
using namespace std;

struct Dataset {
    vector<vector<double>> X; // n x d
    vector<double> y;         // n
};

// Generate synthetic linear data: y = w_true^T x + noise
Dataset make_linear_data(int n, int d, double noise_std, unsigned seed=42) {
    mt19937 rng(seed);
    normal_distribution<double> nd(0.0, 1.0);
    normal_distribution<double> noise(0.0, noise_std);

    vector<double> w_true(d, 0.0);
    for (int j = 0; j < d; ++j) w_true[j] = nd(rng); // random ground truth

    Dataset data; data.X.assign(n, vector<double>(d)); data.y.assign(n, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < d; ++j) data.X[i][j] = nd(rng);
        double yi = 0.0;
        for (int j = 0; j < d; ++j) yi += w_true[j] * data.X[i][j];
        yi += noise(rng);
        data.y[i] = yi;
    }
    return data;
}

// Compute predictions: y_hat = X w
vector<double> predict(const vector<vector<double>>& X, const vector<double>& w) {
    int n = (int)X.size(); int d = (int)w.size();
    vector<double> yhat(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double s = 0.0; for (int j = 0; j < d; ++j) s += X[i][j] * w[j];
        yhat[i] = s;
    }
    return yhat;
}

// Mean squared error (empirical risk for squared loss)
double mse(const vector<vector<double>>& X, const vector<double>& y, const vector<double>& w) {
    int n = (int)X.size();
    vector<double> yhat = predict(X, w);
    double s = 0.0; for (int i = 0; i < n; ++i) { double e = yhat[i] - y[i]; s += e*e; }
    return s / n;
}

// Batch gradient descent for ridge regression: minimize (1/n)||Xw - y||^2 + lambda ||w||^2
vector<double> ridge_gd(const vector<vector<double>>& X, const vector<double>& y,
                        double lambda, double lr=0.05, int iters=200) {
    int n = (int)X.size(); int d = (int)X[0].size();
    vector<double> w(d, 0.0);
    for (int t = 0; t < iters; ++t) {
        vector<double> grad(d, 0.0);
        for (int i = 0; i < n; ++i) {
            double yi_hat = 0.0; for (int j = 0; j < d; ++j) yi_hat += X[i][j]*w[j];
            double r = yi_hat - y[i];
            for (int j = 0; j < d; ++j) grad[j] += (2.0/n) * r * X[i][j];
        }
        for (int j = 0; j < d; ++j) grad[j] += 2.0 * lambda * w[j]; // ridge penalty gradient
        for (int j = 0; j < d; ++j) w[j] -= lr * grad[j];
    }
    return w;
}

// k-fold cross-validation to select lambda (SRM)
double kfold_cv(const Dataset& data, const vector<double>& lambdas, int k, double lr, int iters, double& best_lambda) {
    int n = (int)data.X.size();
    vector<int> idx(n); iota(idx.begin(), idx.end(), 0);
    mt19937 rng(123);
    shuffle(idx.begin(), idx.end(), rng);

    vector<vector<int>> folds(k);
    for (int i = 0; i < n; ++i) folds[i % k].push_back(idx[i]);

    double best_cv = numeric_limits<double>::infinity();
    best_lambda = lambdas[0];

    for (double lam : lambdas) {
        double cv_err = 0.0;
        for (int f = 0; f < k; ++f) {
            // Build train/val split
            vector<vector<double>> Xtr, Xval; vector<double> ytr, yval;
            for (int g = 0; g < k; ++g) {
                for (int id : folds[g]) {
                    if (g == f) { Xval.push_back(data.X[id]); yval.push_back(data.y[id]); }
                    else { Xtr.push_back(data.X[id]); ytr.push_back(data.y[id]); }
                }
            }
            vector<double> w = ridge_gd(Xtr, ytr, lam, lr, iters);
            cv_err += mse(Xval, yval, w);
        }
        cv_err /= k;
        if (cv_err < best_cv) { best_cv = cv_err; best_lambda = lam; }
    }
    return best_cv;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 1200, d = 20; double noise_std = 0.5;
    Dataset data = make_linear_data(n, d, noise_std, 7);

    // Train/test split
    int n_train = (int)(0.8 * n);
    vector<vector<double>> Xtr(data.X.begin(), data.X.begin() + n_train);
    vector<double> ytr(data.y.begin(), data.y.begin() + n_train);
    vector<vector<double>> Xte(data.X.begin() + n_train, data.X.end());
    vector<double> yte(data.y.begin() + n_train, data.y.end());

    Dataset dtr{Xtr, ytr};

    vector<double> lambdas = {0.0, 1e-3, 1e-2, 1e-1, 1.0};
    double best_lambda = lambdas[0];
    double cv_err = kfold_cv(dtr, lambdas, 5, 0.05, 300, best_lambda);

    vector<double> w = ridge_gd(Xtr, ytr, best_lambda, 0.05, 500);
    double train_mse = mse(Xtr, ytr, w);
    double test_mse = mse(Xte, yte, w);

    cout << fixed << setprecision(6);
    cout << "Best lambda (SRM): " << best_lambda << "\n";
    cout << "CV MSE: " << cv_err << "\n";
    cout << "Train MSE (empirical risk): " << train_mse << "\n";
    cout << "Test MSE (proxy for true risk): " << test_mse << "\n";
    return 0;
}

We generate synthetic linear data and minimize empirical risk (MSE) with ridge (L2) regularization using batch gradient descent. We perform k-fold cross-validation over a grid of lambdas to implement SRM: pick the class/penalty that best balances fit and complexity. Finally, we train on the full training set with the chosen lambda and report training (empirical) and test (proxy for true) risks.

Time: O(k · L · T · n · d) for cross-validation (k folds, L lambdas, T iterations), plus O(T · n · d) for final training.
Space: O(n · d) to store data and O(d) for model/gradients.
Bias–Variance Decomposition Experiment (1D Polynomial Ridge Regression)
#include <bits/stdc++.h>
using namespace std;

// True function and noise
inline double f_true(double x) { return sin(2.0 * M_PI * x); }

struct Sample1D { vector<double> x, y; };

Sample1D make_data(int n, double noise_std, unsigned seed) {
    mt19937 rng(seed); uniform_real_distribution<double> unif(0.0, 1.0);
    normal_distribution<double> noise(0.0, noise_std);
    Sample1D s; s.x.resize(n); s.y.resize(n);
    for (int i = 0; i < n; ++i) { s.x[i] = unif(rng); s.y[i] = f_true(s.x[i]) + noise(rng); }
    return s;
}

// Build polynomial features up to degree p (including bias term)
vector<vector<double>> poly_features(const vector<double>& x, int p) {
    int n = (int)x.size(); vector<vector<double>> X(n, vector<double>(p+1, 1.0));
    for (int i = 0; i < n; ++i) {
        double powx = 1.0;
        for (int j = 1; j <= p; ++j) { powx *= x[i]; X[i][j] = powx; }
    }
    return X;
}

vector<double> predict(const vector<vector<double>>& X, const vector<double>& w) {
    int n = (int)X.size(); int d = (int)w.size(); vector<double> yhat(n, 0.0);
    for (int i = 0; i < n; ++i) { double s = 0.0; for (int j = 0; j < d; ++j) s += X[i][j]*w[j]; yhat[i] = s; }
    return yhat;
}

double mse(const vector<double>& yhat, const vector<double>& y) {
    int n = (int)y.size(); double s = 0.0; for (int i = 0; i < n; ++i) { double e = yhat[i]-y[i]; s += e*e; } return s/n;
}

vector<double> ridge_gd(const vector<vector<double>>& X, const vector<double>& y, double lambda, double lr=0.05, int iters=400) {
    int n = (int)X.size(); int d = (int)X[0].size(); vector<double> w(d, 0.0);
    for (int t = 0; t < iters; ++t) {
        vector<double> grad(d, 0.0);
        for (int i = 0; i < n; ++i) {
            double yi_hat = 0.0; for (int j = 0; j < d; ++j) yi_hat += X[i][j]*w[j];
            double r = yi_hat - y[i];
            for (int j = 0; j < d; ++j) grad[j] += (2.0/n) * r * X[i][j];
        }
        for (int j = 0; j < d; ++j) grad[j] += 2.0 * lambda * w[j];
        for (int j = 0; j < d; ++j) w[j] -= lr * grad[j];
    }
    return w;
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    int R = 100;             // repetitions (datasets)
    int n = 40;              // sample size per dataset
    int p = 9;               // polynomial degree (capacity)
    double lambda = 1e-2;    // regularization (SRM)
    double noise_std = 0.15; // known noise level

    // Fixed test grid to estimate bias/variance across x
    int m = 200; vector<double> xt(m); for (int i=0;i<m;++i) xt[i] = (i+0.5)/m;
    auto Xt = poly_features(xt, p);
    vector<double> fstar(m); for (int i=0;i<m;++i) fstar[i] = f_true(xt[i]);

    vector<vector<double>> preds(R, vector<double>(m, 0.0));
    vector<double> test_mse_runs(R, 0.0);

    for (int r = 0; r < R; ++r) {
        auto s = make_data(n, noise_std, 777 + r);
        auto X = poly_features(s.x, p);
        vector<double> w = ridge_gd(X, s.y, lambda, 0.05, 400);
        vector<double> yhat_t = predict(Xt, w);
        preds[r] = yhat_t;
        // approximate test MSE using noisy labels simulated on grid
        vector<double> y_noisy(m);
        mt19937 rng(999 + r); normal_distribution<double> noise(0.0, noise_std);
        for (int i = 0; i < m; ++i) y_noisy[i] = fstar[i] + noise(rng);
        test_mse_runs[r] = mse(yhat_t, y_noisy);
    }

    // Compute bias^2 and variance across the R models at each x
    vector<double> mean_pred(m, 0.0), var_pred(m, 0.0);
    for (int i = 0; i < m; ++i) {
        for (int r = 0; r < R; ++r) mean_pred[i] += preds[r][i];
        mean_pred[i] /= R;
        double s2 = 0.0; for (int r = 0; r < R; ++r) { double d = preds[r][i] - mean_pred[i]; s2 += d*d; }
        var_pred[i] = s2 / R;
    }

    // Aggregate components
    double bias2 = 0.0, var = 0.0;
    for (int i=0;i<m;++i){ double b = mean_pred[i] - fstar[i]; bias2 += b*b; var += var_pred[i]; }
    bias2 /= m; var /= m; double noise2 = noise_std*noise_std;
    double avg_test_mse = accumulate(test_mse_runs.begin(), test_mse_runs.end(), 0.0) / R;

    cout << fixed << setprecision(6);
    cout << "Bias^2: " << bias2 << "\n";
    cout << "Variance: " << var << "\n";
    cout << "Noise variance: " << noise2 << "\n";
    cout << "Bias^2 + Var + Noise: " << (bias2 + var + noise2) << "\n";
    cout << "Avg Test MSE (approx): " << avg_test_mse << "\n";
    return 0;
}

We fit many models on different noisy datasets and evaluate predictions on a fixed grid. For each input, we compute the mean prediction (to estimate bias) and the variance across runs. Averaging over the grid yields bias² and variance, and adding the known noise variance approximates the expected squared error. This numerically illustrates the bias–variance trade-off and the decomposition.

Time: O(R · (T · n · p + m · p)), dominated by training across R repetitions.
Space: O(n · p + R · m) for features and storing predictions across runs.
Empirical Rademacher Complexity Estimator for Linear Predictors
#include <bits/stdc++.h>
using namespace std;

struct Matrix { vector<vector<double>> X; }; // n x d

Matrix make_data(int n, int d, unsigned seed=3) {
    mt19937 rng(seed); normal_distribution<double> nd(0.0, 1.0);
    Matrix M; M.X.assign(n, vector<double>(d, 0.0));
    for (int i=0;i<n;++i){ for (int j=0;j<d;++j){ M.X[i][j] = nd(rng); } }
    return M;
}

double l2norm(const vector<double>& v){ double s=0.0; for(double x:v) s+=x*x; return sqrt(s); }

// Estimate \hat{R}_S(F_B) = (B/n) E_sigma || sum_i sigma_i x_i ||_2 via Monte Carlo
// Optionally report the bound BR/sqrt(n) with R = max_i ||x_i||_2
int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    int n = 500, d = 50; double B = 1.0; int M = 200; // trials
    Matrix data = make_data(n, d, 11);

    // Compute R = max ||x_i||
    double R = 0.0; for (int i=0;i<n;++i){ R = max(R, l2norm(data.X[i])); }

    mt19937 rng(1234); uniform_int_distribution<int> coin(0,1);
    double sum_complexity = 0.0;
    vector<double> s(d, 0.0);

    for (int t = 0; t < M; ++t) {
        fill(s.begin(), s.end(), 0.0);
        for (int i = 0; i < n; ++i) {
            int bit = coin(rng); double sigma = bit ? 1.0 : -1.0;
            for (int j = 0; j < d; ++j) s[j] += sigma * data.X[i][j];
        }
        double val = (B / n) * l2norm(s);
        sum_complexity += val;
    }
    double rademacher_est = sum_complexity / M;
    double upper_bound = (B * R) / sqrt((double)n);

    cout << fixed << setprecision(6);
    cout << "Empirical Rademacher (linear, norm<=B): " << rademacher_est << "\n";
    cout << "Theoretical upper bound BR/sqrt(n): " << upper_bound << "\n";
    return 0;
}

For the class of linear predictors with L2-norm at most B, the empirical Rademacher complexity equals (B/n) times the expected L2 norm of the signed sum of examples. We approximate this expectation by averaging over M random sign vectors. We also report the standard bound BR/√n using the maximum example norm R, demonstrating how capacity shrinks with more data and tighter norm constraints.

Time: O(M · n · d) to accumulate signed sums over M trials.
Space: O(n · d) to store data and O(d) for the accumulator vector.