📚 Theory · Advanced

Statistical Learning Theory

Key Points

  • Statistical learning theory explains why a model that fits training data can still predict well on unseen data by relating true risk to empirical risk plus a complexity term.
  • Empirical Risk Minimization (ERM) picks the model that minimizes average loss on the sample, while Structural Risk Minimization (SRM) balances fit and complexity (e.g., via regularization).
  • Generalization bounds state that with high probability R(f) ≤ R̂(f) + complexity term, where the term depends on VC dimension, Rademacher complexity, or covering numbers.
  • Sample complexity scales roughly like n = O(d/ε²) to achieve an ε-accurate classifier for VC dimension d, ignoring log factors.
  • Bias-variance decomposition says expected squared error splits into Bias² + Variance + Noise, explaining underfitting vs overfitting.
  • Rademacher complexity measures how well a model class can correlate with random noise; lower values imply better generalization.
  • Cross-validation is a practical SRM tool that approximates the complexity penalty by holding out data rather than relying on theory constants.
  • C++ implementations can illustrate ERM, SRM, bias-variance, and empirical Rademacher complexity with synthetic data and simple optimizers.

Prerequisites

  • Probability and Expectation: Risk, empirical risk, and generalization bounds are expressed as expectations and probabilities.
  • Concentration Inequalities: VC and Rademacher bounds rely on inequalities like Hoeffding and McDiarmid.
  • Linear Algebra: Many models (linear predictors, norms) and proofs (norm bounds, margins) use vector and matrix concepts.
  • Optimization (Gradient Descent): Practical ERM uses convex surrogates optimized by gradient-based methods.
  • Calculus and Lipschitz Continuity: Rademacher-based bounds require Lipschitz properties to move from function classes to loss classes.
  • Machine Learning Basics: Understanding losses, training/test splits, and cross-validation grounds the theory in practice.
  • Programming in C++: Implementing ERM, SRM, and experiments requires comfort with vectors, loops, and numerics.
  • Matrix Computations: Closed-form solutions and efficient implementations often use linear algebra routines.

Detailed Explanation


01 Overview

Hook: Why does a model that performs brilliantly on your homework sometimes bomb the exam? Statistical learning theory gives a principled answer: it quantifies how training performance translates to test performance.

Concept: At its core, the theory studies the gap between true risk (expected loss on new data) and empirical risk (average loss on the sample you trained on). This gap is controlled by the richness of your hypothesis class, the set of models you allow yourself to choose from. Richer classes can fit more patterns, including noise, so they need more data to generalize. The theory introduces complexity measures such as VC dimension, Rademacher complexity, and covering numbers to capture this richness and derive probability guarantees.

Example: If you let a classifier draw arbitrarily wiggly boundaries, it can perfectly separate almost any finite training set (zero empirical risk) but may fail on new points. A linear classifier, being simpler, may not fit training data perfectly but often generalizes better with limited samples. Statistical learning theory not only formalizes this intuition but also gives sample complexity estimates like n = O(d/ε²), shows how to control complexity (regularization, margins), and explains why practical procedures like cross-validation work as proxies for theoretical trade-offs.
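To make the risk gap concrete, a standard argument shows how a uniform bound on the deviation |R − R̂| controls the excess risk of ERM. This is a sketch under the usual assumptions; here f* denotes the best predictor in the class and ε a uniform deviation bound, neither of which is specific to any one complexity measure.

% Assume sup_{f in F} |R(f) - \hat{R}(f)| <= \varepsilon (uniform convergence).
\begin{align*}
R(\hat{f}) &\le \hat{R}(\hat{f}) + \varepsilon && \text{(uniform bound applied to } \hat{f}\text{)} \\
           &\le \hat{R}(f^*) + \varepsilon     && \text{(ERM: } \hat{f} \text{ has the smallest empirical risk)} \\
           &\le R(f^*) + 2\varepsilon          && \text{(uniform bound applied to } f^*\text{)}.
\end{align*}

So whenever the whole class concentrates (ε small), the ERM solution is nearly as good as the best model available in the class.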

02 Intuition & Analogies

Imagine two students preparing for a math contest. Student A memorizes solutions to past problems (high-capacity strategy), while Student B learns the underlying theorems and problem-solving patterns (moderate capacity). On the practice set (training data), A might outperform B, but on the actual contest (test data), B’s understanding tends to transfer better. Statistical learning theory captures this difference by asking: how much of what you learned is real pattern versus memorized noise? The more ways you can explain the training data (the larger the hypothesis class), the easier it is to memorize quirks, like learning every past problem’s answer key.

Another analogy: fitting a curve through points is like choosing the right tool to trace a road. A rigid ruler (low capacity) can’t follow tight bends (high bias), while a super-flexible spline (high capacity) can follow every bump, even potholes (high variance). The sweet spot is a tool that is flexible enough to capture the road’s general shape but not so flexible that it hugs every imperfection. SRM is like carrying a toolkit of rulers with increasing flexibility and selecting the one that balances road-fit with smoothness using a validation check.

Random labels thought experiment: if your class can perfectly fit random labels, it has high capacity, akin to a parrot repeating any phrase it hears. Rademacher complexity formalizes this by measuring how well your model class can correlate with random coin flips assigned to your data. Low correlation with randomness is a sign of true learning. Bias-variance decomposition completes the picture by explaining that errors come from systematic mismatch (bias), instability across datasets (variance), and unavoidable measurement noise.

03 Formal Definition

Let (X, Y) ∼ P be an unknown distribution on inputs X and outputs Y. A learning algorithm selects f from a hypothesis class F to minimize the expected loss (risk) R(f) = E[L(f(X), Y)], which is unobservable. We use the empirical risk on a sample S = {(x_i, y_i)}_{i=1}^n: R̂(f) = (1/n) Σ_{i=1}^n L(f(x_i), y_i). Empirical Risk Minimization (ERM) chooses f̂ ∈ argmin_{f ∈ F} R̂(f). Generalization focuses on bounding R(f̂) in terms of R̂(f̂) plus a complexity term depending on F, n, and confidence δ. Complexity measures include: (1) VC dimension d_VC(F) for binary classification, yielding uniform convergence bounds of the form R(f) ≤ R̂(f) + O(√((d_VC log(n/d_VC) + log(1/δ))/n)). (2) Rademacher complexity R̂_S(F) = E_σ[sup_{f ∈ F} (1/n) Σ_i σ_i f(x_i)], which adapts to the data and applies to real-valued losses via Lipschitz contraction. (3) Covering numbers N(ε, F, ‖·‖) that control how finely F must be approximated. Structural Risk Minimization constructs nested classes F_1 ⊂ F_2 ⊂ ⋯ and selects k to minimize a bound or a validation estimate of R̂(f̂_k) + pen(k). For squared loss, the expected error decomposes as E[(f̂(X) − Y)²] = Bias² + Variance + Noise, clarifying under- and over-fitting.
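For a sense of scale, here is an illustrative plug-in to one common form of the VC bound (the same form listed under Key Formulas below). The specific numbers d = 10, n = 10000, δ = 0.05 are chosen only for illustration, and constants differ across derivations, so treat this as an order-of-magnitude check rather than a prediction.

\sqrt{\frac{d\left(\ln\tfrac{2n}{d} + 1\right) + \ln\tfrac{4}{\delta}}{n}}
= \sqrt{\frac{10(\ln 2000 + 1) + \ln 80}{10000}}
\approx \sqrt{\frac{86.0 + 4.4}{10000}} \approx 0.095

So with ten thousand samples, every classifier in this class has true risk within roughly a tenth of its empirical risk, with 95% confidence.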

04 When to Use

  • Model selection and regularization: When choosing among linear models of different feature expansions, tree depths, or neural network widths, SRM guides you to prefer models that reduce empirical risk without exploding complexity. Practically, this is cross-validation, AIC/BIC, or explicit penalties (L2/L1).
  • Deciding if you have enough data: Use VC or Rademacher-based estimates to reason about how error should shrink with more samples, or to justify data collection needs: roughly n ∝ d/ε² for classification.
  • Auditing generalization: If a model fits random labels or shows near-zero training error with unstable validation error, theory flags high capacity and potential overfitting. Estimating empirical Rademacher complexity (or running the random-labels test, sketched after this list) is an actionable diagnostic.
  • Designing algorithms: Margin-based bounds motivate large-margin classifiers (SVM), norm constraints motivate weight decay, and Lipschitz arguments motivate gradient clipping. For linear predictors, bounds in terms of ‖w‖ and ‖x‖ explain why feature scaling and norm constraints help.
  • Communicating guarantees: In safety-critical or regulated settings, high-probability bounds provide interpretable statements like “with probability at least 1−δ, test error is within ε of training error,” clarifying risks to stakeholders. Use these to set conservative thresholds and data requirements.
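A minimal sketch of the random-labels diagnostic mentioned above, assuming a linear classifier trained with logistic loss on pure-noise labels; the dimensions, learning rate, and seed are arbitrary choices for illustration. With more features than samples the classifier typically reaches near-perfect training accuracy on labels that carry no signal, which is exactly the capacity warning the theory describes.

#include <bits/stdc++.h>
using namespace std;

// Random-labels capacity test (illustrative sketch): fit a linear classifier to
// pure-noise labels; training accuracy far above 50% signals that the class
// (here, d features vs. n samples) is rich enough to memorize noise.
int main(){
    int n = 60, d = 100;              // deliberately d > n: high capacity
    mt19937 rng(1);
    normal_distribution<double> nd(0.0, 1.0);
    uniform_int_distribution<int> coin(0, 1);

    vector<vector<double>> X(n, vector<double>(d));
    vector<double> y(n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < d; ++j) X[i][j] = nd(rng);
        y[i] = coin(rng) ? 1.0 : -1.0;           // labels are pure noise
    }

    // Minimize logistic loss by gradient descent (convex surrogate for 0-1 loss)
    vector<double> w(d, 0.0);
    double lr = 0.1;
    for (int t = 0; t < 2000; ++t) {
        vector<double> grad(d, 0.0);
        for (int i = 0; i < n; ++i) {
            double margin = 0.0;
            for (int j = 0; j < d; ++j) margin += w[j] * X[i][j];
            double g = -y[i] / (1.0 + exp(y[i] * margin));   // d/dm of log(1 + e^{-y m})
            for (int j = 0; j < d; ++j) grad[j] += g * X[i][j] / n;
        }
        for (int j = 0; j < d; ++j) w[j] -= lr * grad[j];
    }

    int correct = 0;
    for (int i = 0; i < n; ++i) {
        double margin = 0.0;
        for (int j = 0; j < d; ++j) margin += w[j] * X[i][j];
        if ((margin >= 0 ? 1.0 : -1.0) == y[i]) ++correct;
    }
    cout << "Training accuracy on random labels: " << (double)correct / n << "\n";
    return 0;
}

Rerunning with d much smaller than n should pull the training accuracy back toward 0.5, since a low-capacity class cannot memorize arbitrary sign patterns.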

⚠️ Common Mistakes

  • Equating low training loss with good generalization: Without accounting for complexity, ERM can overfit. Always report held-out or cross-validation performance.
  • Misreading bounds as tight predictions: Generalization bounds often have loose constants and are worst-case. Use them for trends and design, not exact forecasts of test error.
  • Ignoring assumptions (i.i.d., bounded loss): Many results assume independent, identically distributed samples and bounded or sub-Gaussian losses. Distribution shift or heavy tails can invalidate guarantees; reconsider the setup or use robust methods.
  • Conflating sample and computational complexity: A class may have good sample complexity but be computationally hard to optimize (e.g., ERM for 0–1 loss). Choose surrogate losses and tractable algorithms.
  • Data leakage during SRM: Using test data to pick hyperparameters invalidates guarantees and inflates performance. Use proper k-fold CV or a separate validation set.
  • Overfitting the validation set: Excessive hyperparameter search can overfit validation; mitigate with nested CV (a minimal sketch follows this list) or a final untouched test set.
  • Misusing VC dimension: VC applies directly to binary classification with 0–1 loss; for regression or margin-based losses, prefer Rademacher, covering numbers, or margin bounds.
  • Forgetting scaling and norms: Norm-based bounds motivate feature scaling and weight decay; unscaled features can inflate effective capacity and harm both theory and practice.
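To make the data-leakage and validation-overfitting points concrete, here is a minimal nested cross-validation sketch. The "model" is a deliberately trivial shrunk-mean predictor (a made-up stand-in, not anything from the examples below), so the splitting discipline, not the learner, is the focus.

#include <bits/stdc++.h>
using namespace std;

// Toy "model": predict mean(y_train) / (1 + lambda); return MSE on held-out labels.
double fit_predict_mse(const vector<double>& ytr, const vector<double>& yte, double lambda) {
    double mean = accumulate(ytr.begin(), ytr.end(), 0.0) / ytr.size();
    double pred = mean / (1.0 + lambda);
    double s = 0.0;
    for (double y : yte) { double e = pred - y; s += e * e; }
    return s / yte.size();
}

int main(){
    mt19937 rng(0);
    normal_distribution<double> noise(0.0, 1.0);
    int n = 200;
    vector<double> y(n);
    for (int i = 0; i < n; ++i) y[i] = 3.0 + noise(rng);   // true mean = 3

    vector<double> lambdas = {0.0, 0.1, 1.0};
    int K_outer = 5, K_inner = 4;
    vector<int> idx(n); iota(idx.begin(), idx.end(), 0);
    shuffle(idx.begin(), idx.end(), rng);

    double outer_err = 0.0;
    for (int f = 0; f < K_outer; ++f) {
        vector<double> ytr, yte;
        for (int i = 0; i < n; ++i) (i % K_outer == f ? yte : ytr).push_back(y[idx[i]]);

        // Inner CV on the outer-training data only: lambda is chosen without touching yte
        double best_lambda = lambdas[0], best_inner = numeric_limits<double>::infinity();
        for (double lam : lambdas) {
            double inner_err = 0.0;
            for (int g = 0; g < K_inner; ++g) {
                vector<double> yin, yval;
                for (size_t i = 0; i < ytr.size(); ++i)
                    (i % K_inner == (size_t)g ? yval : yin).push_back(ytr[i]);
                inner_err += fit_predict_mse(yin, yval, lam);
            }
            inner_err /= K_inner;
            if (inner_err < best_inner) { best_inner = inner_err; best_lambda = lam; }
        }
        // Evaluate the selected lambda exactly once on the untouched outer fold
        outer_err += fit_predict_mse(ytr, yte, best_lambda);
    }
    cout << "Nested-CV estimate of generalization MSE: " << outer_err / K_outer << "\n";
    return 0;
}

Because the outer folds never influence the choice of λ, the reported error estimates the whole selection-plus-training pipeline rather than a single lucky configuration.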

Key Formulas

True Risk

R(f) = E_{(X,Y)∼P}[ L(f(X), Y) ]

Explanation: The expected loss over the unknown data distribution. It measures real-world performance but is unobservable directly.

Empirical Risk

R̂(f) = (1/n) Σ_{i=1}^{n} L(f(x_i), y_i)

Explanation: The average loss on the training sample. It is computable and serves as a proxy for true risk.

ERM Rule

f̂ = argmin_{f ∈ F} R̂(f)

Explanation: Empirical Risk Minimization selects the function with the lowest empirical risk within the hypothesis class.

Hoeffding (finite class, via union bound)

P( max_{f ∈ F} |R(f) − R̂(f)| > ε ) ≤ 2|F| exp(−2nε²)   (loss bounded in [0, 1])

Explanation: For bounded losses and finite classes, deviations between true and empirical risk are exponentially unlikely in n. The union bound extends this across the class.
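A worked example with illustrative numbers (a finite class of 1000 candidate classifiers, 5000 samples, tolerance ε = 0.05):

2|F| e^{-2 n \varepsilon^2} = 2000 \cdot e^{-2 \cdot 5000 \cdot 0.0025} = 2000 \cdot e^{-25} \approx 2.8 \times 10^{-8}

So the chance that any of the 1000 classifiers has a train/true risk gap larger than 0.05 is negligible at this sample size.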

VC Generalization Bound (0–1 loss)

R(f) ≤ R̂(f) + √( (d (ln(2n/d) + 1) + ln(4/δ)) / n )   for all f ∈ F (one common form; constants vary across texts)

Explanation: With probability at least 1−δ, every classifier in a class of VC dimension d has true risk within this margin of empirical risk.

Sample Complexity (VC)

n = O( (d + ln(1/δ)) / ε² )

Explanation: To achieve error at most ε with confidence 1−δ, the required sample size grows linearly with d and as 1/ε² (up to logs).
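Ignoring constants and log factors, a rough plug-in with illustrative numbers (d = 10, ε = 0.1, δ = 0.05):

n \approx \frac{d + \ln(1/\delta)}{\varepsilon^{2}} = \frac{10 + \ln 20}{0.01} \approx \frac{13.0}{0.01} = 1300

So a few thousand samples is a sensible order of magnitude once the hidden constants are accounted for.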

Empirical Rademacher Complexity

R̂_S(F) = E_σ[ sup_{f ∈ F} (1/n) Σ_{i=1}^{n} σ_i f(x_i) ],   σ_i uniform on {−1, +1}

Explanation: Measures how well the class can fit random ±1 signs on the given sample, adapting to data geometry.

Rademacher Bound (L-Lipschitz loss)

R(f) ≤ R̂(f) + 2L · R̂_S(F) + 3√( ln(2/δ) / (2n) )   (loss bounded in [0, 1]; constants vary across texts)

Explanation: If the loss is L-Lipschitz and bounded, true risk is close to empirical risk plus a term scaling with Rademacher complexity and confidence.

Linear Predictors (Norm-Bounded)

R̂_S({x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ B}) = (B/n) E_σ‖ Σ_{i=1}^{n} σ_i x_i ‖₂ ≤ B·R / √n,   where R = max_i ‖x_i‖₂

Explanation: For linear functions with ‖w‖₂ ≤ B and ‖x_i‖₂ ≤ R, Rademacher complexity decays as 1/√n. This motivates L2 regularization and feature scaling.
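A rough plug-in matching the settings of the Rademacher estimator example below (standard normal features in d = 50 dimensions, so ‖x_i‖₂ is around √50; the exact maximum is somewhat larger):

% B = 1, R ~ sqrt(50) ~ 7.1, n = 500:
\frac{B R}{\sqrt{n}} \approx \frac{1 \cdot 7.1}{\sqrt{500}} \approx \frac{7.1}{22.4} \approx 0.32

Quadrupling the sample size to n = 2000 would halve this capacity term.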

Bias–Variance Decomposition

E[(f̂(X) − Y)²] = (E[f̂(X)] − f*(X))² + Var(f̂(X)) + σ²,   where Y = f*(X) + noise with variance σ²

Explanation: For squared loss, expected error splits into irreducible noise, variance across training sets, and squared bias from model misspecification.
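The decomposition follows from adding and subtracting E[f̂(x)] and using the independence of the training set and the test noise; a standard sketch, conditioning on a fixed test input x:

% Y = f^*(x) + \eta with E[\eta] = 0, Var(\eta) = \sigma^2, and \hat{f} independent of \eta:
\begin{align*}
E\big[(\hat{f}(x) - Y)^2\big]
&= E\big[(\hat{f}(x) - f^*(x))^2\big] + \sigma^2 \\
&= \big(E[\hat{f}(x)] - f^*(x)\big)^2 + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big] + \sigma^2 \\
&= \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Noise}.
\end{align*}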

Sauer–Shelah Lemma

Π_F(n) ≤ Σ_{i=0}^{d} C(n, i) ≤ (en/d)^d   for n ≥ d

Explanation: The number of distinct labelings realizable on n points by a class of VC dimension d is polynomially bounded, enabling VC generalization bounds.
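For example, with VC dimension d = 3 and n = 10 points:

\Pi_F(10) \le \binom{10}{0} + \binom{10}{1} + \binom{10}{2} + \binom{10}{3}
= 1 + 10 + 45 + 120 = 176 \ll 2^{10} = 1024

Far fewer than all 2^n labelings are realizable once n exceeds the VC dimension, which is what makes the union-bound argument work for infinite classes.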

SRM Objective

f̂ = argmin_{k, f ∈ F_k} [ R̂(f) + pen(F_k, n) ]

Explanation: Augments empirical risk with a penalty that increases with model complexity (e.g., norm or number of parameters) to find a good trade-off.

Complexity Analysis

Statistical learning theory separates sample complexity from computational complexity. Sample complexity describes how many examples n are needed to achieve accuracy ε with confidence 1−δ: for binary classification with VC dimension d, n typically scales like O((d log(1/ε) + log(1/δ))/ε²). Rademacher-based bounds replace d with a data-dependent quantity that often scales as O(BR/√n) for linear predictors with weight norm bound B and feature radius R. Thus, doubling the data roughly shrinks the estimation error by a factor of √2 when other factors are fixed.

Computationally, ERM over complex classes may be intractable (e.g., minimizing 0–1 loss is NP-hard). Practical learners minimize convex surrogates (hinge, logistic, squared loss) with algorithms like gradient descent or stochastic gradient descent (SGD). For a dataset of size n and d features, one pass of batch gradient descent costs O(nd) time and O(d) space; T iterations cost O(Tnd). Cross-validation for SRM multiplies this cost by k (folds) and by the number of hyperparameters evaluated. For example, k-fold CV over L choices of λ for ridge regression costs O(kLTnd). Estimating empirical Rademacher complexity for linear predictors via Monte Carlo requires drawing M Rademacher vectors and computing M norms of Σ_i σ_i x_i: time O(Mnd) and space O(d). Bias–variance experiments with R repetitions over m test points cost O(RTnd + Rmd). Overall, while theoretical guarantees guide design, practical implementations must balance optimization time, memory, and the overhead of validation or complexity estimation.
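A tiny back-of-the-envelope calculator for the operation counts above, using parameter values drawn from the examples in this section; the counts are rough multiply-add tallies, not timings, and the constants are illustrative.

#include <bits/stdc++.h>
using namespace std;

// Rough operation counts (multiply-adds) for the experiments in this section.
// Real running time also depends on constants, memory traffic, and the compiler.
int main(){
    long long n_train = 960, d = 20;            // training split used for CV in the first example
    long long k = 5, L = 5, T = 300;            // folds, lambda-grid size, GD iterations
    long long M = 200, n_rad = 500, d_rad = 50; // Rademacher estimator settings

    long long cv_ops    = k * L * T * n_train * d;  // O(k L T n d) for k-fold CV
    long long final_ops = 500LL * n_train * d;      // final fit uses 500 iterations
    long long rad_ops   = M * n_rad * d_rad;        // O(M n d) Monte Carlo estimate

    cout << "CV cost         ~ " << cv_ops    << " ops\n";
    cout << "Final fit cost  ~ " << final_ops << " ops\n";
    cout << "Rademacher cost ~ " << rad_ops   << " ops\n";
    return 0;
}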

Code Examples

ERM and SRM with Ridge Regression and k-fold Cross-Validation
#include <bits/stdc++.h>
using namespace std;

struct Dataset {
    vector<vector<double>> X; // n x d
    vector<double> y;         // n
};

// Generate synthetic linear data: y = w_true^T x + noise
Dataset make_linear_data(int n, int d, double noise_std, unsigned seed=42) {
    mt19937 rng(seed);
    normal_distribution<double> nd(0.0, 1.0);
    normal_distribution<double> noise(0.0, noise_std);

    vector<double> w_true(d, 0.0);
    for (int j = 0; j < d; ++j) w_true[j] = nd(rng); // random ground truth

    Dataset data; data.X.assign(n, vector<double>(d)); data.y.assign(n, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < d; ++j) data.X[i][j] = nd(rng);
        double yi = 0.0;
        for (int j = 0; j < d; ++j) yi += w_true[j] * data.X[i][j];
        yi += noise(rng);
        data.y[i] = yi;
    }
    return data;
}

// Compute predictions: y_hat = X w
vector<double> predict(const vector<vector<double>>& X, const vector<double>& w) {
    int n = (int)X.size(); int d = (int)w.size();
    vector<double> yhat(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double s = 0.0; for (int j = 0; j < d; ++j) s += X[i][j] * w[j];
        yhat[i] = s;
    }
    return yhat;
}

// Mean squared error (empirical risk for squared loss)
double mse(const vector<vector<double>>& X, const vector<double>& y, const vector<double>& w) {
    int n = (int)X.size();
    vector<double> yhat = predict(X, w);
    double s = 0.0; for (int i = 0; i < n; ++i) { double e = yhat[i] - y[i]; s += e*e; }
    return s / n;
}

// Batch gradient descent for ridge regression: minimize (1/n)||Xw - y||^2 + lambda ||w||^2
vector<double> ridge_gd(const vector<vector<double>>& X, const vector<double>& y,
                        double lambda, double lr=0.05, int iters=200) {
    int n = (int)X.size(); int d = (int)X[0].size();
    vector<double> w(d, 0.0);
    for (int t = 0; t < iters; ++t) {
        vector<double> grad(d, 0.0);
        for (int i = 0; i < n; ++i) {
            double yi_hat = 0.0; for (int j = 0; j < d; ++j) yi_hat += X[i][j]*w[j];
            double r = yi_hat - y[i];
            for (int j = 0; j < d; ++j) grad[j] += (2.0/n) * r * X[i][j];
        }
        for (int j = 0; j < d; ++j) grad[j] += 2.0 * lambda * w[j]; // ridge penalty gradient
        for (int j = 0; j < d; ++j) w[j] -= lr * grad[j];
    }
    return w;
}

// k-fold cross-validation to select lambda (SRM)
double kfold_cv(const Dataset& data, const vector<double>& lambdas, int k, double lr, int iters, double& best_lambda) {
    int n = (int)data.X.size();
    vector<int> idx(n); iota(idx.begin(), idx.end(), 0);
    mt19937 rng(123);
    shuffle(idx.begin(), idx.end(), rng);

    vector<vector<int>> folds(k);
    for (int i = 0; i < n; ++i) folds[i % k].push_back(idx[i]);

    double best_cv = numeric_limits<double>::infinity();
    best_lambda = lambdas[0];

    for (double lam : lambdas) {
        double cv_err = 0.0;
        for (int f = 0; f < k; ++f) {
            // Build train/val split
            vector<vector<double>> Xtr, Xval; vector<double> ytr, yval;
            for (int g = 0; g < k; ++g) {
                for (int id : folds[g]) {
                    if (g == f) { Xval.push_back(data.X[id]); yval.push_back(data.y[id]); }
                    else { Xtr.push_back(data.X[id]); ytr.push_back(data.y[id]); }
                }
            }
            vector<double> w = ridge_gd(Xtr, ytr, lam, lr, iters);
            cv_err += mse(Xval, yval, w);
        }
        cv_err /= k;
        if (cv_err < best_cv) { best_cv = cv_err; best_lambda = lam; }
    }
    return best_cv;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int n = 1200, d = 20; double noise_std = 0.5;
    Dataset data = make_linear_data(n, d, noise_std, 7);

    // Train/test split
    int n_train = (int)(0.8 * n);
    vector<vector<double>> Xtr(data.X.begin(), data.X.begin() + n_train);
    vector<double> ytr(data.y.begin(), data.y.begin() + n_train);
    vector<vector<double>> Xte(data.X.begin() + n_train, data.X.end());
    vector<double> yte(data.y.begin() + n_train, data.y.end());

    Dataset dtr{Xtr, ytr};

    vector<double> lambdas = {0.0, 1e-3, 1e-2, 1e-1, 1.0};
    double best_lambda = lambdas[0];
    double cv_err = kfold_cv(dtr, lambdas, 5, 0.05, 300, best_lambda);

    vector<double> w = ridge_gd(Xtr, ytr, best_lambda, 0.05, 500);
    double train_mse = mse(Xtr, ytr, w);
    double test_mse = mse(Xte, yte, w);

    cout << fixed << setprecision(6);
    cout << "Best lambda (SRM): " << best_lambda << "\n";
    cout << "CV MSE: " << cv_err << "\n";
    cout << "Train MSE (empirical risk): " << train_mse << "\n";
    cout << "Test MSE (proxy for true risk): " << test_mse << "\n";
    return 0;
}

We generate synthetic linear data and minimize empirical risk (MSE) with ridge (L2) regularization using batch gradient descent. We perform k-fold cross-validation over a grid of lambdas to implement SRM: pick the class/penalty that best balances fit and complexity. Finally, we train on the full training set with the chosen lambda and report training (empirical) and test (proxy for true) risks.

Time: O(k · L · T · n · d) for cross-validation (k folds, L lambdas, T iterations), plus O(T · n · d) for final training.
Space: O(n · d) to store data and O(d) for model/gradients.
Bias–Variance Decomposition Experiment (1D Polynomial Ridge Regression)
#include <bits/stdc++.h>
using namespace std;

// True function and noise
inline double f_true(double x) { return sin(2.0 * M_PI * x); }

struct Sample1D { vector<double> x, y; };

Sample1D make_data(int n, double noise_std, unsigned seed) {
    mt19937 rng(seed); uniform_real_distribution<double> unif(0.0, 1.0);
    normal_distribution<double> noise(0.0, noise_std);
    Sample1D s; s.x.resize(n); s.y.resize(n);
    for (int i = 0; i < n; ++i) { s.x[i] = unif(rng); s.y[i] = f_true(s.x[i]) + noise(rng); }
    return s;
}

// Build polynomial features up to degree p (including bias term)
vector<vector<double>> poly_features(const vector<double>& x, int p) {
    int n = (int)x.size(); vector<vector<double>> X(n, vector<double>(p+1, 1.0));
    for (int i = 0; i < n; ++i) {
        double powx = 1.0;
        for (int j = 1; j <= p; ++j) { powx *= x[i]; X[i][j] = powx; }
    }
    return X;
}

vector<double> predict(const vector<vector<double>>& X, const vector<double>& w) {
    int n = (int)X.size(); int d = (int)w.size(); vector<double> yhat(n, 0.0);
    for (int i = 0; i < n; ++i) { double s = 0.0; for (int j = 0; j < d; ++j) s += X[i][j]*w[j]; yhat[i] = s; }
    return yhat;
}

double mse(const vector<double>& yhat, const vector<double>& y) {
    int n = (int)y.size(); double s = 0.0; for (int i = 0; i < n; ++i) { double e = yhat[i]-y[i]; s += e*e; } return s/n;
}

vector<double> ridge_gd(const vector<vector<double>>& X, const vector<double>& y, double lambda, double lr=0.05, int iters=400) {
    int n = (int)X.size(); int d = (int)X[0].size(); vector<double> w(d, 0.0);
    for (int t = 0; t < iters; ++t) {
        vector<double> grad(d, 0.0);
        for (int i = 0; i < n; ++i) {
            double yi_hat = 0.0; for (int j = 0; j < d; ++j) yi_hat += X[i][j]*w[j];
            double r = yi_hat - y[i];
            for (int j = 0; j < d; ++j) grad[j] += (2.0/n) * r * X[i][j];
        }
        for (int j = 0; j < d; ++j) grad[j] += 2.0 * lambda * w[j];
        for (int j = 0; j < d; ++j) w[j] -= lr * grad[j];
    }
    return w;
}

int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);
    int R = 100;             // repetitions (datasets)
    int n = 40;              // sample size per dataset
    int p = 9;               // polynomial degree (capacity)
    double lambda = 1e-2;    // regularization (SRM)
    double noise_std = 0.15; // known noise level

    // Fixed test grid to estimate bias/variance across x
    int m = 200; vector<double> xt(m); for (int i=0;i<m;++i) xt[i] = (i+0.5)/m;
    auto Xt = poly_features(xt, p);
    vector<double> fstar(m); for (int i=0;i<m;++i) fstar[i] = f_true(xt[i]);

    vector<vector<double>> preds(R, vector<double>(m, 0.0));
    vector<double> test_mse_runs(R, 0.0);

    for (int r = 0; r < R; ++r) {
        auto s = make_data(n, noise_std, 777 + r);
        auto X = poly_features(s.x, p);
        vector<double> w = ridge_gd(X, s.y, lambda, 0.05, 400);
        vector<double> yhat_t = predict(Xt, w);
        preds[r] = yhat_t;
        // approximate test MSE using noisy labels simulated on grid
        vector<double> y_noisy(m);
        mt19937 rng(999 + r); normal_distribution<double> noise(0.0, noise_std);
        for (int i = 0; i < m; ++i) y_noisy[i] = fstar[i] + noise(rng);
        test_mse_runs[r] = mse(yhat_t, y_noisy);
    }

    // Compute bias^2 and variance across the R models at each x
    vector<double> mean_pred(m, 0.0), var_pred(m, 0.0);
    for (int i = 0; i < m; ++i) {
        for (int r = 0; r < R; ++r) mean_pred[i] += preds[r][i];
        mean_pred[i] /= R;
        double s2 = 0.0; for (int r = 0; r < R; ++r) { double d = preds[r][i] - mean_pred[i]; s2 += d*d; }
        var_pred[i] = s2 / R;
    }

    // Aggregate components
    double bias2 = 0.0, var = 0.0;
    for (int i=0;i<m;++i){ double b = mean_pred[i] - fstar[i]; bias2 += b*b; var += var_pred[i]; }
    bias2 /= m; var /= m; double noise2 = noise_std*noise_std;
    double avg_test_mse = accumulate(test_mse_runs.begin(), test_mse_runs.end(), 0.0) / R;

    cout << fixed << setprecision(6);
    cout << "Bias^2: " << bias2 << "\n";
    cout << "Variance: " << var << "\n";
    cout << "Noise variance: " << noise2 << "\n";
    cout << "Bias^2 + Var + Noise: " << (bias2 + var + noise2) << "\n";
    cout << "Avg Test MSE (approx): " << avg_test_mse << "\n";
    return 0;
}

We fit many models on different noisy datasets and evaluate predictions on a fixed grid. For each input, we compute the mean prediction (to estimate bias) and the variance across runs. Averaging over the grid yields bias² and variance, and adding the known noise variance approximates the expected squared error. This numerically illustrates the bias–variance trade-off and the decomposition.

Time: O(R · (T · n · p + m · p)), dominated by training across R repetitions.
Space: O(n · p + R · m) for features and storing predictions across runs.
Empirical Rademacher Complexity Estimator for Linear Predictors
#include <bits/stdc++.h>
using namespace std;

struct Matrix { vector<vector<double>> X; }; // n x d

Matrix make_data(int n, int d, unsigned seed=3) {
    mt19937 rng(seed); normal_distribution<double> nd(0.0, 1.0);
    Matrix M; M.X.assign(n, vector<double>(d, 0.0));
    for (int i=0;i<n;++i){ for (int j=0;j<d;++j){ M.X[i][j] = nd(rng); } }
    return M;
}

double l2norm(const vector<double>& v){ double s=0.0; for(double x:v) s+=x*x; return sqrt(s); }

// Estimate \hat{R}_S(F_B) = (B/n) E_sigma || sum_i sigma_i x_i ||_2 via Monte Carlo
// Optionally report the bound BR/sqrt(n) with R = max_i ||x_i||_2
int main(){
    ios::sync_with_stdio(false); cin.tie(nullptr);

    int n = 500, d = 50; double B = 1.0; int M = 200; // trials
    Matrix data = make_data(n, d, 11);

    // Compute R = max ||x_i||
    double R = 0.0; for (int i=0;i<n;++i){ R = max(R, l2norm(data.X[i])); }

    mt19937 rng(1234); uniform_int_distribution<int> coin(0,1);
    double sum_complexity = 0.0;
    vector<double> s(d, 0.0);

    for (int t = 0; t < M; ++t) {
        fill(s.begin(), s.end(), 0.0);
        for (int i = 0; i < n; ++i) {
            int bit = coin(rng); double sigma = bit ? 1.0 : -1.0;
            for (int j = 0; j < d; ++j) s[j] += sigma * data.X[i][j];
        }
        double val = (B / n) * l2norm(s);
        sum_complexity += val;
    }
    double rademacher_est = sum_complexity / M;
    double upper_bound = (B * R) / sqrt((double)n);

    cout << fixed << setprecision(6);
    cout << "Empirical Rademacher (linear, norm<=B): " << rademacher_est << "\n";
    cout << "Theoretical upper bound BR/sqrt(n): " << upper_bound << "\n";
    return 0;
}

For the class of linear predictors with L2-norm at most B, the empirical Rademacher complexity equals (B/n) times the expected L2 norm of the signed sum of examples. We approximate this expectation by averaging over M random sign vectors. We also report the standard bound BR/√n using the maximum example norm R, demonstrating how capacity shrinks with more data and tighter norm constraints.

Time: O(M · n · d) to accumulate signed sums over M trials.
Space: O(n · d) to store data and O(d) for the accumulator vector.