
PAC Learning

Key Points

  • PAC learning formalizes when a learner can probably (with probability at least 1āˆ’Ī“) and approximately (with error at most ε) succeed using a polynomial number of samples.
  • A concept class is PAC learnable if and only if it has finite VC dimension, which measures how expressive the class is.
  • Sample complexity scales like O((VC + log(1/Ī“))/ε²) in the agnostic setting, and roughly O((VCĀ·log(1/ε) + log(1/Ī“))/ε) in the realizable setting.
  • Empirical Risk Minimization (ERM) is the core strategy: pick the hypothesis that minimizes training error; with enough samples it generalizes.
  • Agnostic PAC learning drops the assumption that the target is in the hypothesis class and aims to be within ε of the best-in-class error.
  • Computational efficiency (running time) and statistical efficiency (sample complexity) are different; a class can be PAC learnable but computationally intractable.
  • VC dimension controls uniform convergence: empirical errors become close to true errors for all hypotheses simultaneously as samples grow.
  • Simple classes like 1D thresholds and intervals are efficiently and provably PAC learnable; the perceptron provides a classic linear-separator learner.

Prerequisites

  • Probability and concentration inequalities — PAC bounds rely on probability over sample draws and tools like Hoeffding’s inequality and union bounds.
  • Asymptotic notation — Understanding O(Ā·), Θ(Ā·), and sample-complexity scaling in ε, Ī“, and VC dimension is essential.
  • Basic linear algebra — Linear separators and perceptron updates use dot products and norms.
  • Binary classification and loss functions — PAC learning typically analyzes 0–1 loss; surrogate losses are used for computational tractability.
  • Sorting and prefix sums — Efficient ERM for thresholds and intervals uses sorting and cumulative counts.

Detailed Explanation


01 Overview

Probably Approximately Correct (PAC) learning is a mathematical framework that explains what it means for an algorithm to learn from data. Instead of demanding perfection, PAC allows a learner to be approximately correct—its prediction error need only be below a tolerance ε—and probably correct—this guarantee should hold with probability at least 1āˆ’Ī“ over the random draw of training data. Two central questions are: how many samples are needed to achieve these guarantees (sample complexity), and how much computation is needed (computational complexity)? A breakthrough result states that a concept class is PAC learnable if and only if it has finite Vapnik–Chervonenkis (VC) dimension, a combinatorial measure of expressiveness. The framework also distinguishes the realizable case, where the data are perfectly explainable by the class, from the agnostic case, where noise or misspecification may prevent perfect fit. In both cases, Empirical Risk Minimization (ERM)—choosing the hypothesis with the smallest training error—forms the backbone of learning, with uniform convergence results ensuring that training error is a good proxy for true error given enough data. PAC learning is foundational for AI/ML because it provides precise conditions under which learning is possible, clarifies trade-offs among accuracy (ε), confidence (Ī“), and data, and highlights the gap that can exist between being statistically learnable and computationally feasible.

02 Intuition & Analogies

Imagine you’re trying to guess the rule for passing a toll booth without getting a ticket. You can drive through many times and observe whether you get a ticket (a labeled example). You don’t need a perfect understanding of the rule—just one that keeps tickets rare (error ≤ ε). And you want your strategy to work not just on the drives you tried, but in general—ideally with high confidence (at least 1āˆ’Ī“) that it will keep your future ticket rate low. That’s the spirit of PAC learning: approximately right and probably so. The toll rules could come from different types of policies (the concept class): maybe a simple rule like ā€œticket if speed > threshold,ā€ or a more complex combination of conditions. The more complex the policy family, the more test drives you need to be confident you’ve really learned the pattern; this is what VC dimension measures, akin to how many distinct ways the toll can react to different sets of speeds. Empirical Risk Minimization is like picking the rule that caused you the fewest tickets in your practice runs. If you took enough runs and the policy family isn’t too wild, the rule that worked best on your trials will also work well afterward. In the real world, the true rule might not be exactly in your policy family (agnostic case), or there might be randomness (sensor noise, weather). In that case, the best you can hope for is to match, up to ε, the performance of the best rule inside your chosen family. Finally, even if it’s statistically possible to learn a rule from the data, finding it might be computationally hard—like having enough clues to solve a puzzle but the search space is still massive. PAC learning separates these two axes: how much data you need and how hard it is to compute a learner.

03 Formal Definition

Let X be an instance space and C āŠ† {0,1}^X be a concept class (the set of target labelings). A hypothesis class H āŠ† {0,1}^X is used by a learner. For a distribution D over X Ɨ {0,1}, the risk (generalization error) of h ∈ H is R(h) = Pr_{(x,y)~D}[h(x) ≠ y]. An algorithm A PAC learns C using H if there exists a sample-complexity function m(ε,Ī“), polynomial in 1/ε and 1/Ī“, such that for all D over X, for all c ∈ C, when A receives m(ε,Ī“) i.i.d. samples S from D labeled by c, with probability at least 1āˆ’Ī“ over S it outputs a hypothesis h ∈ H satisfying R(h) ≤ ε. Efficient PAC learning additionally requires that A runs in time polynomial in m, 1/ε, 1/Ī“, and size parameters of c. In the realizable case, we assume ∃h* ∈ H with R(h*) = 0; in the agnostic case, no such assumption is made, and the requirement becomes R(h) ≤ min_{h'∈H} R(h') + ε. A pivotal theorem (the Fundamental Theorem of Statistical Learning) states: H is PAC learnable if and only if VCdim(H) < āˆž. Sample complexity bounds follow from uniform convergence: with high probability, sup_{h∈H} |RĢ‚_S(h) āˆ’ R(h)| is small, where RĢ‚_S(h) is the empirical error on S. For finite H, m = O((log|H| + log(1/Ī“))/ε) suffices in the realizable case, and m = O((log|H| + log(1/Ī“))/ε²) in the agnostic case. For infinite H with VC dimension d, these become O((dĀ·log(1/ε) + log(1/Ī“))/ε) and O((d + log(1/Ī“))/ε²), respectively.

04 When to Use

Use PAC learning to reason about whether your chosen hypothesis class has the capacity to learn your task from a feasible amount of data and what error/confidence trade-offs are possible. It is especially useful when selecting model families (e.g., thresholds, intervals, linear separators, decision trees) and needing guarantees about generalization. In the realizable regime—such as synthetic data generated by a known rule—PAC bounds with ε-linear rates offer tight insights. In realistic, noisy applications—sensor noise, label noise, misspecification—agnostic PAC bounds with ε² dependence capture the data needs more accurately. When your class has finite VC dimension and your algorithm performs ERM or regularized ERM, you can expect uniform convergence, so validation performance approximates test performance as sample size grows. PAC analysis is also valuable when comparing models: a more expressive class might fit better empirically but demand many more samples to generalize. Finally, PAC is a lens for understanding the limits of learning: it highlights when a class is not learnable without assumptions (no-free-lunch) or when computational barriers (e.g., learning DNF under arbitrary distributions) make an otherwise PAC-learnable class intractable without additional structure.

āš ļøCommon Mistakes

• Confusing statistical and computational efficiency: A class with finite VC dimension may be PAC learnable, yet finding the ERM can be NP-hard (e.g., general 0–1 loss linear classification with constraints). Always separate sample complexity from algorithmic tractability.
• Misreading ε and Γ: Smaller ε (higher accuracy) and smaller Γ (higher confidence) both demand more samples; reducing one while keeping the other fixed still increases sample complexity.
• Overinterpreting asymptotic O(·) bounds: Constants can be large; in practice, use validation to calibrate data needs and consider refined bounds (margins, localized complexities).
• Assuming realizability in noisy data: If labels are noisy or the target is outside H, realizable bounds are overly optimistic; switch to agnostic guarantees and algorithms minimizing surrogate losses or performing explicit 0–1 ERM only where tractable.
• Ignoring distributional assumptions: PAC guarantees are distribution-free over X, but algorithmic performance (e.g., perceptron convergence) may rely on margin or separability; be explicit about such assumptions when interpreting results.
• Equating VC dimension with parameter count: They are related but not identical; some infinite-parameter models have finite VC dimension and vice versa.
• Believing more data always fixes overfitting immediately: If H is too expressive relative to m, training error can be misleading; regularization or restricting H is needed so uniform convergence kicks in at practical sample sizes.

Key Formulas

Risk (Generalization Error)

R(h) = Pr_{(x,y)~D}[h(x) ≠ y]

Explanation: Risk is the probability that hypothesis h misclassifies a fresh example drawn from the data distribution D. It is the true test error we ultimately care about.

Empirical Risk

RĢ‚_S(h) = (1/m) Ī£_{i=1}^{m} 1[h(x_i) ≠ y_i]

Explanation: Empirical risk is the average error of h on the training sample S of size m. Uniform convergence connects this to the true risk R(h).

PAC Guarantee (Realizable)

Pr_S[R(ĥ_S) ≤ ε] ≄ 1āˆ’Ī“

Explanation: In the realizable case, the ERM hypothesis ĥ_S has true error at most ε with probability at least 1āˆ’Ī“, provided enough samples are used. The probability is over the random draw of the sample.

Finite Class Realizable Sample Complexity

m ≄ (1/ε)(ln|H| + ln(1/Ī“))

Explanation: For a finite hypothesis class H in the realizable case, this many samples suffice so that ERM achieves error at most ε with probability at least 1āˆ’Ī“. It follows from a union bound over H.

VC Realizable Sample Complexity

m = O((dĀ·log(1/ε) + log(1/Ī“))/ε)

Explanation: If VCdim(H)=d and realizability holds, sample complexity scales linearly with 1/ε (up to a log(1/ε) factor) and only logarithmically with 1/Ī“. This uses Sauer's lemma and uniform convergence.

VC Agnostic Sample Complexity

m = O((d + log(1/Ī“))/ε²)

Explanation: In the agnostic case, to get within ε of the best-in-class error with probability at least 1āˆ’Ī“, the number of samples grows like 1/ε² times the VC dimension plus the confidence term.

Sauer's Lemma (Growth Function)

Ī _H(m) ≤ Ī£_{i=0}^{d} C(m, i) ≤ (em/d)^d for m ≄ d ≄ 1

Explanation: The number of distinct labelings H can realize on m points is bounded by this expression when VCdim(H)=d. It is crucial in converting bounds from finite classes to infinite classes.

Hoeffding's Inequality

Pr(|(1/m) Ī£_{i=1}^{m} Z_i āˆ’ E[Z]| ≄ ε) ≤ 2Ā·exp(āˆ’2mε²) for i.i.d. Z_i ∈ [0,1]

Explanation: For bounded i.i.d. variables, the empirical average concentrates around the true mean. It underpins many PAC bounds via uniform convergence.

Uniform Convergence via Growth Function

Pr(sup_{h∈H} |RĢ‚_S(h) āˆ’ R(h)| > ε) ≤ 4Ā·Ī _H(2m)Ā·exp(āˆ’mε²/8)

Explanation: Combining Hoeffding with a union bound over the growth function controls the worst-case deviation between empirical and true errors across all hypotheses.

Generalization Gap (VC Bound)

R(h) ≤ RĢ‚_S(h) + O(√((dĀ·log(m/d) + log(1/Ī“))/m)) for all h ∈ H, with probability at least 1āˆ’Ī“

Explanation: With high probability, the true error is close to the empirical error by an amount that shrinks as sample size grows and complexity (VC dimension) decreases.

Perceptron Mistake Bound

mistakes ≤ (R/γ)²

Explanation: If data lie in a ball of radius R and are linearly separable with margin γ, the perceptron makes at most (R/γ)² mistakes. This is a classic realizable-case learning guarantee.

Complexity Analysis

The sample complexity bounds do not involve computation, but our concrete learners do. For the 1D threshold ERM, we sort m labeled points and scan candidate thresholds (midpoints between adjacent x-values) under two polarities. Sorting dominates with O(m log m) time and O(m) space; scanning is O(m). This learner is statistically efficient (VC dimension 1) and computationally efficient (polynomial time). For the 1D interval ERM (agnostic), we sort once (O(m log m)) and then examine all O(m²) candidate intervals using prefix sums, yielding O(m²) time and O(m) space. This is still polynomial and illustrates agnostic ERM in a tractable class (VC dimension 2), though the quadratic factor can be heavy for very large m. For the perceptron in ā„^d, each epoch processes all m points with O(d) work per update; for T epochs (or M mistakes), runtime is O(T m d) in the worst case, but standard analyses bound the number of updates by O((R/γ)²) in the separable case, giving O((R/γ)²·d) total update work. Space for the perceptron is O(d) for the weight vector plus O(1) per point if streamed; storing the dataset costs O(m d). Importantly, these computational analyses are distinct from statistical bounds: even when VC dimension implies learnability with few samples, ERM over complex classes (e.g., 0–1 loss for general linear separators with combinatorial constraints) may be NP-hard. In such cases, surrogate-loss minimization (e.g., logistic or hinge losses) is used to get computationally feasible procedures with PAC-style guarantees via uniform convergence.

Code Examples

ERM for 1D Thresholds with Sample Complexity Helper (Realizable PAC)
#include <bits/stdc++.h>
using namespace std;

struct ThresholdModel {
    double threshold; // cut point t
    int polarity;     // +1: predict 1 if x >= t, -1: predict 1 if x <= t
};

// Predict label {0,1} given model and x
int predict(const ThresholdModel &m, double x) {
    if (m.polarity == +1) return (x >= m.threshold) ? 1 : 0;
    else return (x <= m.threshold) ? 1 : 0;
}

// Train ERM threshold on labeled data (x_i, y_i) with y in {0,1}
ThresholdModel train_threshold_erm(vector<pair<double,int>> data) {
    // Sort by x
    sort(data.begin(), data.end(), [](auto &a, auto &b){ return a.first < b.first; });
    int n = (int)data.size();
    vector<double> xs(n);
    vector<int> ys(n);
    for (int i = 0; i < n; ++i) { xs[i] = data[i].first; ys[i] = data[i].second; }

    // Candidate thresholds: midpoints between consecutive xs, plus extremes
    vector<double> cand;
    cand.reserve(n + 1);
    if (n == 0) return {0.0, +1};
    // Extreme left and right: a bit outside observed range
    cand.push_back(xs.front() - 1.0);
    for (int i = 0; i + 1 < n; ++i) {
        double mid = (xs[i] + xs[i+1]) / 2.0;
        cand.push_back(mid);
    }
    cand.push_back(xs.back() + 1.0);

    ThresholdModel best{cand[0], +1};
    int best_err = n + 1;

    // Evaluate each candidate with both polarities
    for (double t : cand) {
        for (int pol : {+1, -1}) {
            ThresholdModel m{t, pol};
            int err = 0;
            for (int i = 0; i < n; ++i) {
                if (predict(m, xs[i]) != ys[i]) ++err;
            }
            if (err < best_err) {
                best_err = err;
                best = m;
            }
        }
    }
    return best;
}

// Compute minimum sample size m for agnostic VC bound: m >= (2/eps^2)*(d*ln(2e/eps) + ln(2/delta))
// This is a practical concrete instantiation (not the only valid one)
size_t agnostic_sample_size_vc(int d, double eps, double delta) {
    if (eps <= 0 || delta <= 0 || delta >= 1) throw invalid_argument("Bad eps/delta");
    double term = d * log(2.0 * exp(1.0) / eps) + log(2.0 / delta);
    double m = (2.0 / (eps * eps)) * term;
    return (size_t)ceil(m);
}

int main() {
    // Simple demo dataset (realizable by a threshold at t=3)
    vector<pair<double,int>> train = {
        {1.0,0},{1.5,0},{2.0,0},{2.5,0},{3.0,1},{3.5,1},{4.0,1},{5.0,1}
    };

    ThresholdModel model = train_threshold_erm(train);
    cout << fixed << setprecision(3);
    cout << "Learned threshold t=" << model.threshold << ", polarity=" << model.polarity << "\n";

    // Evaluate training error
    int err = 0; for (auto &p : train) err += (predict(model, p.first) != p.second);
    cout << "Training error = " << err << "/" << train.size() << " (" << (100.0*err/train.size()) << "%)\n";

    // Show sample complexity suggestion for VC dimension d=1 (thresholds)
    int d = 1; double eps = 0.1, delta = 0.05;
    size_t m_needed = agnostic_sample_size_vc(d, eps, delta);
    cout << "Agnostic VC sample size suggestion for d=1, eps=0.1, delta=0.05: m >= " << m_needed << "\n";

    // Quick test predictions
    vector<double> test_x = {2.2, 2.9, 3.1, 4.8};
    for (double x : test_x) {
        cout << "x=" << x << " -> pred=" << predict(model, x) << "\n";
    }
    return 0;
}

This program implements ERM for 1D threshold classifiers, a classic PAC-learnable class with VC dimension 1. It searches all midpoints between sorted samples and both label polarities, returning the hypothesis with minimal training error. It also includes a helper that estimates the agnostic VC sample complexity for given ε, Γ, and VC dimension d, illustrating how theoretical bounds guide required data size.

Time: O(m log m) for training (sorting dominates); O(m) scanning thresholds; O(1) per prediction
Space: O(m) to store and sort data; O(1) model size
Perceptron Learning (Realizable Case with Margin-Based Guarantee)
#include <bits/stdc++.h>
using namespace std;

struct Perceptron {
    vector<double> w; // weights
    double b;         // bias
    double lr;        // learning rate

    Perceptron(int d, double lr_=1.0) : w(d, 0.0), b(0.0), lr(lr_) {}

    int predict(const vector<double>& x) const {
        double s = inner_product(w.begin(), w.end(), x.begin(), 0.0) + b;
        return s >= 0 ? +1 : -1;
    }

    int fit(const vector<vector<double>>& X, const vector<int>& y, int max_epochs=1000) {
        int n = (int)X.size();
        int d = (int)w.size();
        int updates = 0;
        for (int epoch = 0; epoch < max_epochs; ++epoch) {
            int mistakes = 0;
            for (int i = 0; i < n; ++i) {
                int pred = predict(X[i]);
                if (pred != y[i]) {
                    // Update: w <- w + lr * y_i * x_i, b <- b + lr * y_i
                    for (int j = 0; j < d; ++j) w[j] += lr * y[i] * X[i][j];
                    b += lr * y[i];
                    ++updates; ++mistakes;
                }
            }
            if (mistakes == 0) break; // converged on separable data
        }
        return updates;
    }
};

int main(){
    // Simple linearly separable data in R^2
    vector<vector<double>> X = {
        {2.0, 1.0}, {1.0, 1.5}, {2.5, 2.0}, {3.0, 1.0},        // class +1
        {-1.0, -1.0}, {-2.0, -0.5}, {-1.5, -2.0}, {-2.5, -1.5} // class -1
    };
    vector<int> y = {+1,+1,+1,+1, -1,-1,-1,-1};

    Perceptron clf(2, 1.0);
    int updates = clf.fit(X, y, 100);
    cout << "Updates: " << updates << "\n";
    cout << fixed << setprecision(3);
    cout << "w = [" << clf.w[0] << ", " << clf.w[1] << "], b = " << clf.b << "\n";

    // Evaluate training accuracy
    int correct = 0;
    for (size_t i = 0; i < X.size(); ++i) correct += (clf.predict(X[i]) == y[i]);
    cout << "Training accuracy: " << (100.0*correct/X.size()) << "%\n";
    return 0;
}

This is the classic perceptron algorithm for linear separators. In the realizable (separable) setting, perceptron finds a consistent classifier and enjoys a mistake bound of O((R/γ)²), aligning with PAC-style guarantees. It illustrates computational efficiency (simple updates) alongside statistical conditions (separability and margin).

Time: O(T m d), where T is the number of epochs (or total updates) and d is the dimension
Space: O(d) for weights; O(m d) if storing the dataset
Agnostic ERM for 1D Intervals (VC Dimension 2)
#include <bits/stdc++.h>
using namespace std;

struct IntervalModel {
    double left, right; // predict 1 if x in [left, right], else 0
};

int predict(const IntervalModel &m, double x){
    return (x >= m.left && x <= m.right) ? 1 : 0;
}

// Train agnostic ERM over all intervals [x_i, x_j]
IntervalModel train_interval_erm(vector<pair<double,int>> data) {
    int n = (int)data.size();
    sort(data.begin(), data.end(), [](auto &a, auto &b){ return a.first < b.first; });
    vector<double> xs(n);
    vector<int> ys(n);
    for (int i = 0; i < n; ++i) { xs[i] = data[i].first; ys[i] = data[i].second; }

    // Prefix sums for counts of positives and negatives
    vector<int> pref_pos(n+1,0), pref_neg(n+1,0);
    for (int i = 0; i < n; ++i) {
        pref_pos[i+1] = pref_pos[i] + (ys[i]==1);
        pref_neg[i+1] = pref_neg[i] + (ys[i]==0);
    }
    auto cnt_pos = [&](int l, int r){ return pref_pos[r+1] - pref_pos[l]; };
    auto cnt_neg = [&](int l, int r){ return pref_neg[r+1] - pref_neg[l]; };

    // Candidate interval endpoints chosen among data points; use midpoints for prediction boundaries
    int best_err = n + 1; IntervalModel best{0,0};
    // Also consider empty interval (always predict 0) and full interval (always predict 1)
    int err_all0 = pref_pos[n];
    if (err_all0 < best_err) { best_err = err_all0; best = {+numeric_limits<double>::infinity(), -numeric_limits<double>::infinity()}; }
    int err_all1 = pref_neg[n];
    if (err_all1 < best_err) { best_err = err_all1; best = {-numeric_limits<double>::infinity(), +numeric_limits<double>::infinity()}; }

    for (int i = 0; i < n; ++i) {
        for (int j = i; j < n; ++j) {
            // Errors: negatives inside [i..j] + positives outside (total_pos - pos_in)
            int neg_in = cnt_neg(i, j);
            int pos_in = cnt_pos(i, j);
            int pos_out = pref_pos[n] - pos_in;
            int err = neg_in + pos_out;
            if (err < best_err) {
                best_err = err;
                // Set boundaries between points for deterministic predictions
                double left = (i==0) ? xs[i] - 1.0 : (xs[i-1] + xs[i]) / 2.0;
                double right = (j==n-1) ? xs[j] + 1.0 : (xs[j] + xs[j+1]) / 2.0;
                best = {left, right};
            }
        }
    }
    return best;
}

int main(){
    // Mixed/noisy labels: not perfectly representable by a single interval
    vector<pair<double,int>> train = {
        {0.0,0},{0.5,0},{1.0,1},{1.5,1},{2.0,1},{2.5,0},{3.0,0},{3.5,1},{4.0,0}
    };

    IntervalModel m = train_interval_erm(train);
    cout << fixed << setprecision(3);
    cout << "Learned interval: [" << m.left << ", " << m.right << "]\n";

    int err = 0; for (auto &p : train) err += (predict(m, p.first) != p.second);
    cout << "Training error = " << err << "/" << train.size() << " (" << (100.0*err/train.size()) << "%)\n";

    vector<double> test_x = {0.25, 1.25, 2.25, 3.25};
    for (double x : test_x) cout << "x=" << x << " -> pred=" << predict(m, x) << "\n";
    return 0;
}

This program performs agnostic ERM over the class of 1D intervals (predict 1 inside, 0 outside). It evaluates all O(m²) intervals efficiently using prefix sums to compute the number of misclassifications for each candidate. Intervals have VC dimension 2, so this class is PAC learnable; the algorithm is polynomial-time and demonstrates agnostic learning because it minimizes empirical 0–1 loss without assuming realizability.

Time: O(m log m + m²) due to sorting and scanning all intervals
Space: O(m) for storing data and prefix sums; O(1) model size