PAC Learning
Key Points
- PAC learning formalizes when a learner can probably (with probability at least 1−δ) and approximately (with error at most ε) succeed using a polynomial number of samples.
- A concept class is PAC learnable if and only if it has finite VC dimension, which measures how expressive the class is.
- Sample complexity scales like O((VC + log(1/δ))/ε²) in the agnostic setting, and roughly O((VC·log(1/ε) + log(1/δ))/ε) in the realizable setting (see the sketch after this list).
- Empirical Risk Minimization (ERM) is the core strategy: pick the hypothesis that minimizes training error; with enough samples it generalizes.
- Agnostic PAC learning drops the assumption that the target is in the hypothesis class and aims to be within ε of the best-in-class error.
- Computational efficiency (running time) and statistical efficiency (sample complexity) are different; a class can be PAC learnable but computationally intractable.
- VC dimension controls uniform convergence: empirical errors become close to true errors for all hypotheses simultaneously as samples grow.
- Simple classes like 1D thresholds and intervals are efficiently and provably PAC learnable; the perceptron provides a classic linear separator learner.
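To make these rates concrete, here is a minimal C++ sketch (function names and constants are illustrative choices, not from the original): realizable_sample_size uses one classic instantiation of the realizable bound (Blumer et al. style), and agnostic_sample_size mirrors the helper used in the threshold ERM example later in this section. Exact constants vary across textbooks, so treat the outputs as rough guidance rather than exact requirements.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <stdexcept>
using namespace std;

// One classic realizable-case instantiation (Blumer et al. style):
//   m >= max( (4/eps)*ln(2/delta), (8d/eps)*ln(13/eps) )
size_t realizable_sample_size(int d, double eps, double delta) {
    if (eps <= 0 || eps >= 1 || delta <= 0 || delta >= 1)
        throw invalid_argument("eps and delta must lie in (0,1)");
    double a = (4.0 / eps) * log(2.0 / delta);
    double b = (8.0 * d / eps) * log(13.0 / eps);
    return (size_t)ceil(max(a, b));
}

// Agnostic instantiation matching the helper in the threshold ERM example below:
//   m >= (2/eps^2) * ( d*ln(2e/eps) + ln(2/delta) )
size_t agnostic_sample_size(int d, double eps, double delta) {
    if (eps <= 0 || eps >= 1 || delta <= 0 || delta >= 1)
        throw invalid_argument("eps and delta must lie in (0,1)");
    double term = d * log(2.0 * exp(1.0) / eps) + log(2.0 / delta);
    return (size_t)ceil((2.0 / (eps * eps)) * term);
}

int main() {
    int d = 2; double delta = 0.05;
    for (double eps : {0.1, 0.05, 0.01}) {
        cout << "eps=" << eps
             << "  realizable m >= " << realizable_sample_size(d, eps, delta)
             << "  agnostic m >= " << agnostic_sample_size(d, eps, delta) << "\n";
    }
    return 0;
}

For δ = 0.05 and d = 2, halving ε roughly doubles the realizable requirement but quadruples the agnostic one, which is the 1/ε versus 1/ε² gap in action.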
Prerequisites
- Probability and concentration inequalities: PAC bounds rely on probability over sample draws and tools like Hoeffding's inequality and union bounds.
- Asymptotic notation: Understanding O(·), Ω(·), and Θ(·) and how sample complexity scales in ε, δ, and VC dimension is essential.
- Basic linear algebra: Linear separators and perceptron updates use dot products and norms.
- Binary classification and loss functions: PAC learning typically analyzes 0–1 loss; surrogate losses are used for computational tractability.
- Sorting and prefix sums: Efficient ERM for thresholds and intervals uses sorting and cumulative counts.
Detailed Explanation
01 Overview
Probably Approximately Correct (PAC) learning is a mathematical framework that explains what it means for an algorithm to learn from data. Instead of demanding perfection, PAC allows a learner to be approximately correct (its prediction error need only be below a tolerance ε) and probably correct (this guarantee should hold with probability at least 1−δ over the random draw of training data). Two central questions are: how many samples are needed to achieve these guarantees (sample complexity), and how much computation is needed (computational complexity)? A breakthrough result states that a concept class is PAC learnable if and only if it has finite Vapnik–Chervonenkis (VC) dimension, a combinatorial measure of expressiveness. The framework also distinguishes the realizable case, where the data are perfectly explainable by the class, from the agnostic case, where noise or misspecification may prevent a perfect fit. In both cases, Empirical Risk Minimization (ERM), choosing the hypothesis with the smallest training error, forms the backbone of learning, with uniform convergence results ensuring that training error is a good proxy for true error given enough data. PAC learning is foundational for AI/ML because it provides precise conditions under which learning is possible, clarifies trade-offs among accuracy (ε), confidence (δ), and data, and highlights the gap that can exist between being statistically learnable and computationally feasible.
02 Intuition & Analogies
Imagine you're trying to guess the rule for passing a toll booth without getting a ticket. You can drive through many times and observe whether you get a ticket (a labeled example). You don't need a perfect understanding of the rule, just one that keeps tickets rare (error ≤ ε). And you want your strategy to work not just on the drives you tried, but in general, ideally with high confidence (at least 1−δ) that it will keep your future ticket rate low. That's the spirit of PAC learning: approximately right and probably so. The toll rules could come from different types of policies (the concept class): maybe a simple rule like "ticket if speed > threshold," or a more complex combination of conditions. The more complex the policy family, the more test drives you need to be confident you've really learned the pattern; this is what VC dimension measures, akin to how many distinct ways the toll can react to different sets of speeds. Empirical Risk Minimization is like picking the rule that caused you the fewest tickets in your practice runs. If you took enough runs and the policy family isn't too wild, the rule that worked best on your trials will also work well afterward. In the real world, the true rule might not be exactly in your policy family (agnostic case), or there might be randomness (sensor noise, weather). In that case, the best you can hope for is to match, up to ε, the performance of the best rule inside your chosen family. Finally, even if it's statistically possible to learn a rule from the data, finding it might be computationally hard, like having enough clues to solve a puzzle while the search space is still massive. PAC learning separates these two axes: how much data you need and how hard it is to compute a learner.
03 Formal Definition
A hypothesis class H is (realizably) PAC learnable if there exist a function m_H(ε, δ) and a learning algorithm A such that for every ε, δ ∈ (0, 1), every distribution D over the input space, and every target concept consistent with H, running A on m ≥ m_H(ε, δ) i.i.d. labeled examples returns a hypothesis h whose true error satisfies R(h) ≤ ε with probability at least 1−δ over the draw of the sample. Agnostic PAC learning drops the realizability assumption and instead requires R(h) ≤ min_{h′∈H} R(h′) + ε. Efficient PAC learning additionally requires A to run in time polynomial in 1/ε, 1/δ, and the representation size of the problem.
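For reference, the realizable definition compresses to one line; the LaTeX below is an added rendering of the standard textbook statement rather than text recovered from the original page.

\[
\exists\, m_{\mathcal{H}} : (0,1)^2 \to \mathbb{N},\ \exists\, A \ \text{such that}\ \forall \varepsilon, \delta \in (0,1),\ \forall D,\ \forall f \in \mathcal{H}:\quad
m \ge m_{\mathcal{H}}(\varepsilon, \delta) \;\Longrightarrow\; \Pr_{S \sim D^{m}}\!\bigl[\, R\bigl(A(S)\bigr) \le \varepsilon \,\bigr] \ \ge\ 1 - \delta .
\]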
04 When to Use
Use PAC learning to reason about whether your chosen hypothesis class has the capacity to learn your task from a feasible amount of data and what error/confidence trade-offs are possible. It is especially useful when selecting model families (e.g., thresholds, intervals, linear separators, decision trees) and needing guarantees about generalization. In the realizable regime, such as synthetic data generated by a known rule, PAC bounds with ε-linear rates offer tight insights. In realistic, noisy applications (sensor noise, label noise, misspecification), agnostic PAC bounds with ε² dependence capture the data needs more accurately. When your class has finite VC dimension and your algorithm performs ERM or regularized ERM, you can expect uniform convergence, so validation performance approximates test performance as sample size grows. PAC analysis is also valuable when comparing models: a more expressive class might fit better empirically but demand many more samples to generalize. Finally, PAC is a lens for understanding the limits of learning: it highlights when a class is not learnable without assumptions (no-free-lunch) or when computational barriers (e.g., learning DNF under arbitrary distributions) make an otherwise PAC-learnable class intractable without additional structure.
⚠️ Common Mistakes
- Confusing statistical and computational efficiency: A class with finite VC dimension may be PAC learnable, yet finding the ERM can be NP-hard (e.g., general 0–1 loss linear classification with constraints). Always separate sample complexity from algorithmic tractability.
- Misreading ε and δ: Smaller ε (higher accuracy) and smaller δ (higher confidence) both demand more samples; reducing one while keeping the other fixed still increases sample complexity.
- Overinterpreting asymptotic O(·) bounds: Constants can be large; in practice, use validation to calibrate data needs and consider refined bounds (margins, localized complexities).
- Assuming realizability in noisy data: If labels are noisy or the target is outside H, realizable bounds are overly optimistic; switch to agnostic guarantees and algorithms minimizing surrogate losses or performing explicit 0–1 ERM only where tractable.
- Ignoring distributional assumptions: PAC guarantees are distribution-free over X, but algorithmic performance (e.g., perceptron convergence) may rely on margin or separability; be explicit about such assumptions when interpreting results.
- Equating VC dimension with parameter count: They are related but not identical; some infinite-parameter models have finite VC dimension and vice versa.
- Believing more data always fixes overfitting immediately: If H is too expressive relative to m, training error can be misleading; regularization or restricting H is needed so uniform convergence kicks in at practical sample sizes.
Key Formulas
Risk (Generalization Error)
R(h) = Pr_{(x,y)∼D}[ h(x) ≠ y ]
Explanation: Risk is the probability that hypothesis h misclassifies a fresh example drawn from the data distribution D. It is the true test error we ultimately care about.
Empirical Risk
R̂_S(h) = (1/m) · Σ_{i=1}^{m} 1[h(x_i) ≠ y_i]
Explanation: Empirical risk is the average error of h on the training sample S of size m. Uniform convergence connects this to the true risk R(h).
PAC Guarantee (Realizable)
Pr_{S∼D^m}[ R(h_ERM) ≤ ε ] ≥ 1 − δ
Explanation: In the realizable case, the ERM hypothesis has true error at most ε with probability at least 1−δ, provided enough samples are used. The probability is over the random draw of the sample.
Finite Class Realizable Sample Complexity
m ≥ (1/ε) · (ln|H| + ln(1/δ))
Explanation: For a finite hypothesis class H in the realizable case, this many samples suffice so that ERM achieves error at most ε with probability at least 1−δ. It follows from a union bound over H.
VC Realizable Sample Complexity
m = O( (d·log(1/ε) + log(1/δ)) / ε )
Explanation: If VCdim(H)=d and realizability holds, sample complexity scales linearly with 1/ε (up to a log(1/ε) factor) and only logarithmically with 1/δ. This uses Sauer's lemma and uniform convergence.
VC Agnostic Sample Complexity
m = O( (d + log(1/δ)) / ε² )
Explanation: In the agnostic case, to get within ε of the best-in-class error with probability at least 1−δ, the number of samples grows like 1/ε² times the sum of the VC dimension and the confidence term log(1/δ).
Sauer's Lemma (Growth Function)
Π_H(m) ≤ Σ_{i=0}^{d} C(m, i) ≤ (e·m/d)^d   (for m ≥ d)
Explanation: The number of distinct labelings H can realize on m points is bounded by this expression when VCdim(H)=d. It is crucial in converting bounds from finite classes to infinite classes.
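To see how slowly this polynomial bound grows compared with the 2^m labelings of an unrestricted class, the short sketch below (an illustrative addition, not from the original) evaluates Σ_{i=0}^{d} C(m, i) and the looser (em/d)^d form for d = 2, the VC dimension of 1D intervals.

#include <cmath>
#include <iostream>
using namespace std;

// Sauer's lemma bound on the growth function: Pi_H(m) <= sum_{i=0}^{d} C(m, i)
double sauer_bound(int m, int d) {
    double sum = 0.0, c = 1.0; // c tracks C(m, i), starting at C(m, 0) = 1
    for (int i = 0; i <= d && i <= m; ++i) {
        sum += c;
        c = c * (m - i) / (i + 1); // C(m, i+1) = C(m, i) * (m - i) / (i + 1)
    }
    return sum;
}

int main() {
    int d = 2; // VC dimension of 1D intervals
    for (int m : {4, 8, 16, 32, 64}) {
        double poly = sauer_bound(m, d);
        double loose = pow(exp(1.0) * m / d, d); // the (em/d)^d upper bound
        cout << "m=" << m << "  sum C(m,i)=" << poly
             << "  (em/d)^d=" << loose << "  2^m=" << pow(2.0, m) << "\n";
    }
    return 0;
}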
Hoeffding's Inequality
Pr[ |(1/m)·Σ_{i=1}^{m} X_i − E[X₁]| ≥ t ] ≤ 2·exp(−2·m·t²)   (for i.i.d. X_i ∈ [0, 1])
Explanation: For bounded i.i.d. variables, the empirical average concentrates around the true mean. It underpins many PAC bounds via uniform convergence.
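A quick Monte Carlo check (an illustrative addition; the seed and parameters are arbitrary) makes the inequality tangible: simulate sample means of fair coin flips and compare the empirical frequency of large deviations with the bound 2·exp(−2mt²).

#include <cmath>
#include <iostream>
#include <random>
using namespace std;

// Empirical check of Hoeffding: Pr[ |mean - p| >= t ] <= 2*exp(-2*m*t^2)
int main() {
    mt19937 rng(42);                  // fixed seed (arbitrary)
    bernoulli_distribution coin(0.5); // X_i in {0,1}, true mean p = 0.5
    const int m = 200, trials = 100000;
    const double t = 0.1;

    int exceed = 0;
    for (int k = 0; k < trials; ++k) {
        int heads = 0;
        for (int i = 0; i < m; ++i) heads += coin(rng);
        if (fabs(heads / (double)m - 0.5) >= t) ++exceed;
    }
    double empirical = exceed / (double)trials;
    double bound = 2.0 * exp(-2.0 * m * t * t);
    cout << "empirical Pr = " << empirical << ", Hoeffding bound = " << bound << "\n";
    return 0;
}

With m = 200 and t = 0.1 the bound evaluates to roughly 0.037, while the simulated frequency typically lands well below it; Hoeffding is distribution-free, so it is not tight for any particular distribution.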
Uniform Convergence via Growth Function
Pr[ sup_{h∈H} |R(h) − R̂_S(h)| > ε ] ≤ 4·Π_H(2m)·exp(−m·ε²/8)
Explanation: Combining Hoeffding with a union bound over the growth function controls the worst-case deviation between empirical and true errors across all hypotheses.
Generalization Gap (VC Bound)
With probability at least 1−δ, for all h ∈ H:  R(h) ≤ R̂_S(h) + √( (d·(ln(2m/d) + 1) + ln(4/δ)) / m )
Explanation: With high probability, the true error is close to the empirical error by an amount that shrinks as sample size grows and complexity (VC dimension) decreases.
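The sketch below (an illustrative addition using the classic constants reconstructed above; other sources use different constants) evaluates the gap term for a few sample sizes, showing the roughly √(d·log m / m) decay.

#include <cmath>
#include <iostream>
using namespace std;

// Gap term from the VC bound above: sqrt( (d*(ln(2m/d)+1) + ln(4/delta)) / m )
double vc_gap(int d, int m, double delta) {
    return sqrt((d * (log(2.0 * m / d) + 1.0) + log(4.0 / delta)) / m);
}

int main() {
    int d = 2; double delta = 0.05;
    for (int m : {100, 1000, 10000, 100000})
        cout << "m=" << m << " -> gap <= " << vc_gap(d, m, delta) << "\n";
    return 0;
}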
Perceptron Mistake Bound
mistakes ≤ (R/γ)²
Explanation: If data lie in a ball of radius R and are linearly separable with margin γ by a unit-norm separator, the perceptron makes at most (R/γ)² mistakes. This is a classic realizable-case learning guarantee.
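As a concrete check (an illustrative addition), the sketch below computes a margin certificate for the demo data used in the perceptron example later in this section, assuming the unit-norm separator w* = (1, 1)/√2 through the origin, picked by inspection rather than optimized. It then reports the resulting (R/γ)² mistake bound.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
using namespace std;

// Novikoff-style certificate: given a unit-norm separator w* through the
// origin, gamma = min_i y_i*(w* . x_i) and R = max_i ||x_i|| give the
// perceptron mistake bound (R/gamma)^2.
int main() {
    vector<vector<double>> X = {
        {2.0, 1.0}, {1.0, 1.5}, {2.5, 2.0}, {3.0, 1.0},
        {-1.0, -1.0}, {-2.0, -0.5}, {-1.5, -2.0}, {-2.5, -1.5}
    };
    vector<int> y = {+1,+1,+1,+1, -1,-1,-1,-1};

    // Assumed separator: w* = (1,1)/sqrt(2), chosen by eye for this data
    double ws = 1.0 / sqrt(2.0);

    double gamma = 1e300, R = 0.0;
    for (size_t i = 0; i < X.size(); ++i) {
        double dot = ws * X[i][0] + ws * X[i][1];
        gamma = min(gamma, y[i] * dot);
        R = max(R, sqrt(X[i][0]*X[i][0] + X[i][1]*X[i][1]));
    }
    cout << "R = " << R << ", gamma = " << gamma
         << ", mistake bound (R/gamma)^2 = " << (R/gamma)*(R/gamma) << "\n";
    return 0;
}

For this dataset the certificate gives γ ≈ 1.41 and R ≈ 3.20, so at most about 5 mistakes, consistent with the update counts the perceptron example reports.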
Complexity Analysis
For the threshold learner below, sorting costs O(n log n) and the naive sweep over O(n) candidate thresholds with O(n) evaluation each costs O(n²); a prefix-sum variant brings training down to O(n log n). The interval learner examines all O(n²) candidate intervals but scores each in O(1) via prefix sums, for O(n²) total after sorting. Each perceptron update is O(d), and on separable data with radius R and margin γ the number of updates is at most (R/γ)². On the statistical side, sample complexity scales as O((d + log(1/δ))/ε²) in the agnostic case and roughly O((d·log(1/ε) + log(1/δ))/ε) in the realizable case, where d is the VC dimension.
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct ThresholdModel {
    double threshold; // cut point t
    int polarity;     // +1: predict 1 if x >= t, -1: predict 1 if x <= t
};

// Predict label {0,1} given model and x
int predict(const ThresholdModel &m, double x) {
    if (m.polarity == +1) return (x >= m.threshold) ? 1 : 0;
    else return (x <= m.threshold) ? 1 : 0;
}

// Train ERM threshold on labeled data (x_i, y_i) with y in {0,1}
ThresholdModel train_threshold_erm(vector<pair<double,int>> data) {
    // Sort by x
    sort(data.begin(), data.end(), [](auto &a, auto &b){ return a.first < b.first; });
    int n = (int)data.size();
    vector<double> xs(n);
    vector<int> ys(n);
    for (int i = 0; i < n; ++i) { xs[i] = data[i].first; ys[i] = data[i].second; }

    // Candidate thresholds: midpoints between consecutive xs, plus extremes
    vector<double> cand;
    cand.reserve(n + 1);
    if (n == 0) return {0.0, +1};
    // Extreme left and right: a bit outside observed range
    cand.push_back(xs.front() - 1.0);
    for (int i = 0; i + 1 < n; ++i) {
        double mid = (xs[i] + xs[i+1]) / 2.0;
        cand.push_back(mid);
    }
    cand.push_back(xs.back() + 1.0);

    ThresholdModel best{cand[0], +1};
    int best_err = n + 1;

    // Evaluate each candidate with both polarities
    for (double t : cand) {
        for (int pol : {+1, -1}) {
            ThresholdModel m{t, pol};
            int err = 0;
            for (int i = 0; i < n; ++i) {
                if (predict(m, xs[i]) != ys[i]) ++err;
            }
            if (err < best_err) {
                best_err = err;
                best = m;
            }
        }
    }
    return best;
}

// Compute minimum sample size m for agnostic VC bound: m >= (2/eps^2)*(d*ln(2e/eps) + ln(2/delta))
// This is a practical concrete instantiation (not the only valid one)
size_t agnostic_sample_size_vc(int d, double eps, double delta) {
    if (eps <= 0 || delta <= 0 || delta >= 1) throw invalid_argument("Bad eps/delta");
    double term = d * log(2.0 * exp(1.0) / eps) + log(2.0 / delta);
    double m = (2.0 / (eps * eps)) * term;
    return (size_t)ceil(m);
}

int main() {
    // Simple demo dataset (realizable by a threshold at t=3)
    vector<pair<double,int>> train = {
        {1.0,0},{1.5,0},{2.0,0},{2.5,0},{3.0,1},{3.5,1},{4.0,1},{5.0,1}
    };

    ThresholdModel model = train_threshold_erm(train);
    cout << fixed << setprecision(3);
    cout << "Learned threshold t=" << model.threshold << ", polarity=" << model.polarity << "\n";

    // Evaluate training error
    int err = 0; for (auto &p : train) err += (predict(model, p.first) != p.second);
    cout << "Training error = " << err << "/" << train.size() << " (" << (100.0*err/train.size()) << "%)\n";

    // Show sample complexity suggestion for VC dimension d=1 (thresholds)
    int d = 1; double eps = 0.1, delta = 0.05;
    size_t m_needed = agnostic_sample_size_vc(d, eps, delta);
    cout << "Agnostic VC sample size suggestion for d=1, eps=0.1, delta=0.05: m >= " << m_needed << "\n";

    // Quick test predictions
    vector<double> test_x = {2.2, 2.9, 3.1, 4.8};
    for (double x : test_x) {
        cout << "x=" << x << " -> pred=" << predict(model, x) << "\n";
    }
    return 0;
}
This program implements ERM for 1D threshold classifiers, a classic PAC-learnable class (a one-sided threshold has VC dimension 1; allowing both polarities, as here, raises it to 2). It searches all midpoints between sorted samples and both label polarities, returning the hypothesis with minimal training error. It also includes a helper that estimates the agnostic VC sample complexity for given ε, δ, and VC dimension d, illustrating how theoretical bounds guide required data size.
#include <bits/stdc++.h>
using namespace std;

struct Perceptron {
    vector<double> w; // weights
    double b;         // bias
    double lr;        // learning rate

    Perceptron(int d, double lr_=1.0) : w(d, 0.0), b(0.0), lr(lr_) {}

    int predict(const vector<double>& x) const {
        double s = inner_product(w.begin(), w.end(), x.begin(), 0.0) + b;
        return s >= 0 ? +1 : -1;
    }

    int fit(const vector<vector<double>>& X, const vector<int>& y, int max_epochs=1000) {
        int n = (int)X.size();
        int d = (int)w.size();
        int updates = 0;
        for (int epoch = 0; epoch < max_epochs; ++epoch) {
            int mistakes = 0;
            for (int i = 0; i < n; ++i) {
                int pred = predict(X[i]);
                if (pred != y[i]) {
                    // Update: w <- w + lr * y_i * x_i, b <- b + lr * y_i
                    for (int j = 0; j < d; ++j) w[j] += lr * y[i] * X[i][j];
                    b += lr * y[i];
                    ++updates; ++mistakes;
                }
            }
            if (mistakes == 0) break; // converged on separable data
        }
        return updates;
    }
};

int main(){
    // Simple linearly separable data in R^2
    vector<vector<double>> X = {
        {2.0, 1.0}, {1.0, 1.5}, {2.5, 2.0}, {3.0, 1.0},        // class +1
        {-1.0, -1.0}, {-2.0, -0.5}, {-1.5, -2.0}, {-2.5, -1.5} // class -1
    };
    vector<int> y = {+1,+1,+1,+1, -1,-1,-1,-1};

    Perceptron clf(2, 1.0);
    int updates = clf.fit(X, y, 100);
    cout << "Updates: " << updates << "\n";
    cout << fixed << setprecision(3);
    cout << "w = [" << clf.w[0] << ", " << clf.w[1] << "], b = " << clf.b << "\n";

    // Evaluate training accuracy
    int correct = 0;
    for (size_t i = 0; i < X.size(); ++i) correct += (clf.predict(X[i]) == y[i]);
    cout << "Training accuracy: " << (100.0*correct/X.size()) << "%\n";
    return 0;
}
This is the classic perceptron algorithm for linear separators. In the realizable (separable) setting, perceptron finds a consistent classifier and enjoys a mistake bound of O((R/γ)²), aligning with PAC-style guarantees. It illustrates computational efficiency (simple updates) alongside statistical conditions (separability and margin).
#include <bits/stdc++.h>
using namespace std;

struct IntervalModel {
    double left, right; // predict 1 if x in [left, right], else 0
};

int predict(const IntervalModel &m, double x){
    return (x >= m.left && x <= m.right) ? 1 : 0;
}

// Train agnostic ERM over all intervals [x_i, x_j]
IntervalModel train_interval_erm(vector<pair<double,int>> data) {
    int n = (int)data.size();
    sort(data.begin(), data.end(), [](auto &a, auto &b){ return a.first < b.first; });
    vector<double> xs(n);
    vector<int> ys(n);
    for (int i = 0; i < n; ++i) { xs[i] = data[i].first; ys[i] = data[i].second; }

    // Prefix sums for counts of positives and negatives
    vector<int> pref_pos(n+1,0), pref_neg(n+1,0);
    for (int i = 0; i < n; ++i) {
        pref_pos[i+1] = pref_pos[i] + (ys[i]==1);
        pref_neg[i+1] = pref_neg[i] + (ys[i]==0);
    }
    auto cnt_pos = [&](int l, int r){ return pref_pos[r+1] - pref_pos[l]; };
    auto cnt_neg = [&](int l, int r){ return pref_neg[r+1] - pref_neg[l]; };

    // Candidate interval endpoints chosen among data points; use midpoints for prediction boundaries
    int best_err = n + 1; IntervalModel best{0,0};
    // Also consider empty interval (always predict 0) and full interval (always predict 1)
    int err_all0 = pref_pos[n];
    if (err_all0 < best_err) { best_err = err_all0; best = {+numeric_limits<double>::infinity(), -numeric_limits<double>::infinity()}; }
    int err_all1 = pref_neg[n];
    if (err_all1 < best_err) { best_err = err_all1; best = {-numeric_limits<double>::infinity(), +numeric_limits<double>::infinity()}; }

    for (int i = 0; i < n; ++i) {
        for (int j = i; j < n; ++j) {
            // Errors: negatives inside [i..j] + positives outside (total_pos - pos_in)
            int neg_in = cnt_neg(i, j);
            int pos_in = cnt_pos(i, j);
            int pos_out = pref_pos[n] - pos_in;
            int err = neg_in + pos_out;
            if (err < best_err) {
                best_err = err;
                // Set boundaries between points for deterministic predictions
                double left = (i==0) ? xs[i] - 1.0 : (xs[i-1] + xs[i]) / 2.0;
                double right = (j==n-1) ? xs[j] + 1.0 : (xs[j] + xs[j+1]) / 2.0;
                best = {left, right};
            }
        }
    }
    return best;
}

int main(){
    // Mixed/noisy labels: not perfectly representable by a single interval
    vector<pair<double,int>> train = {
        {0.0,0},{0.5,0},{1.0,1},{1.5,1},{2.0,1},{2.5,0},{3.0,0},{3.5,1},{4.0,0}
    };

    IntervalModel m = train_interval_erm(train);
    cout << fixed << setprecision(3);
    cout << "Learned interval: [" << m.left << ", " << m.right << "]\n";

    int err = 0; for (auto &p : train) err += (predict(m, p.first) != p.second);
    cout << "Training error = " << err << "/" << train.size() << " (" << (100.0*err/train.size()) << "%)\n";

    vector<double> test_x = {0.25, 1.25, 2.25, 3.25};
    for (double x : test_x) cout << "x=" << x << " -> pred=" << predict(m, x) << "\n";
    return 0;
}
This program performs agnostic ERM over the class of 1D intervals (predict 1 inside, 0 outside). It evaluates all O(m²) intervals efficiently using prefix sums to compute the number of misclassifications for each candidate. Intervals have VC dimension 2, so this class is PAC learnable; the algorithm is polynomial-time and demonstrates agnostic learning because it minimizes empirical 0–1 loss without assuming realizability.