PAC-Bayes Theory
Key Points
- •PAC-Bayes provides high-probability generalization bounds for randomized predictors by comparing a data-dependent posterior Q to a fixed, data-independent prior P through KL(Q||P).
- •The bound says the expected true risk under Q is close to the expected empirical risk under Q, plus a complexity penalty that depends on KL(Q||P), the sample size n, and the confidence level δ.
- •A tight form uses the binary relative entropy: kl(E_Q[\hat R] || E_Q[R]) ≤ (KL(Q||P) + ln(2√n/δ)) / n.
- •When Q concentrates on good hypotheses (e.g., flat minima in neural networks), KL(Q||P) can be small, yielding bounds much tighter than uniform convergence.
- •You can choose P to reflect prior beliefs before seeing data, and design Q (e.g., Gaussian over weights) during training, then evaluate the bound post hoc.
- •PAC-Bayes naturally connects to compression and MDL: a compressed model of L bits gives KL = L ln 2 nats, turning code length into a generalization guarantee.
- •It applies directly to neural networks by taking Q as a distribution over weights and estimating E_Q[\hat R] via sampling (reparameterization trick).
- •Modern theory uses PAC-Bayes to explain why flat minima generalize: wide regions of low loss allow low-KL posteriors that keep empirical risk small.
Prerequisites
- →Probability basics — Expectations, independence, and distributions are needed to define risks and randomized predictors.
- →Kullback–Leibler divergence — KL(Q||P) is the core complexity term in PAC-Bayes; understanding its meaning and units is essential.
- →Empirical risk minimization (ERM) — PAC-Bayes relates empirical (training) risk to true (test) risk; ERM provides the context.
- →Concentration inequalities — PAC-Bayes proofs use concentration and change-of-measure arguments akin to Chernoff/Hoeffding bounds.
- →Bayesian inference / variational methods — Choosing and working with a posterior Q (e.g., Gaussian over weights) mirrors Bayesian and variational practice.
- →Optimization basics — Training models and potentially tuning Q requires gradient-based optimization knowledge.
- →Linear classifiers and logistic regression — Serve as simple hypothesis classes to demonstrate PAC-Bayes bounds concretely.
- →C++ programming with random number generation — Implementing Monte Carlo estimates and sampling from Gaussian posteriors uses RNGs and numerical routines.
Detailed Explanation
01 Overview
PAC-Bayes is a framework for understanding why machine learning models trained on finite data can still perform well on unseen data. Instead of analyzing a single, fixed model, PAC-Bayes studies randomized predictors: we first pick a hypothesis h at random from a posterior distribution Q (which depends on the data), and then use h to predict. The key insight is that generalization is controlled by how much Q departs from a data-independent prior P, measured by Kullback–Leibler divergence KL(Q||P). If Q is close to P (small divergence), then with high probability the expected true risk under Q is close to the expected training risk under Q, up to a term that shrinks with sample size n and increases with confidence 1−δ. This yields data-dependent bounds that can be much sharper than classical uniform convergence bounds, especially when Q places mass on a small, good region of hypothesis space. In deep learning, we commonly choose Q to be a Gaussian over weights (with mean at a trained solution and certain variance), and evaluate the bound by sampling models from Q to estimate the empirical risk. PAC-Bayes also unifies with compression and Minimum Description Length (MDL): a short code for a model (few bits) corresponds to a small KL penalty, giving a principled route from compressibility to generalization.
02 Intuition & Analogies
Imagine you are buying an insurance policy for your model’s performance. The insurer asks two things: (1) how risky is your plan to pick a particular model? and (2) how much evidence do you have that your plan works on the data you saw? The plan, in PAC-Bayes, is a randomized strategy: instead of committing to one model, you draw one from a distribution Q. Your pre-data beliefs are encoded in a prior P; after seeing data, you update to Q. The insurer charges you a premium proportional to how far Q moved from P—this is KL(Q||P). If you barely changed your mind (small KL), you pay a small premium. Your proof that the plan works is your training performance E_Q[\hat R], averaged over draws from Q. The final guarantee says: your test performance E_Q[R] will be close to your training performance E_Q[\hat R], plus a surcharge that grows with the premium (KL) and shrinks with more data (n). Why does this capture flat minima? Picture balancing a marble in a valley. A flat valley lets you wiggle the marble (sample different models around the solution) without increasing loss much. That means you can use a relatively broad Q (spread out posterior) and still have low empirical risk. However, a broader Q also increases KL unless P is broad too; in a flat valley, you can still keep KL modest because the posterior doesn’t need to concentrate razor-thinly to stay accurate. In contrast, a sharp minimum forces Q to be very narrow to keep training error low; narrow Q often means larger KL relative to a simple prior, worsening the bound.
03 Formal Definition
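A standard statement of the bound used throughout this section (for bounded losses and a data-independent prior) is the following. Let S = {(x_i, y_i)}_{i=1}^n be drawn i.i.d. from a distribution D, let the loss ℓ take values in [0,1], and fix a prior P over the hypothesis class H before seeing S. For any distribution Q over H, define the Gibbs risk R(Q) = E_{h∼Q} E_{(x,y)∼D}[ ℓ(h(x), y) ] and the empirical Gibbs risk \hat R(Q) = E_{h∼Q}[ (1/n) Σ_{i=1}^n ℓ(h(x_i), y_i) ]. Then, with probability at least 1−δ over the draw of S, simultaneously for all posteriors Q (which may depend on S),
kl( \hat R(Q) || R(Q) ) ≤ ( KL(Q||P) + ln(2√n/δ) ) / n,
where kl(a||b) = a ln(a/b) + (1−a) ln((1−a)/(1−b)). Inverting kl in its second argument (numerically, or loosely via Pinsker's inequality) yields the explicit bounds on R(Q) listed under Key Formulas.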
04 When to Use
Use PAC-Bayes when you have (or can posit) a distribution over models and want a data-dependent generalization guarantee. Typical situations include: (1) Bayesian or variational training of neural networks, where Q is a Gaussian over weights and P is a simple zero-mean Gaussian prior; (2) ensemble or stochastic predictors (e.g., dropout), where prediction itself is randomized; (3) analyzing flat minima by measuring loss under random perturbations around a solution—if perturbed models remain good, E_Q[\hat R] is low for a reasonably spread Q; (4) compression-based analysis: if your trained model compresses to L bits, treat P(h)=2^{-L} (or use a coding scheme) to turn code length into a KL term and obtain a guarantee; (5) hyperparameter or architecture selection with a hyper-prior: you can pick a prior from a family using a held-out split or pay an extra KL to formalize the choice. Prefer PAC-Bayes over uniform VC/Rademacher bounds when the posterior Q is highly concentrated on a small, good region of H, making KL(Q||P) small even if H is huge (as in deep nets). Avoid it if you cannot sensibly specify a data-independent prior (or hyper-prior), if your losses are unbounded without using suitable variants (e.g., Catoni), or if you only care about a single deterministic predictor without relating it to a Gibbs/averaged predictor (though majority-vote and deterministic conversions exist).
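As a concrete sketch of item (5): choosing the prior's scale from a small pre-specified grid can be folded into the bound by replacing the prior with the uniform mixture over the grid, which adds at most ln k nats to the complexity term. The snippet below illustrates that accounting for isotropic Gaussian priors and a diagonal-Gaussian posterior; the helper names (kl_isotropic, complexity_with_prior_grid) and the toy numbers are illustrative choices for this sketch, not a standard API.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <vector>

// KL( N(mu, sigma_q^2 I) || N(0, sigma_p^2 I) ) in nats, one term per coordinate.
double kl_isotropic(const std::vector<double>& mu, double sigma_q, double sigma_p) {
    double kl = 0.0;
    for (double m : mu) {
        kl += 0.5 * ((sigma_q * sigma_q + m * m) / (sigma_p * sigma_p) - 1.0
                     + 2.0 * std::log(sigma_p / sigma_q));
    }
    return kl;
}

// Selecting among k candidate prior scales is formalized by using the uniform mixture
// prior (1/k) * sum_i P_i: its density is at least P_i(h)/k, so
// KL(Q || mixture) <= min_i KL(Q || P_i) + ln k. Only ln k is added to the complexity.
double complexity_with_prior_grid(const std::vector<double>& mu, double sigma_q,
                                  const std::vector<double>& prior_scales) {
    double best = std::numeric_limits<double>::infinity();
    for (double sp : prior_scales) {
        best = std::min(best, kl_isotropic(mu, sigma_q, sp));
    }
    return best + std::log(static_cast<double>(prior_scales.size()));
}

int main() {
    std::vector<double> mu(20, 0.2);                    // toy posterior mean
    std::vector<double> scales = {0.1, 0.3, 1.0, 3.0};  // candidate prior standard deviations
    std::cout << "Complexity term with prior grid (nats): "
              << complexity_with_prior_grid(mu, 0.2, scales) << "\n";
    return 0;
}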
⚠️Common Mistakes
• Data-dependent prior without correction: Choosing P after seeing S invalidates the standard bound. Fix P before seeing data, use a hyper-prior π and pay an extra KL(P||π) term, or split the data to select P.
• Mixing units: KL must be in nats when using the natural logarithm. If you measure compression in bits, multiply by ln 2.
• Estimating E_Q[\hat R] poorly: Using too few posterior samples makes the Monte Carlo estimate high-variance and loosens the bound. Increase the number of samples, use common random numbers, ensure reproducibility, and account for the estimation error (see the sketch after this list).
• Misapplying the square-root bound: The simple \sqrt{\cdot} form assumes bounded (often 0–1) losses; for real-valued losses, use Catoni/Bernstein PAC-Bayes variants.
• Ignoring confidence δ: Picking δ after looking at results is multiple testing. Decide δ a priori, or correct with union bounds across many evaluations.
• Forgetting the predictor is Gibbs: The bound is on the randomized (draw-then-predict) strategy. To bound a deterministic predictor, use majority-vote/C-bounds or relate the deterministic model to a nearby randomized one and argue the risks are close.
• Overfitting to the bound: Tuning Q to minimize the numerical bound on the same S without care can overfit the certificate. Use validation, report δ, and avoid repeated peeking without correction.
• Inconsistent priors: A very narrow prior P mismatched to plausible solutions inflates KL(Q||P). Choose P to reflect symmetries (e.g., zero-mean isotropic Gaussians over weights) to keep the KL manageable.
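On the Monte Carlo point above: each sampled model's empirical risk lies in [0,1], so a one-sided Hoeffding bound converts the M-sample average into a high-probability upper bound on E_Q[\hat R] at the cost of a small extra slice of confidence. A minimal sketch of that correction (the function name and the particular δ split are illustrative choices):

#include <cmath>
#include <cstddef>
#include <iostream>

// Upper-bound E_Q[\hat R] from an M-sample Monte Carlo average of risks in [0,1].
// One-sided Hoeffding: with probability at least 1 - delta_mc over the posterior samples,
//   E_Q[\hat R] <= mc_mean + sqrt( ln(1/delta_mc) / (2M) ).
double mc_upper_bound(double mc_mean, std::size_t M, double delta_mc) {
    return mc_mean + std::sqrt(std::log(1.0 / delta_mc) / (2.0 * static_cast<double>(M)));
}

int main() {
    double mc_mean = 0.11;   // observed average empirical risk over M posterior samples
    std::size_t M = 200;     // number of posterior samples
    double delta_pb = 0.045; // confidence budget spent on the PAC-Bayes bound itself
    double delta_mc = 0.005; // confidence budget spent on the Monte Carlo estimate
    // Feed the inflated estimate into the (monotone) PAC-Bayes bound; by a union bound
    // the resulting certificate holds with probability at least 1 - delta_pb - delta_mc.
    std::cout << "High-probability upper estimate of E_Q[hat R]: "
              << mc_upper_bound(mc_mean, M, delta_mc) << "\n";
    std::cout << "Confidence left for the PAC-Bayes bound itself: delta = " << delta_pb << "\n";
    return 0;
}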
Key Formulas
Gibbs (posterior-averaged) risk
R(Q) = E_{h∼Q}[ R(h) ], where R(h) = E_{(x,y)∼D}[ ℓ(h(x), y) ]
Explanation: This is the expected true risk when you first draw a model from the posterior Q and then evaluate its loss on the data distribution. It is the main target of PAC-Bayes bounds.
Empirical Gibbs risk
\hat R(Q) = E_{h∼Q}[ \hat R(h) ], where \hat R(h) = (1/n) Σ_{i=1}^n ℓ(h(x_i), y_i)
Explanation: This is the expected training loss averaged over models sampled from Q. It is observable from data and approximated by Monte Carlo.
PAC-Bayes-kl (McAllester)
kl( \hat R(Q) || R(Q) ) ≤ ( KL(Q||P) + ln(2√n/δ) ) / n
Explanation: With probability at least 1−δ over the sample, this inequality holds simultaneously for all posteriors Q. It tightly links training error, test error, and the complexity term KL(Q||P).
Square-root PAC-Bayes bound
R(Q) ≤ \hat R(Q) + √( ( KL(Q||P) + ln(2√n/δ) ) / (2n) )
Explanation: A convenient, explicit upper bound on the true risk obtained by inverting the kl-inequality via Pinsker-type arguments. It is slightly looser but easy to compute.
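The inversion step is short enough to spell out: Pinsker's inequality gives kl(a||b) ≥ 2(a−b)², so with c = ( KL(Q||P) + ln(2√n/δ) ) / n,
kl( \hat R(Q) || R(Q) ) ≤ c  ⟹  2 ( R(Q) − \hat R(Q) )² ≤ c  ⟹  R(Q) ≤ \hat R(Q) + √( c / 2 ),
which is exactly the square-root form above.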
KL divergence (definition)
KL(Q||P) = E_{h∼Q}[ ln( q(h) / p(h) ) ] = ∫ q(h) ln( q(h) / p(h) ) dh (for densities q, p)
Explanation: Measures how far the posterior Q moves from the prior P, in nats when using natural logs. It acts as the complexity penalty in PAC-Bayes bounds.
KL between diagonal Gaussians
KL( N(μ_q, diag(σ_q²)) || N(μ_p, diag(σ_p²)) ) = (1/2) Σ_{j=1}^d [ ( σ_{q,j}² + (μ_{q,j} − μ_{p,j})² ) / σ_{p,j}² − 1 + ln( σ_{p,j}² / σ_{q,j}² ) ]
Explanation: Closed-form KL used when Q and P are factorized Gaussians over parameters. It scales linearly with dimension and is cheap to compute.
Binary relative entropy
kl(a || b) = a ln(a/b) + (1−a) ln( (1−a)/(1−b) )
Explanation: This is the KL divergence between Bernoulli(a) and Bernoulli(b). It appears in the tight PAC-Bayes-kl bound for 0–1 losses.
Reparameterization
w = μ + σ ⊙ ε, with ε ∼ N(0, I)
Explanation: Sampling from a diagonal Gaussian posterior is implemented by perturbing the mean with scaled standard normal noise. This enables Monte Carlo estimates of E_Q[\hat R].
Deterministic posterior KL
KL( δ_h || P ) = ln( 1 / P(h) )
Explanation: If Q is a point mass at h, the KL reduces to the negative log-prior probability of h. Under a coding prior P(h)=2^{-L}, this equals L ln 2, linking compression to generalization.
Catoni-style PAC-Bayes
One common λ-parameterized form (losses in [0,1], λ > 0 fixed in advance): R(Q) ≤ \hat R(Q) + λ/(8n) + ( KL(Q||P) + ln(1/δ) ) / λ
Explanation: A sharper bound for bounded real-valued losses using an exponential transform. The parameter λ balances fit and complexity.
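A minimal numeric sketch of the λ-parameterized form quoted above, assuming losses in [0,1] and a λ chosen before looking at the data (the function name and toy numbers are illustrative):

#include <cmath>
#include <iostream>

// lambda-parameterized PAC-Bayes bound for losses in [0,1], lambda > 0 fixed in advance:
//   R(Q) <= hat_R(Q) + lambda / (8n) + ( KL(Q||P) + ln(1/delta) ) / lambda.
// The unconstrained minimizer is lambda = sqrt(8n (KL + ln(1/delta))), but tuning lambda on
// the same data requires a union bound over a pre-specified grid of candidate values.
double pac_bayes_lambda_bound(double hat_R, double KL_nats, double n, double delta, double lambda) {
    return hat_R + lambda / (8.0 * n) + (KL_nats + std::log(1.0 / delta)) / lambda;
}

int main() {
    double hat_R = 0.05;  // empirical Gibbs risk
    double KL = 25.0;     // KL(Q||P) in nats
    double n = 10000.0;   // sample size
    double delta = 0.05;  // confidence level
    double lambda = std::sqrt(n); // a simple a-priori choice of lambda
    std::cout << "lambda = " << lambda << ", bound: R(Q) <= "
              << pac_bayes_lambda_bound(hat_R, KL, n, delta, lambda) << "\n";
    return 0;
}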
Complexity Analysis
Evaluating a PAC-Bayes certificate is cheap compared with training. In the diagonal-Gaussian setup below, the closed-form KL(Q||P) costs O(d); the Monte Carlo estimate of E_Q[\hat R] with M posterior samples costs M full passes over the data, i.e. O(M·n·d) for a linear model; and inverting the binary kl by bisection adds only a fixed number of O(1) iterations. Memory is O(n·d) for the dataset plus O(d) for the distribution parameters.
Code Examples
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <algorithm>
#include <numeric>
#include <limits>

// Utility: dot product
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Sigmoid for logistic prediction (not strictly needed for 0-1 with sign, but illustrative)
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Generate a synthetic binary classification dataset
struct Dataset {
    std::vector<std::vector<double>> X; // n x d
    std::vector<int> y;                 // n labels in {0,1}
};

Dataset make_synthetic(size_t n, size_t d, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> N01(0.0, 1.0);
    // True separating hyperplane
    std::vector<double> w_true(d, 0.0);
    for (size_t j = 0; j < d; ++j) w_true[j] = N01(rng);
    Dataset data;
    data.X.resize(n, std::vector<double>(d, 0.0));
    data.y.resize(n, 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < d; ++j) data.X[i][j] = N01(rng);
        double z = dot(data.X[i], w_true) + 0.3 * N01(rng); // add label noise
        data.y[i] = (z > 0.0) ? 1 : 0;
    }
    return data;
}

// Diagonal Gaussian distribution over parameters
struct DiagGaussian {
    std::vector<double> mu;      // mean (size d)
    std::vector<double> log_std; // log standard deviation (size d)
};

// Sample from a diagonal Gaussian: w = mu + exp(log_std) * eps (reparameterization)
std::vector<double> sample(const DiagGaussian& Q, std::mt19937& rng) {
    std::normal_distribution<double> N01(0.0, 1.0);
    size_t d = Q.mu.size();
    std::vector<double> w(d);
    for (size_t j = 0; j < d; ++j) {
        double eps = N01(rng);
        double sigma = std::exp(Q.log_std[j]);
        w[j] = Q.mu[j] + sigma * eps;
    }
    return w;
}

// KL(Q || P) for diagonal Gaussians, in nats
// Q = N(mu_q, diag(sigma_q^2)), P = N(mu_p, diag(sigma_p^2))
double kl_diag_gaussians(const DiagGaussian& Q, const DiagGaussian& P) {
    size_t d = Q.mu.size();
    double kl = 0.0;
    for (size_t j = 0; j < d; ++j) {
        double muq = Q.mu[j], mup = P.mu[j];
        double sq = std::exp(2.0 * Q.log_std[j]);
        double sp = std::exp(2.0 * P.log_std[j]);
        // 0.5 * [ (sq + (muq-mup)^2)/sp - 1 + ln(sp/sq) ]
        double term = (sq + (muq - mup) * (muq - mup)) / sp - 1.0 + 2.0 * (P.log_std[j] - Q.log_std[j]);
        kl += 0.5 * term;
    }
    return kl;
}

// 0-1 empirical risk for a fixed weight vector on the dataset
// Predict 1 if dot(w,x) >= 0, else 0
double empirical_01_risk(const std::vector<double>& w, const Dataset& data) {
    size_t n = data.X.size();
    size_t mistakes = 0;
    for (size_t i = 0; i < n; ++i) {
        double z = dot(w, data.X[i]);
        int pred = (z >= 0.0) ? 1 : 0;
        mistakes += (pred != data.y[i]);
    }
    return static_cast<double>(mistakes) / static_cast<double>(n);
}

// Monte Carlo estimate of E_Q[\hat R] by sampling M weight vectors from Q
double monte_carlo_empirical_risk(const DiagGaussian& Q, const Dataset& data, size_t M, unsigned seed = 123) {
    std::mt19937 rng(seed);
    double sum = 0.0;
    for (size_t m = 0; m < M; ++m) {
        auto w = sample(Q, rng);
        sum += empirical_01_risk(w, data);
    }
    return sum / static_cast<double>(M);
}

// Binary relative entropy kl(a || b) with numerical safeguards
double kl_bernoulli(double a, double b) {
    const double eps = 1e-12;
    a = std::clamp(a, eps, 1.0 - eps);
    b = std::clamp(b, eps, 1.0 - eps);
    return a * std::log(a / b) + (1.0 - a) * std::log((1.0 - a) / (1.0 - b));
}

// Invert kl(a || b) <= c: find the largest b in [a, 1) satisfying the inequality.
// Since kl(a || b) is increasing in b on [a, 1), bisection applies; the result is an
// upper bound on the true risk.
double inverse_kl_upper(double a, double c) {
    double lo = a;           // kl(a||lo) = 0 <= c always holds
    double hi = 1.0 - 1e-12; // kl(a||hi) is very large
    for (int it = 0; it < 100; ++it) {
        double mid = 0.5 * (lo + hi);
        if (kl_bernoulli(a, mid) <= c) {
            lo = mid; // mid still satisfies the inequality; search higher
        } else {
            hi = mid; // mid violates it; search lower
        }
    }
    return hi; // largest b (up to bisection tolerance) with kl(a||b) <= c
}

int main() {
    // 1) Build data
    size_t n = 1000; // number of examples
    size_t d = 20;   // dimension
    Dataset data = make_synthetic(n, d, 7);

    // 2) Define prior P and posterior Q (diagonal Gaussians)
    DiagGaussian P; P.mu.assign(d, 0.0); P.log_std.assign(d, std::log(1.0)); // N(0, I)

    DiagGaussian Q; Q.mu.assign(d, 0.0); Q.log_std.assign(d, std::log(0.2));
    // Set Q's mean to a rough linear classifier (simple heuristic):
    // w = sum_i (2*y_i - 1) * x_i, then normalize
    for (size_t i = 0; i < n; ++i) {
        double t = (data.y[i] ? 1.0 : -1.0);
        for (size_t j = 0; j < d; ++j) Q.mu[j] += t * data.X[i][j];
    }
    double norm = std::sqrt(dot(Q.mu, Q.mu) + 1e-12);
    for (size_t j = 0; j < d; ++j) Q.mu[j] /= norm;

    // 3) Estimate E_Q[\hat R] by Monte Carlo
    size_t M = 200; // number of posterior samples
    double hat_R_Q = monte_carlo_empirical_risk(Q, data, M);

    // 4) Compute KL(Q||P)
    double KL_QP = kl_diag_gaussians(Q, P);

    // 5) Compute PAC-Bayes bounds
    double delta = 0.05; // 95% confidence
    double penalty = (KL_QP + std::log(2.0 * std::sqrt(static_cast<double>(n)) / delta)) / static_cast<double>(n);

    // Square-root explicit bound (slightly looser)
    double bound_sqrt = hat_R_Q + std::sqrt(0.5 * penalty);

    // Tight kl-inverse bound: largest R such that kl(hat_R_Q || R) <= penalty
    double bound_klinverse = inverse_kl_upper(hat_R_Q, penalty);

    // 6) Report
    std::cout << "n = " << n << ", d = " << d << ", M = " << M << "\n";
    std::cout << "Empirical Gibbs risk E_Q[hat R] = " << hat_R_Q << "\n";
    std::cout << "KL(Q||P) [nats] = " << KL_QP << "\n";
    std::cout << "delta = " << delta << ", penalty = " << penalty << "\n";
    std::cout << "PAC-Bayes sqrt bound: R_Q <= " << bound_sqrt << "\n";
    std::cout << "PAC-Bayes kl-inverse bound: R_Q <= " << bound_klinverse << "\n";

    return 0;
}
This program builds a synthetic binary dataset, defines a prior P and posterior Q as diagonal Gaussians over a linear classifier’s weights, and estimates the empirical Gibbs risk E_Q[\hat R] by sampling M weight vectors from Q. It computes the closed-form KL(Q||P) between diagonal Gaussians, then evaluates two PAC-Bayes bounds: a convenient square-root form and a tighter kl-inverse bound obtained by solving kl(\hat R_Q || R_Q) ≤ (KL+log(2√n/δ))/n via bisection. This directly demonstrates how small empirical error and small KL lead to small generalization error under the randomized (Gibbs) classifier.
#include <iostream>
#include <cmath>
#include <algorithm>

// Given: sample size n, empirical risk hat_R in [0,1], total code length L_bits, and confidence delta.
// Returns: square-root PAC-Bayes bound using KL = L_bits * ln 2 (nats) for a point-mass posterior.
double pac_bayes_compression_bound(int n, double hat_R, long long L_bits, double delta) {
    const double LN2 = std::log(2.0);
    double KL = static_cast<double>(L_bits) * LN2; // convert bits to nats
    double penalty = (KL + std::log(2.0 * std::sqrt(static_cast<double>(n)) / delta)) / static_cast<double>(n);
    double bound = hat_R + std::sqrt(0.5 * penalty);
    return std::min(1.0, std::max(0.0, bound)); // a risk bound above 1 is vacuous; clamp to [0,1]
}

int main() {
    int n = 50000;                // number of training examples
    double hat_R = 0.02;          // 2% training error of the compressed model
    int d = 1'000'000;            // number of weights
    int bits_per_weight = 4;      // quantization to 16 levels
    long long header_bits = 1024; // overhead for codebook, shapes, etc.

    long long L_bits = static_cast<long long>(d) * bits_per_weight + header_bits;
    double delta = 0.05; // 95% confidence

    // Note: with roughly 4 million code bits and only 50,000 examples the certificate clamps
    // to 1 (vacuous); a nonvacuous bound needs a code length small relative to n,
    // i.e. much more aggressive compression or far more data.
    double bound = pac_bayes_compression_bound(n, hat_R, L_bits, delta);

    std::cout << "Compression-based PAC-Bayes bound (point posterior):\n";
    std::cout << "n = " << n << ", hat_R = " << hat_R << ", L_bits = " << L_bits << ", delta = " << delta << "\n";
    std::cout << "Generalization certificate: R <= " << bound << "\n";
    return 0;
}
This example shows the MDL connection: a deterministic (compressed) model corresponds to a point-mass posterior Q=δ_h. If the coding scheme assigns probability P(h)=2^{-L} to a code of L bits, then KL(Q||P)=ln(1/P(h))=L ln 2 nats. Plugging this into the PAC-Bayes square-root bound yields a generalization guarantee purely from training error, sample size, confidence, and code length. The certificate is only nonvacuous when the code length is small relative to n; with the settings above (about 4 million code bits against 50,000 examples) the bound clamps to 1, which is why aggressive quantization and pruning matter when certifying large networks this way.