
PAC-Bayes Theory

Key Points

  • PAC-Bayes provides high-probability generalization bounds for randomized predictors by comparing a data-dependent posterior Q to a fixed, data-independent prior P through KL(Q||P).
  • The bound says the expected true risk under Q is close to the expected empirical risk under Q, plus a complexity penalty that depends on KL(Q||P), the sample size n, and the confidence level 1−δ.
  • A tight form uses the binary relative entropy: kl(E_Q[\hat R] || E_Q[R]) ≤ (KL(Q||P) + ln(2√n/δ)) / n.
  • When Q concentrates on good hypotheses (e.g., flat minima in neural networks), KL(Q||P) can be small, yielding bounds much tighter than uniform convergence.
  • You can choose P to reflect prior beliefs before seeing data, and design Q (e.g., Gaussian over weights) during training, then evaluate the bound post hoc.
  • PAC-Bayes naturally connects to compression and MDL: a compressed model of L bits gives KL ≈ L ln 2 nats, turning code length into a generalization guarantee.
  • It applies directly to neural networks by taking Q as a distribution over weights and estimating E_Q[\hat R] via sampling (reparameterization trick).
  • Modern theory uses PAC-Bayes to explain why flat minima generalize: wide regions of low loss allow low-KL posteriors that keep empirical risk small.

Prerequisites

  • Probability basics: Expectations, independence, and distributions are needed to define risks and randomized predictors.
  • Kullback–Leibler divergence: KL(Q||P) is the core complexity term in PAC-Bayes; understanding its meaning and units is essential.
  • Empirical risk minimization (ERM): PAC-Bayes relates empirical (training) risk to true (test) risk; ERM provides the context.
  • Concentration inequalities: PAC-Bayes proofs use concentration and change-of-measure arguments akin to Chernoff/Hoeffding bounds.
  • Bayesian inference / variational methods: Choosing and working with a posterior Q (e.g., a Gaussian over weights) mirrors Bayesian and variational practice.
  • Optimization basics: Training models and potentially tuning Q requires gradient-based optimization knowledge.
  • Linear classifiers and logistic regression: These serve as simple hypothesis classes to demonstrate PAC-Bayes bounds concretely.
  • C++ programming with random number generation: Implementing Monte Carlo estimates and sampling from Gaussian posteriors uses RNGs and numerical routines.

Detailed Explanation


01 Overview

PAC-Bayes is a framework for understanding why machine learning models trained on finite data can still perform well on unseen data. Instead of analyzing a single, fixed model, PAC-Bayes studies randomized predictors: we first pick a hypothesis h at random from a posterior distribution Q (which depends on the data), and then use h to predict. The key insight is that generalization is controlled by how much Q departs from a data-independent prior P, measured by Kullback–Leibler divergence KL(Q||P). If Q is close to P (small divergence), then with high probability the expected true risk under Q is close to the expected training risk under Q, up to a term that shrinks with sample size n and increases with confidence 1−δ. This yields data-dependent bounds that can be much sharper than classical uniform convergence bounds, especially when Q places mass on a small, good region of hypothesis space. In deep learning, we commonly choose Q to be a Gaussian over weights (with mean at a trained solution and a chosen variance), and evaluate the bound by sampling models from Q to estimate the empirical risk. PAC-Bayes also unifies with compression and Minimum Description Length (MDL): a short code for a model (few bits) corresponds to a small KL penalty, giving a principled route from compressibility to generalization.
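
To make the draw-then-predict step concrete, here is a minimal sketch (not from the original text) of a Gibbs predictor in C++, assuming a diagonal Gaussian posterior over the weights of a linear classifier, the same setup the full example below uses: each call samples a fresh weight vector from Q via the reparameterization w = mu + sigma*eps and then classifies with it.

#include <random>
#include <vector>

// Minimal Gibbs predictor: draw w ~ Q = N(mu, diag(sigma^2)), then predict sign(w·x).
// mu and sigma describe the posterior Q; a fresh draw is made for every prediction.
int gibbs_predict(const std::vector<double>& x,
                  const std::vector<double>& mu,
                  const std::vector<double>& sigma,
                  std::mt19937& rng) {
    std::normal_distribution<double> N01(0.0, 1.0);
    double z = 0.0;
    for (size_t j = 0; j < x.size(); ++j) {
        double wj = mu[j] + sigma[j] * N01(rng); // reparameterization: w = mu + sigma*eps
        z += wj * x[j];
    }
    return (z >= 0.0) ? 1 : 0;
}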

02 Intuition & Analogies

Imagine you are buying an insurance policy for your model’s performance. The insurer asks two things: (1) how risky is your plan to pick a particular model? and (2) how much evidence do you have that your plan works on the data you saw? The plan, in PAC-Bayes, is a randomized strategy: instead of committing to one model, you draw one from a distribution Q. Your pre-data beliefs are encoded in a prior P; after seeing data, you update to Q. The insurer charges you a premium proportional to how far Q moved from P—this is KL(Q||P). If you barely changed your mind (small KL), you pay a small premium. Your proof that the plan works is your training performance E_Q[\hat R], averaged over draws from Q. The final guarantee says: your test performance E_Q[R] will be close to your training performance E_Q[\hat R], plus a surcharge that grows with the premium (KL) and shrinks with more data (n). Why does this capture flat minima? Picture balancing a marble in a valley. A flat valley lets you wiggle the marble (sample different models around the solution) without increasing loss much. That means you can use a relatively broad Q (spread out posterior) and still have low empirical risk. However, a broader Q also increases KL unless P is broad too; in a flat valley, you can still keep KL modest because the posterior doesn’t need to concentrate razor-thinly to stay accurate. In contrast, a sharp minimum forces Q to be very narrow to keep training error low; narrow Q often means larger KL relative to a simple prior, worsening the bound.

03 Formal Definition

Let \(H\) be a hypothesis class and \(\ell: \mathcal{Y} \times \mathcal{Y} \to [0,1]\) a bounded loss (e.g., 0–1). A randomized (Gibbs) predictor draws \(h \sim Q\) and predicts using h. For a data distribution \(\mathcal{D}\) over \(\mathcal{X} \times \mathcal{Y}\) and an i.i.d. sample \(S=\{(x_i,y_i)\}_{i=1}^{n}\), define the true risk of h, \(R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x),y)]\), and the empirical risk, \(\hat R(h) = \frac{1}{n}\sum_{i=1}^{n}\ell(h(x_i),y_i)\). The Gibbs (posterior-averaged) risks are \(R_Q = \mathbb{E}_{h\sim Q}[R(h)]\) and \(\hat R_Q = \mathbb{E}_{h\sim Q}[\hat R(h)]\). Given a data-independent prior distribution P over H and any posterior Q (possibly chosen after seeing S), the classical PAC-Bayes-kl bound (McAllester) states that with probability at least \(1-\delta\) over the draw of S, for all Q: \[ \mathrm{kl}(\hat R_Q \,\|\, R_Q) \le \frac{\mathrm{KL}(Q\|P) + \ln\!\left(\tfrac{2\sqrt{n}}{\delta}\right)}{n}, \] where \(\mathrm{kl}(a\|b) = a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b}\) is the binary relative entropy and \(\mathrm{KL}(Q\|P) = \mathbb{E}_{h\sim Q}\!\left[\log\tfrac{dQ}{dP}(h)\right]\) is measured in nats. Inverted to bound \(R_Q\), a convenient (though slightly looser) form is \[ R_Q \le \hat R_Q + \sqrt{\frac{\mathrm{KL}(Q\|P) + \ln\!\left(\tfrac{2\sqrt{n}}{\delta}\right)}{2n}}. \] Numerous variants exist (Catoni/Bernstein forms for real-valued losses, hierarchical or data-dependent priors via extra KL terms, and bounds on majority votes), but all share the same skeleton: data fit under Q plus a KL complexity penalty.
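
As a quick numerical instantiation of the square-root form (the numbers are purely illustrative): take \(n = 10000\), \(\mathrm{KL}(Q\|P) = 50\) nats, \(\delta = 0.05\), and \(\hat R_Q = 0.05\). Then \(\ln(2\sqrt{n}/\delta) = \ln 4000 \approx 8.29\), the penalty term is \(\sqrt{(50 + 8.29)/20000} \approx 0.054\), and the bound certifies \(R_Q \le 0.104\) with 95% confidence; the kl-inverse form yields a somewhat tighter value.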

04 When to Use

Use PAC-Bayes when you have (or can posit) a distribution over models and want a data-dependent generalization guarantee. Typical situations include: (1) Bayesian or variational training of neural networks, where Q is a Gaussian over weights and P is a simple zero-mean Gaussian prior; (2) ensemble or stochastic predictors (e.g., dropout), where prediction itself is randomized; (3) analyzing flat minima by measuring loss under random perturbations around a solution—if perturbed models remain good, E_Q[\hat R] is low for a reasonably spread Q; (4) compression-based analysis: if your trained model compresses to L bits, treat P(h)=2^{-L} (or use a coding scheme) to turn code length into a KL term and obtain a guarantee; (5) hyperparameter or architecture selection with a hyper-prior: you can pick a prior from a family using a held-out split or pay an extra KL term to formalize the choice. Prefer PAC-Bayes over uniform VC/Rademacher bounds when the posterior Q is highly concentrated on a small, good region of H, making KL(Q||P) small even if H is huge (as in deep nets). Avoid it if you cannot sensibly specify a data-independent prior (or hyper-prior), if your losses are unbounded (unless you use suitable variants, e.g., Catoni or Bernstein forms), or if you only care about a single deterministic predictor without relating it to a Gibbs/averaged predictor (though majority-vote and deterministic conversions exist).
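
As a toy illustration of scenario (3), the sketch below uses synthetic one-dimensional quadratic "losses" as stand-ins for a real training loss: the same perturbation distribution Q (and hence the same KL against any fixed prior) yields a small expected loss at a flat minimum and a large one at a sharp minimum.

#include <iostream>
#include <random>

int main() {
    // Toy 1-D loss landscapes: both minima sit at theta = 0 with loss 0,
    // but curvature differs by 100x (flat vs sharp).
    auto flat_loss  = [](double t) { return 0.5 * t * t; };  // curvature 1
    auto sharp_loss = [](double t) { return 50.0 * t * t; }; // curvature 100

    std::mt19937 rng(0);
    std::normal_distribution<double> noise(0.0, 0.3); // Q = N(0, 0.3^2) around the minimum

    const int M = 100000;
    double flat_avg = 0.0, sharp_avg = 0.0;
    for (int m = 0; m < M; ++m) {
        double eps = noise(rng);   // sample a perturbed model from Q
        flat_avg  += flat_loss(eps);
        sharp_avg += sharp_loss(eps);
    }
    flat_avg /= M; sharp_avg /= M;

    // Same Q (hence same KL to any fixed prior), very different E_Q[loss]:
    // the flat minimum keeps the empirical Gibbs risk small, the sharp one does not.
    std::cout << "E_Q[loss], flat minimum:  " << flat_avg  << "\n";
    std::cout << "E_Q[loss], sharp minimum: " << sharp_avg << "\n";
    return 0;
}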

⚠️ Common Mistakes

  • Data-dependent prior without correction: Choosing P after seeing S invalidates the standard bound. Fix P before seeing data, use a hyper-prior π and pay an extra KL(P||π) term, or split the data to select P.
  • Mixing units: KL must be in nats when the bound uses natural logarithms. If you measure compression in bits, multiply by ln 2.
  • Estimating E_Q[\hat R] poorly: Too few posterior samples make the Monte Carlo estimate high-variance, loosening bounds. Increase samples, use common random numbers, and ensure reproducibility.
  • Misapplying the square-root bound: The simple square-root form assumes bounded (often 0–1) losses; for real-valued losses, use Catoni/Bernstein PAC-Bayes variants.
  • Ignoring confidence δ: Picking δ after looking at results is multiple testing. Decide δ a priori, or correct with union bounds across many evaluations (see the sketch after this list).
  • Forgetting the predictor is Gibbs: The bound covers the randomized (draw-then-predict) strategy. To bound a deterministic predictor, use majority-vote/C-bounds or relate the deterministic model to a nearby randomized one and argue closeness of risks.
  • Overfitting to the bound: Tuning Q to minimize the numerical bound on the same S without care can overfit the certificate. Use validation, report δ, and avoid repeated peeking without correction.
  • Inconsistent priors: A very narrow prior P mismatched to plausible solutions inflates KL(Q||P). Choose P to reflect symmetries (e.g., zero-mean isotropic Gaussians over weights) to keep KL manageable.
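
A minimal sketch of the union-bound fix mentioned above, under the assumption that k candidate prior scales (and hence KL values) were fixed before looking at the data; the KL values below are made up for illustration, and the bound function mirrors the square-root form used throughout.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

// Square-root PAC-Bayes bound for given empirical Gibbs risk, KL (nats), n, delta.
double sqrt_bound(double hat_R, double KL, int n, double delta) {
    double penalty = (KL + std::log(2.0 * std::sqrt(static_cast<double>(n)) / delta)) / n;
    return hat_R + std::sqrt(0.5 * penalty);
}

int main() {
    int n = 10000;
    double hat_R = 0.05, delta = 0.05; // hat_R depends only on Q, not on the prior

    // k candidate priors fixed BEFORE seeing results; each KL value here is
    // illustrative (in practice, recompute KL(Q||P_sigma) for each prior scale sigma).
    std::vector<double> KL_per_prior = {120.0, 60.0, 45.0, 80.0};
    double k = static_cast<double>(KL_per_prior.size());

    // Union bound: each prior is evaluated at confidence delta/k, so the reported
    // minimum still holds with probability at least 1 - delta overall.
    double best = 1.0;
    for (double KL : KL_per_prior)
        best = std::min(best, sqrt_bound(hat_R, KL, n, delta / k));

    std::cout << "Best certified risk over " << KL_per_prior.size()
              << " priors: R_Q <= " << best << "\n";
    return 0;
}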

Key Formulas

Gibbs (posterior-averaged) risk

\( R_Q = \mathbb{E}_{h\sim Q}[R(h)], \qquad R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)] \)

Explanation: This is the expected true risk when you first draw a model from the posterior Q and then evaluate its loss on the data distribution. It is the main target of PAC-Bayes bounds.

Empirical Gibbs risk

\( \hat R_Q = \mathbb{E}_{h\sim Q}[\hat R(h)], \qquad \hat R(h) = \frac{1}{n}\sum_{i=1}^{n}\ell(h(x_i), y_i) \)

Explanation: This is the expected training loss averaged over models sampled from Q. It is observable from data and approximated by Monte Carlo.

PAC-Bayes-kl (McAllester)

\( \mathrm{kl}(\hat R_Q \,\|\, R_Q) \le \dfrac{\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}}{n} \)

Explanation: With probability at least 1−δ over the sample, this inequality holds simultaneously for all posteriors Q. It tightly links training error, test error, and the complexity term KL(Q||P).

Square-root PAC-Bayes bound

\( R_Q \le \hat R_Q + \sqrt{\dfrac{\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}} \)

Explanation: A convenient, explicit upper bound on the true risk obtained by inverting the kl-inequality via Pinsker-type arguments. It is slightly looser but easy to compute.

KL divergence (definition)

\( \mathrm{KL}(Q\|P) = \mathbb{E}_{h\sim Q}\!\left[\ln\frac{dQ}{dP}(h)\right] \)

Explanation: Measures how far the posterior Q moves from the prior P, in nats when using natural logs. It acts as the complexity penalty in PAC-Bayes bounds.

KL between diagonal Gaussians

\( \mathrm{KL}(Q\|P) = \frac{1}{2}\sum_{j=1}^{d}\left[\frac{\sigma_{q,j}^{2} + (\mu_{q,j}-\mu_{p,j})^{2}}{\sigma_{p,j}^{2}} - 1 + \ln\frac{\sigma_{p,j}^{2}}{\sigma_{q,j}^{2}}\right] \)

Explanation: Closed-form KL used when Q and P are factorized Gaussians over parameters. It scales linearly with dimension and is cheap to compute.

Binary relative entropy

\( \mathrm{kl}(a\|b) = a\ln\frac{a}{b} + (1-a)\ln\frac{1-a}{1-b} \)

Explanation: This is the KL divergence between Bernoulli(a) and Bernoulli(b). It appears in the tight PAC-Bayes-kl bound for 0–1 losses.

Reparameterization

\( w = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \)

Explanation: Sampling from a diagonal Gaussian posterior is implemented by perturbing the mean with scaled standard normal noise. This enables Monte Carlo estimates of E_Q[\hat R].

Deterministic posterior KL

\( \mathrm{KL}(\delta_h \,\|\, P) = \ln\frac{1}{P(h)} \)

Explanation: If Q is a point mass at h, the KL reduces to the negative log-prior probability of h. Under a coding prior P(h) = 2^{-L}, this equals L ln 2, linking compression to generalization.

Catoni-style PAC-Bayes

For any fixed \(\lambda > 0\) and loss in [0,1], with probability at least \(1-\delta\), for all Q: \( R_Q \le \dfrac{1 - \exp\!\left[-\left(\lambda \hat R_Q + \frac{\mathrm{KL}(Q\|P) + \ln\frac{1}{\delta}}{n}\right)\right]}{1 - e^{-\lambda}} \)

Explanation: A sharper bound for bounded real-valued losses using an exponential transform. The parameter λ balances fit and complexity; it must be fixed independently of the data, or selected from a fixed grid with a union bound.
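
Since the Catoni form is not exercised by the code examples below, here is a minimal sketch of its evaluation, assuming the exponential-form statement above; the λ grid is fixed in advance, so each candidate is evaluated at confidence δ/k.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

// Catoni-style PAC-Bayes bound (exponential form) for a loss in [0,1].
// Assumes R_Q <= (1 - exp(-(lambda*hat_R + (KL + ln(1/delta))/n))) / (1 - exp(-lambda)).
double catoni_bound(double hat_R, double KL, int n, double delta, double lambda) {
    double expo = lambda * hat_R + (KL + std::log(1.0 / delta)) / n;
    return (1.0 - std::exp(-expo)) / (1.0 - std::exp(-lambda));
}

int main() {
    int n = 10000;
    double hat_R = 0.05, KL = 50.0, delta = 0.05; // illustrative values

    // lambda must be chosen independently of the data; fix a small grid up front
    // and pay a union-bound factor of k in the confidence term.
    std::vector<double> lambdas = {0.5, 1.0, 2.0, 4.0};
    double k = static_cast<double>(lambdas.size());

    double best = 1.0;
    for (double lam : lambdas)
        best = std::min(best, catoni_bound(hat_R, KL, n, delta / k, lam));

    std::cout << "Catoni-style bound: R_Q <= " << best << "\n";
    return 0;
}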

Complexity Analysis

For practical evaluation, the main costs are (i) estimating the empirical Gibbs risk E_Q[\hat R] via Monte Carlo with M posterior samples, each requiring a full pass over the data (O(M n d) for a d-dimensional linear model), and (ii) computing KL(Q||P), which is O(d) in closed form for diagonal Gaussians, plus inverting the kl bound by bisection over b ∈ [a,1], where a is the estimated empirical Gibbs risk; each bisection iteration is O(1), and a small fixed number (≈50–100) of iterations suffices for high precision, yielding O(1) extra time. Space usage is O(n d) if you materialize the data matrix, or O(n + d) if you stream examples and store only parameters and statistics. If the hypothesis evaluation is a neural network forward pass, replace O(d) per example by the cost of one forward pass; the overall scaling remains linear in M and n. Increasing M reduces Monte Carlo variance but linearly increases time; common random numbers and low-variance estimators can improve accuracy for a fixed budget. Compression-based bounds avoid Monte Carlo for E_Q[\hat R] when Q is a point mass (a deterministic model): you only pay for a single dataset evaluation, O(n · cost(model)), and compute KL from the code length in O(1) or O(d), depending on the coding scheme. In summary, PAC-Bayes evaluation is typically linear in dataset size and in the number of posterior samples, with modest memory overhead.
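
To illustrate the common-random-numbers remark: when comparing the empirical Gibbs risks of two posteriors, driving both estimates with the same standard-normal draws removes much of the Monte Carlo noise from their difference. A minimal sketch, with the per-model empirical risk abstracted as a callable:

#include <random>
#include <vector>
#include <functional>
#include <utility>

// Estimate E_Q[\hat R] for two diagonal-Gaussian posteriors (mu1,s1) and (mu2,s2)
// using COMMON random numbers: the same eps draws drive both estimates, so the
// difference of the two estimates has far lower variance than with independent draws.
std::pair<double, double> compare_with_crn(
    const std::vector<double>& mu1, const std::vector<double>& s1,
    const std::vector<double>& mu2, const std::vector<double>& s2,
    const std::function<double(const std::vector<double>&)>& empirical_risk,
    size_t M, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> N01(0.0, 1.0);
    size_t d = mu1.size();
    double r1 = 0.0, r2 = 0.0;
    std::vector<double> w1(d), w2(d);
    for (size_t m = 0; m < M; ++m) {
        for (size_t j = 0; j < d; ++j) {
            double eps = N01(rng);        // one shared draw per coordinate
            w1[j] = mu1[j] + s1[j] * eps; // sample from Q1
            w2[j] = mu2[j] + s2[j] * eps; // sample from Q2 with the SAME eps
        }
        r1 += empirical_risk(w1);
        r2 += empirical_risk(w2);
    }
    return { r1 / M, r2 / M };
}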

Code Examples

PAC-Bayes bound for a Gaussian posterior over a linear classifier (Monte Carlo + kl-inverse)
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <algorithm>

// Utility: dot product
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Sigmoid for logistic prediction (not strictly needed for 0-1 loss with sign, but illustrative)
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Generate a synthetic binary classification dataset
struct Dataset {
    std::vector<std::vector<double>> X; // n x d
    std::vector<int> y;                 // n labels in {0,1}
};

Dataset make_synthetic(size_t n, size_t d, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> N01(0.0, 1.0);
    // True separating hyperplane
    std::vector<double> w_true(d, 0.0);
    for (size_t j = 0; j < d; ++j) w_true[j] = N01(rng);
    Dataset data;
    data.X.resize(n, std::vector<double>(d, 0.0));
    data.y.resize(n, 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < d; ++j) data.X[i][j] = N01(rng);
        double z = dot(data.X[i], w_true) + 0.3 * N01(rng); // add label noise
        data.y[i] = (z > 0.0) ? 1 : 0;
    }
    return data;
}

// Diagonal Gaussian distribution over parameters
struct DiagGaussian {
    std::vector<double> mu;      // mean (size d)
    std::vector<double> log_std; // log standard deviation (size d)
};

// Sample from a diagonal Gaussian: w = mu + exp(log_std) * eps (reparameterization)
std::vector<double> sample(const DiagGaussian& Q, std::mt19937& rng) {
    std::normal_distribution<double> N01(0.0, 1.0);
    size_t d = Q.mu.size();
    std::vector<double> w(d);
    for (size_t j = 0; j < d; ++j) {
        double eps = N01(rng);
        double sigma = std::exp(Q.log_std[j]);
        w[j] = Q.mu[j] + sigma * eps;
    }
    return w;
}

// KL(Q || P) for diagonal Gaussians, in nats
// Q = N(mu_q, diag(sigma_q^2)), P = N(mu_p, diag(sigma_p^2))
double kl_diag_gaussians(const DiagGaussian& Q, const DiagGaussian& P) {
    size_t d = Q.mu.size();
    double kl = 0.0;
    for (size_t j = 0; j < d; ++j) {
        double muq = Q.mu[j], mup = P.mu[j];
        double sq = std::exp(2.0 * Q.log_std[j]); // sigma_q^2
        double sp = std::exp(2.0 * P.log_std[j]); // sigma_p^2
        // 0.5 * [ (sq + (muq-mup)^2)/sp - 1 + ln(sp/sq) ]
        double term = (sq + (muq - mup) * (muq - mup)) / sp - 1.0 + 2.0 * (P.log_std[j] - Q.log_std[j]);
        kl += 0.5 * term;
    }
    return kl;
}

// 0-1 empirical risk for a fixed weight vector on the dataset.
// Predict 1 if dot(w,x) >= 0, else 0.
double empirical_01_risk(const std::vector<double>& w, const Dataset& data) {
    size_t n = data.X.size();
    size_t mistakes = 0;
    for (size_t i = 0; i < n; ++i) {
        double z = dot(w, data.X[i]);
        int pred = (z >= 0.0) ? 1 : 0;
        mistakes += (pred != data.y[i]);
    }
    return static_cast<double>(mistakes) / static_cast<double>(n);
}

// Monte Carlo estimate of E_Q[\hat R] by sampling M weight vectors from Q
double monte_carlo_empirical_risk(const DiagGaussian& Q, const Dataset& data, size_t M, unsigned seed = 123) {
    std::mt19937 rng(seed);
    double sum = 0.0;
    for (size_t m = 0; m < M; ++m) {
        auto w = sample(Q, rng);
        sum += empirical_01_risk(w, data);
    }
    return sum / static_cast<double>(M);
}

// Binary relative entropy kl(a || b) with numerical safeguards
double kl_bernoulli(double a, double b) {
    const double eps = 1e-12;
    a = std::clamp(a, eps, 1.0 - eps);
    b = std::clamp(b, eps, 1.0 - eps);
    return a * std::log(a / b) + (1.0 - a) * std::log((1.0 - a) / (1.0 - b));
}

// Invert kl(a || b) <= c: find the largest b in [a,1) satisfying the inequality
// (an upper bound on the true risk). kl(a||b) is increasing in b for b >= a,
// so bisection applies; returning hi keeps the result a valid upper bound.
double inverse_kl_upper(double a, double c) {
    double lo = a;            // kl(a||lo) = 0 <= c
    double hi = 1.0 - 1e-12;  // kl(a||hi) is (numerically) very large
    for (int it = 0; it < 100; ++it) {
        double mid = 0.5 * (lo + hi);
        if (kl_bernoulli(a, mid) > c) {
            hi = mid; // mid violates kl(a||mid) <= c; shrink from above
        } else {
            lo = mid; // mid still satisfies the inequality; push up
        }
    }
    return hi;
}

int main() {
    // 1) Build data
    size_t n = 1000; // number of examples
    size_t d = 20;   // dimension
    Dataset data = make_synthetic(n, d, 7);

    // 2) Define prior P and posterior Q (diagonal Gaussians)
    DiagGaussian P; P.mu.assign(d, 0.0); P.log_std.assign(d, std::log(1.0)); // N(0, I)

    DiagGaussian Q; Q.mu.assign(d, 0.0); Q.log_std.assign(d, std::log(0.2));
    // Set Q's mean with a simple class-mean-difference heuristic:
    // w = sum_i (2*y_i - 1) * x_i, then normalize.
    for (size_t i = 0; i < n; ++i) {
        double t = (data.y[i] ? 1.0 : -1.0);
        for (size_t j = 0; j < d; ++j) Q.mu[j] += t * data.X[i][j];
    }
    double norm = std::sqrt(dot(Q.mu, Q.mu) + 1e-12);
    for (size_t j = 0; j < d; ++j) Q.mu[j] /= norm;

    // 3) Estimate E_Q[\hat R] by Monte Carlo
    size_t M = 200; // number of posterior samples
    double hat_R_Q = monte_carlo_empirical_risk(Q, data, M);

    // 4) Compute KL(Q||P)
    double KL_QP = kl_diag_gaussians(Q, P);

    // 5) Compute PAC-Bayes bounds
    double delta = 0.05; // 95% confidence
    double penalty = (KL_QP + std::log(2.0 * std::sqrt(static_cast<double>(n)) / delta)) / static_cast<double>(n);

    // Square-root explicit bound (slightly looser)
    double bound_sqrt = hat_R_Q + std::sqrt(0.5 * penalty);

    // Tight kl-inverse bound: largest R with kl(hat_R_Q || R) <= penalty
    double bound_klinverse = inverse_kl_upper(hat_R_Q, penalty);

    // 6) Report
    std::cout << "n = " << n << ", d = " << d << ", M = " << M << "\n";
    std::cout << "Empirical Gibbs risk E_Q[hat R] = " << hat_R_Q << "\n";
    std::cout << "KL(Q||P) [nats] = " << KL_QP << "\n";
    std::cout << "delta = " << delta << ", penalty = " << penalty << "\n";
    std::cout << "PAC-Bayes sqrt bound: R_Q <= " << bound_sqrt << "\n";
    std::cout << "PAC-Bayes kl-inverse bound: R_Q <= " << bound_klinverse << "\n";

    return 0;
}

This program builds a synthetic binary dataset, defines a prior P and posterior Q as diagonal Gaussians over a linear classifier's weights, and estimates the empirical Gibbs risk E_Q[\hat R] by sampling M weight vectors from Q. It computes the closed-form KL(Q||P) between diagonal Gaussians, then evaluates two PAC-Bayes bounds: a convenient square-root form and a tighter kl-inverse bound obtained by finding, via bisection, the largest R_Q consistent with kl(\hat R_Q || R_Q) ≤ (KL(Q||P) + ln(2√n/δ))/n. This directly demonstrates how small empirical error and small KL lead to small generalization error under the randomized (Gibbs) classifier.

Time: O(M n d) to estimate E_Q[\hat R], plus O(d) to compute KL and O(1) for the kl-inverse bisection.
Space: O(n d) to store the data matrix and O(d) for parameters.
Compression-to-PAC-Bayes: turn code length into a generalization bound
#include <iostream>
#include <cmath>
#include <algorithm>

// Given: sample size n, empirical risk hat_R in [0,1], total code length L_bits, and confidence delta.
// Returns: square-root PAC-Bayes bound using KL = L_bits * ln 2 (nats) for a point-mass posterior.
double pac_bayes_compression_bound(int n, double hat_R, long long L_bits, double delta) {
    const double LN2 = std::log(2.0);
    double KL = static_cast<double>(L_bits) * LN2; // convert bits to nats
    double penalty = (KL + std::log(2.0 * std::sqrt(static_cast<double>(n)) / delta)) / static_cast<double>(n);
    double bound = hat_R + std::sqrt(0.5 * penalty);
    return std::min(1.0, std::max(0.0, bound)); // clamp to [0,1]
}

int main() {
    int n = 50000;                // number of training examples
    double hat_R = 0.02;          // 2% training error of the compressed model
    int d = 1'000'000;            // number of weights
    int bits_per_weight = 4;      // quantization to 16 levels
    long long header_bits = 1024; // overhead for codebook, shapes, etc.

    long long L_bits = static_cast<long long>(d) * bits_per_weight + header_bits;
    double delta = 0.05; // 95% confidence

    double bound = pac_bayes_compression_bound(n, hat_R, L_bits, delta);

    std::cout << "Compression-based PAC-Bayes bound (point posterior):\n";
    std::cout << "n = " << n << ", hat_R = " << hat_R << ", L_bits = " << L_bits << ", delta = " << delta << "\n";
    std::cout << "Generalization certificate: R <= " << bound << "\n";
    return 0;
}

This example shows the MDL connection: a deterministic (compressed) model corresponds to a point-mass posterior Q=δ_h. If the coding scheme assigns probability P(h)=2^{-L} to a model with code length L bits, then KL(Q||P)=ln(1/P(h))=L ln 2 nats. Plugging this into the PAC-Bayes square-root bound yields a generalization guarantee purely from training error, sample size, confidence, and code length. Note that with the illustrative numbers above (about 4 × 10^6 bits against n = 5 × 10^4 examples) the certificate clamps to 1; nontrivial guarantees require the code length in nats to be small relative to n, which is why aggressive quantization and pruning matter when certifying large networks this way.

Time: O(1) to compute the bound (not counting the cost to train/compress the model).
Space: O(1).