
PAC-Bayes Theory

Key Points

  • PAC-Bayes provides high-probability generalization bounds for randomized predictors by comparing a data-dependent posterior Q to a fixed, data-independent prior P through KL(Q||P).
  • The bound says the expected true risk under Q is close to the expected empirical risk under Q, plus a complexity penalty that depends on KL(Q||P), the sample size n, and the confidence level 1−δ.
  • A tight form uses the binary relative entropy: kl(E_Q[\hat R] || E_Q[R]) ≤ (KL(Q||P) + ln(2√n/δ)) / n.
  • When Q concentrates on good hypotheses (e.g., flat minima in neural networks), KL(Q||P) can be small, yielding bounds much tighter than uniform convergence.
  • You can choose P to reflect prior beliefs before seeing data, and design Q (e.g., Gaussian over weights) during training, then evaluate the bound post hoc.
  • PAC-Bayes naturally connects to compression and MDL: a compressed model of L bits gives KL ≈ L ln 2 nats, turning code length into a generalization guarantee.
  • It applies directly to neural networks by taking Q as a distribution over weights and estimating E_Q[\hat R] via sampling (reparameterization trick).
  • Modern theory uses PAC-Bayes to explain why flat minima generalize: wide regions of low loss allow low-KL posteriors that keep empirical risk small.

Prerequisites

  • Probability basics: Expectations, independence, and distributions are needed to define risks and randomized predictors.
  • Kullback–Leibler divergence: KL(Q||P) is the core complexity term in PAC-Bayes; understanding its meaning and units is essential.
  • Empirical risk minimization (ERM): PAC-Bayes relates empirical (training) risk to true (test) risk; ERM provides the context.
  • Concentration inequalities: PAC-Bayes proofs use concentration and change-of-measure arguments akin to Chernoff/Hoeffding bounds.
  • Bayesian inference / variational methods: Choosing and working with a posterior Q (e.g., a Gaussian over weights) mirrors Bayesian and variational practice.
  • Optimization basics: Training models and potentially tuning Q requires gradient-based optimization knowledge.
  • Linear classifiers and logistic regression: These serve as simple hypothesis classes to demonstrate PAC-Bayes bounds concretely.
  • C++ programming with random number generation: Implementing Monte Carlo estimates and sampling from Gaussian posteriors uses RNGs and numerical routines.

Detailed Explanation


01 Overview

PAC-Bayes is a framework for understanding why machine learning models trained on finite data can still perform well on unseen data. Instead of analyzing a single, fixed model, PAC-Bayes studies randomized predictors: we first pick a hypothesis h at random from a posterior distribution Q (which depends on the data), and then use h to predict. The key insight is that generalization is controlled by how much Q departs from a data-independent prior P, measured by Kullback–Leibler divergence KL(Q||P). If Q is close to P (small divergence), then with high probability the expected true risk under Q is close to the expected training risk under Q, up to a term that shrinks with sample size n and increases with confidence 1−δ. This yields data-dependent bounds that can be much sharper than classical uniform convergence bounds, especially when Q places mass on a small, good region of hypothesis space. In deep learning, we commonly choose Q to be a Gaussian over weights (with mean at a trained solution and a chosen variance), and evaluate the bound by sampling models from Q to estimate the empirical risk. PAC-Bayes also unifies with compression and Minimum Description Length (MDL): a short code for a model (few bits) corresponds to a small KL penalty, giving a principled route from compressibility to generalization.
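
To make the draw-then-predict step concrete, here is a minimal sketch (not from the original text) of a Gibbs predictor in C++, assuming a diagonal Gaussian posterior over the weights of a linear classifier, the same setup the full example below uses: each call samples a fresh weight vector from Q via the reparameterization w = mu + sigma*eps and then classifies with it.

#include <random>
#include <vector>

// Minimal Gibbs predictor: draw w ~ Q = N(mu, diag(sigma^2)), then predict sign(w·x).
// mu and sigma describe the posterior Q; a fresh draw is made for every prediction.
int gibbs_predict(const std::vector<double>& x,
                  const std::vector<double>& mu,
                  const std::vector<double>& sigma,
                  std::mt19937& rng) {
    std::normal_distribution<double> N01(0.0, 1.0);
    double z = 0.0;
    for (size_t j = 0; j < x.size(); ++j) {
        double wj = mu[j] + sigma[j] * N01(rng); // reparameterization: w = mu + sigma*eps
        z += wj * x[j];
    }
    return (z >= 0.0) ? 1 : 0;
}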

02 Intuition & Analogies

Imagine you are buying an insurance policy for your model’s performance. The insurer asks two things: (1) how risky is your plan to pick a particular model? and (2) how much evidence do you have that your plan works on the data you saw? The plan, in PAC-Bayes, is a randomized strategy: instead of committing to one model, you draw one from a distribution Q. Your pre-data beliefs are encoded in a prior P; after seeing data, you update to Q. The insurer charges you a premium proportional to how far Q moved from P—this is KL(Q||P). If you barely changed your mind (small KL), you pay a small premium. Your proof that the plan works is your training performance E_Q[\hat R], averaged over draws from Q. The final guarantee says: your test performance E_Q[R] will be close to your training performance E_Q[\hat R], plus a surcharge that grows with the premium (KL) and shrinks with more data (n). Why does this capture flat minima? Picture balancing a marble in a valley. A flat valley lets you wiggle the marble (sample different models around the solution) without increasing loss much. That means you can use a relatively broad Q (spread out posterior) and still have low empirical risk. However, a broader Q also increases KL unless P is broad too; in a flat valley, you can still keep KL modest because the posterior doesn’t need to concentrate razor-thinly to stay accurate. In contrast, a sharp minimum forces Q to be very narrow to keep training error low; narrow Q often means larger KL relative to a simple prior, worsening the bound.

03 Formal Definition

Let \(H\) be a hypothesis class and \(\ell: \mathcal{Y} \times \mathcal{Y} \to [0,1]\) a bounded loss (e.g., 0–1). A randomized (Gibbs) predictor draws \(h \sim Q\) and predicts using h. For a data distribution \(\mathcal{D}\) over \(\mathcal{X} \times \mathcal{Y}\) and an i.i.d. sample \(S=\{(x_i,y_i)\}_{i=1}^{n}\), define the true risk of h, \(R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x),y)]\), and the empirical risk, \(\hat R(h) = \frac{1}{n}\sum_{i=1}^{n}\ell(h(x_i),y_i)\). The Gibbs (posterior-averaged) risks are \(R_Q = \mathbb{E}_{h\sim Q}[R(h)]\) and \(\hat R_Q = \mathbb{E}_{h\sim Q}[\hat R(h)]\). Given a data-independent prior distribution P over H and any posterior Q (possibly chosen after seeing S), the classical PAC-Bayes-kl bound (McAllester) states that with probability at least \(1-\delta\) over the draw of S, for all Q: \[ \mathrm{kl}(\hat R_Q \,\|\, R_Q) \le \frac{\mathrm{KL}(Q\|P) + \ln\!\left(\tfrac{2\sqrt{n}}{\delta}\right)}{n}, \] where \(\mathrm{kl}(a\|b) = a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b}\) is the binary relative entropy and \(\mathrm{KL}(Q\|P) = \mathbb{E}_{h\sim Q}\!\left[\log\tfrac{dQ}{dP}(h)\right]\) is measured in nats. Inverted to bound \(R_Q\), a convenient (though slightly looser) form is \[ R_Q \le \hat R_Q + \sqrt{\frac{\mathrm{KL}(Q\|P) + \ln\!\left(\tfrac{2\sqrt{n}}{\delta}\right)}{2n}}. \] Numerous variants exist (Catoni/Bernstein forms for real-valued losses, hierarchical or data-dependent priors via extra KL terms, and bounds on majority votes), but all share the same skeleton: data fit under Q plus a KL complexity penalty.
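
As a quick numerical instantiation of the square-root form (the numbers are purely illustrative): take \(n = 10000\), \(\mathrm{KL}(Q\|P) = 50\) nats, \(\delta = 0.05\), and \(\hat R_Q = 0.05\). Then \(\ln(2\sqrt{n}/\delta) = \ln 4000 \approx 8.29\), the penalty term is \(\sqrt{(50 + 8.29)/20000} \approx 0.054\), and the bound certifies \(R_Q \le 0.104\) with 95% confidence; the kl-inverse form yields a somewhat tighter value.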

04 When to Use

Use PAC-Bayes when you have (or can posit) a distribution over models and want a data-dependent generalization guarantee. Typical situations include: (1) Bayesian or variational training of neural networks, where Q is a Gaussian over weights and P is a simple zero-mean Gaussian prior; (2) ensemble or stochastic predictors (e.g., dropout), where prediction itself is randomized; (3) analyzing flat minima by measuring loss under random perturbations around a solution—if perturbed models remain good, E_Q[\hat R] is low for a reasonably spread Q; (4) compression-based analysis: if your trained model compresses to L bits, treat P(h)=2^{-L} (or use a coding scheme) to turn code length into a KL term and obtain a guarantee; (5) hyperparameter or architecture selection with a hyper-prior: you can pick a prior from a family using a held-out split or pay an extra KL term to formalize the choice. Prefer PAC-Bayes over uniform VC/Rademacher bounds when the posterior Q is highly concentrated on a small, good region of H, making KL(Q||P) small even if H is huge (as in deep nets). Avoid it if you cannot sensibly specify a data-independent prior (or hyper-prior), if your losses are unbounded (unless you use suitable variants, e.g., Catoni or Bernstein forms), or if you only care about a single deterministic predictor without relating it to a Gibbs/averaged predictor (though majority-vote and deterministic conversions exist).
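
As a toy illustration of scenario (3), the sketch below uses synthetic one-dimensional quadratic "losses" as stand-ins for a real training loss: the same perturbation distribution Q (and hence the same KL against any fixed prior) yields a small expected loss at a flat minimum and a large one at a sharp minimum.

#include <iostream>
#include <random>

int main() {
    // Toy 1-D loss landscapes: both minima sit at theta = 0 with loss 0,
    // but curvature differs by 100x (flat vs sharp).
    auto flat_loss  = [](double t) { return 0.5 * t * t; };  // curvature 1
    auto sharp_loss = [](double t) { return 50.0 * t * t; }; // curvature 100

    std::mt19937 rng(0);
    std::normal_distribution<double> noise(0.0, 0.3); // Q = N(0, 0.3^2) around the minimum

    const int M = 100000;
    double flat_avg = 0.0, sharp_avg = 0.0;
    for (int m = 0; m < M; ++m) {
        double eps = noise(rng);   // sample a perturbed model from Q
        flat_avg  += flat_loss(eps);
        sharp_avg += sharp_loss(eps);
    }
    flat_avg /= M; sharp_avg /= M;

    // Same Q (hence same KL to any fixed prior), very different E_Q[loss]:
    // the flat minimum keeps the empirical Gibbs risk small, the sharp one does not.
    std::cout << "E_Q[loss], flat minimum:  " << flat_avg  << "\n";
    std::cout << "E_Q[loss], sharp minimum: " << sharp_avg << "\n";
    return 0;
}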

⚠️ Common Mistakes

  • Data-dependent prior without correction: Choosing P after seeing S invalidates the standard bound. Fix P before seeing data, use a hyper-prior π and pay an extra KL(P||π) term, or split the data to select P.
  • Mixing units: KL must be in nats when the bound uses natural logarithms. If you measure compression in bits, multiply by ln 2.
  • Estimating E_Q[\hat R] poorly: Too few posterior samples make the Monte Carlo estimate high-variance, loosening bounds. Increase samples, use common random numbers, and ensure reproducibility.
  • Misapplying the square-root bound: The simple square-root form assumes bounded (often 0–1) losses; for real-valued losses, use Catoni/Bernstein PAC-Bayes variants.
  • Ignoring confidence δ: Picking δ after looking at results is multiple testing. Decide δ a priori, or correct with union bounds across many evaluations (see the sketch after this list).
  • Forgetting the predictor is Gibbs: The bound covers the randomized (draw-then-predict) strategy. To bound a deterministic predictor, use majority-vote/C-bounds or relate the deterministic model to a nearby randomized one and argue closeness of risks.
  • Overfitting to the bound: Tuning Q to minimize the numerical bound on the same S without care can overfit the certificate. Use validation, report δ, and avoid repeated peeking without correction.
  • Inconsistent priors: A very narrow prior P mismatched to plausible solutions inflates KL(Q||P). Choose P to reflect symmetries (e.g., zero-mean isotropic Gaussians over weights) to keep KL manageable.
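
A minimal sketch of the union-bound fix mentioned above, under the assumption that k candidate prior scales (and hence KL values) were fixed before looking at the data; the KL values below are made up for illustration, and the bound function mirrors the square-root form used throughout.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

// Square-root PAC-Bayes bound for given empirical Gibbs risk, KL (nats), n, delta.
double sqrt_bound(double hat_R, double KL, int n, double delta) {
    double penalty = (KL + std::log(2.0 * std::sqrt(static_cast<double>(n)) / delta)) / n;
    return hat_R + std::sqrt(0.5 * penalty);
}

int main() {
    int n = 10000;
    double hat_R = 0.05, delta = 0.05; // hat_R depends only on Q, not on the prior

    // k candidate priors fixed BEFORE seeing results; each KL value here is
    // illustrative (in practice, recompute KL(Q||P_sigma) for each prior scale sigma).
    std::vector<double> KL_per_prior = {120.0, 60.0, 45.0, 80.0};
    double k = static_cast<double>(KL_per_prior.size());

    // Union bound: each prior is evaluated at confidence delta/k, so the reported
    // minimum still holds with probability at least 1 - delta overall.
    double best = 1.0;
    for (double KL : KL_per_prior)
        best = std::min(best, sqrt_bound(hat_R, KL, n, delta / k));

    std::cout << "Best certified risk over " << KL_per_prior.size()
              << " priors: R_Q <= " << best << "\n";
    return 0;
}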

Key Formulas

Gibbs (posterior-averaged) risk

\( R_Q = \mathbb{E}_{h\sim Q}[R(h)], \qquad R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)] \)

Explanation: This is the expected true risk when you first draw a model from the posterior Q and then evaluate its loss on the data distribution. It is the main target of PAC-Bayes bounds.

Empirical Gibbs risk

\( \hat R_Q = \mathbb{E}_{h\sim Q}[\hat R(h)], \qquad \hat R(h) = \frac{1}{n}\sum_{i=1}^{n}\ell(h(x_i), y_i) \)

Explanation: This is the expected training loss averaged over models sampled from Q. It is observable from data and approximated by Monte Carlo.

PAC-Bayes-kl (McAllester)

\( \mathrm{kl}(\hat R_Q \,\|\, R_Q) \le \dfrac{\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}}{n} \)

Explanation: With probability at least 1−δ over the sample, this inequality holds simultaneously for all posteriors Q. It tightly links training error, test error, and the complexity term KL(Q||P).

Square-root PAC-Bayes bound

\( R_Q \le \hat R_Q + \sqrt{\dfrac{\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}} \)

Explanation: A convenient, explicit upper bound on the true risk obtained by inverting the kl-inequality via Pinsker-type arguments. It is slightly looser but easy to compute.

KL divergence (definition)

\( \mathrm{KL}(Q\|P) = \mathbb{E}_{h\sim Q}\!\left[\ln\frac{dQ}{dP}(h)\right] \)

Explanation: Measures how far the posterior Q moves from the prior P, in nats when using natural logs. It acts as the complexity penalty in PAC-Bayes bounds.

KL between diagonal Gaussians

\( \mathrm{KL}(Q\|P) = \frac{1}{2}\sum_{j=1}^{d}\left[\frac{\sigma_{q,j}^{2} + (\mu_{q,j}-\mu_{p,j})^{2}}{\sigma_{p,j}^{2}} - 1 + \ln\frac{\sigma_{p,j}^{2}}{\sigma_{q,j}^{2}}\right] \)

Explanation: Closed-form KL used when Q and P are factorized Gaussians over parameters. It scales linearly with dimension and is cheap to compute.

Binary relative entropy

\( \mathrm{kl}(a\|b) = a\ln\frac{a}{b} + (1-a)\ln\frac{1-a}{1-b} \)

Explanation: This is the KL divergence between Bernoulli(a) and Bernoulli(b). It appears in the tight PAC-Bayes-kl bound for 0–1 losses.

Reparameterization

\( w = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \)

Explanation: Sampling from a diagonal Gaussian posterior is implemented by perturbing the mean with scaled standard normal noise. This enables Monte Carlo estimates of E_Q[\hat R].

Deterministic posterior KL

\( \mathrm{KL}(\delta_h \,\|\, P) = \ln\frac{1}{P(h)} \)

Explanation: If Q is a point mass at h, the KL reduces to the negative log-prior probability of h. Under a coding prior P(h) = 2^{-L}, this equals L ln 2, linking compression to generalization.

Catoni-style PAC-Bayes

For any fixed \(\lambda > 0\) and loss in [0,1], with probability at least \(1-\delta\), for all Q: \( R_Q \le \dfrac{1 - \exp\!\left[-\left(\lambda \hat R_Q + \frac{\mathrm{KL}(Q\|P) + \ln\frac{1}{\delta}}{n}\right)\right]}{1 - e^{-\lambda}} \)

Explanation: A sharper bound for bounded real-valued losses using an exponential transform. The parameter λ balances fit and complexity; it must be fixed independently of the data, or selected from a fixed grid with a union bound.
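
Since the Catoni form is not exercised by the code examples below, here is a minimal sketch of its evaluation, assuming the exponential-form statement above; the λ grid is fixed in advance, so each candidate is evaluated at confidence δ/k.

#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>

// Catoni-style PAC-Bayes bound (exponential form) for a loss in [0,1].
// Assumes R_Q <= (1 - exp(-(lambda*hat_R + (KL + ln(1/delta))/n))) / (1 - exp(-lambda)).
double catoni_bound(double hat_R, double KL, int n, double delta, double lambda) {
    double expo = lambda * hat_R + (KL + std::log(1.0 / delta)) / n;
    return (1.0 - std::exp(-expo)) / (1.0 - std::exp(-lambda));
}

int main() {
    int n = 10000;
    double hat_R = 0.05, KL = 50.0, delta = 0.05; // illustrative values

    // lambda must be chosen independently of the data; fix a small grid up front
    // and pay a union-bound factor of k in the confidence term.
    std::vector<double> lambdas = {0.5, 1.0, 2.0, 4.0};
    double k = static_cast<double>(lambdas.size());

    double best = 1.0;
    for (double lam : lambdas)
        best = std::min(best, catoni_bound(hat_R, KL, n, delta / k, lam));

    std::cout << "Catoni-style bound: R_Q <= " << best << "\n";
    return 0;
}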

Complexity Analysis

For practical evaluation, the main costs are (i) estimating the empirical Gibbs risk E_Q[\hat R] via Monte Carlo with M posterior samples, each requiring a full pass over the data (O(M n d) for a d-dimensional linear model), and (ii) computing KL(Q||P), which is O(d) in closed form for diagonal Gaussians, plus inverting the kl bound by bisection over b ∈ [a,1], where a is the estimated empirical Gibbs risk; each bisection iteration is O(1), and a small fixed number (≈50–100) of iterations suffices for high precision, yielding O(1) extra time. Space usage is O(n d) if you materialize the data matrix, or O(n + d) if you stream examples and store only parameters and statistics. If the hypothesis evaluation is a neural network forward pass, replace O(d) per example by the cost of one forward pass; the overall scaling remains linear in M and n. Increasing M reduces Monte Carlo variance but linearly increases time; common random numbers and low-variance estimators can improve accuracy for a fixed budget. Compression-based bounds avoid Monte Carlo for E_Q[\hat R] when Q is a point mass (a deterministic model): you only pay for a single dataset evaluation, O(n · cost(model)), and compute KL from the code length in O(1) or O(d), depending on the coding scheme. In summary, PAC-Bayes evaluation is typically linear in dataset size and in the number of posterior samples, with modest memory overhead.
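
To illustrate the common-random-numbers remark: when comparing the empirical Gibbs risks of two posteriors, driving both estimates with the same standard-normal draws removes much of the Monte Carlo noise from their difference. A minimal sketch, with the per-model empirical risk abstracted as a callable:

#include <random>
#include <vector>
#include <functional>
#include <utility>

// Estimate E_Q[\hat R] for two diagonal-Gaussian posteriors (mu1,s1) and (mu2,s2)
// using COMMON random numbers: the same eps draws drive both estimates, so the
// difference of the two estimates has far lower variance than with independent draws.
std::pair<double, double> compare_with_crn(
    const std::vector<double>& mu1, const std::vector<double>& s1,
    const std::vector<double>& mu2, const std::vector<double>& s2,
    const std::function<double(const std::vector<double>&)>& empirical_risk,
    size_t M, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> N01(0.0, 1.0);
    size_t d = mu1.size();
    double r1 = 0.0, r2 = 0.0;
    std::vector<double> w1(d), w2(d);
    for (size_t m = 0; m < M; ++m) {
        for (size_t j = 0; j < d; ++j) {
            double eps = N01(rng);        // one shared draw per coordinate
            w1[j] = mu1[j] + s1[j] * eps; // sample from Q1
            w2[j] = mu2[j] + s2[j] * eps; // sample from Q2 with the SAME eps
        }
        r1 += empirical_risk(w1);
        r2 += empirical_risk(w2);
    }
    return { r1 / M, r2 / M };
}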

Code Examples

PAC-Bayes bound for a Gaussian posterior over a linear classifier (Monte Carlo + kl-inverse)
#include <iostream>
#include <vector>
#include <random>
#include <cmath>
#include <algorithm>

// Utility: dot product
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Sigmoid for logistic prediction (not strictly needed for 0-1 loss with sign, but illustrative)
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Generate a synthetic binary classification dataset
struct Dataset {
    std::vector<std::vector<double>> X; // n x d
    std::vector<int> y;                 // n labels in {0,1}
};

Dataset make_synthetic(size_t n, size_t d, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> N01(0.0, 1.0);
    // True separating hyperplane
    std::vector<double> w_true(d, 0.0);
    for (size_t j = 0; j < d; ++j) w_true[j] = N01(rng);
    Dataset data;
    data.X.resize(n, std::vector<double>(d, 0.0));
    data.y.resize(n, 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < d; ++j) data.X[i][j] = N01(rng);
        double z = dot(data.X[i], w_true) + 0.3 * N01(rng); // add label noise
        data.y[i] = (z > 0.0) ? 1 : 0;
    }
    return data;
}

// Diagonal Gaussian distribution over parameters
struct DiagGaussian {
    std::vector<double> mu;      // mean (size d)
    std::vector<double> log_std; // log standard deviation (size d)
};

// Sample from a diagonal Gaussian: w = mu + exp(log_std) * eps (reparameterization)
std::vector<double> sample(const DiagGaussian& Q, std::mt19937& rng) {
    std::normal_distribution<double> N01(0.0, 1.0);
    size_t d = Q.mu.size();
    std::vector<double> w(d);
    for (size_t j = 0; j < d; ++j) {
        double eps = N01(rng);
        double sigma = std::exp(Q.log_std[j]);
        w[j] = Q.mu[j] + sigma * eps;
    }
    return w;
}

// KL(Q || P) for diagonal Gaussians, in nats
// Q = N(mu_q, diag(sigma_q^2)), P = N(mu_p, diag(sigma_p^2))
double kl_diag_gaussians(const DiagGaussian& Q, const DiagGaussian& P) {
    size_t d = Q.mu.size();
    double kl = 0.0;
    for (size_t j = 0; j < d; ++j) {
        double muq = Q.mu[j], mup = P.mu[j];
        double sq = std::exp(2.0 * Q.log_std[j]); // sigma_q^2
        double sp = std::exp(2.0 * P.log_std[j]); // sigma_p^2
        // 0.5 * [ (sq + (muq-mup)^2)/sp - 1 + ln(sp/sq) ]
        double term = (sq + (muq - mup) * (muq - mup)) / sp - 1.0 + 2.0 * (P.log_std[j] - Q.log_std[j]);
        kl += 0.5 * term;
    }
    return kl;
}

// 0-1 empirical risk for a fixed weight vector on the dataset.
// Predict 1 if dot(w,x) >= 0, else 0.
double empirical_01_risk(const std::vector<double>& w, const Dataset& data) {
    size_t n = data.X.size();
    size_t mistakes = 0;
    for (size_t i = 0; i < n; ++i) {
        double z = dot(w, data.X[i]);
        int pred = (z >= 0.0) ? 1 : 0;
        mistakes += (pred != data.y[i]);
    }
    return static_cast<double>(mistakes) / static_cast<double>(n);
}

// Monte Carlo estimate of E_Q[\hat R] by sampling M weight vectors from Q
double monte_carlo_empirical_risk(const DiagGaussian& Q, const Dataset& data, size_t M, unsigned seed = 123) {
    std::mt19937 rng(seed);
    double sum = 0.0;
    for (size_t m = 0; m < M; ++m) {
        auto w = sample(Q, rng);
        sum += empirical_01_risk(w, data);
    }
    return sum / static_cast<double>(M);
}

// Binary relative entropy kl(a || b) with numerical safeguards
double kl_bernoulli(double a, double b) {
    const double eps = 1e-12;
    a = std::clamp(a, eps, 1.0 - eps);
    b = std::clamp(b, eps, 1.0 - eps);
    return a * std::log(a / b) + (1.0 - a) * std::log((1.0 - a) / (1.0 - b));
}

// Invert kl(a || b) <= c: find the largest b in [a,1) satisfying the inequality
// (an upper bound on the true risk). kl(a||b) is increasing in b for b >= a,
// so bisection applies; returning hi keeps the result a valid upper bound.
double inverse_kl_upper(double a, double c) {
    double lo = a;            // kl(a||lo) = 0 <= c
    double hi = 1.0 - 1e-12;  // kl(a||hi) is (numerically) very large
    for (int it = 0; it < 100; ++it) {
        double mid = 0.5 * (lo + hi);
        if (kl_bernoulli(a, mid) > c) {
            hi = mid; // mid violates kl(a||mid) <= c; shrink from above
        } else {
            lo = mid; // mid still satisfies the inequality; push up
        }
    }
    return hi;
}

int main() {
    // 1) Build data
    size_t n = 1000; // number of examples
    size_t d = 20;   // dimension
    Dataset data = make_synthetic(n, d, 7);

    // 2) Define prior P and posterior Q (diagonal Gaussians)
    DiagGaussian P; P.mu.assign(d, 0.0); P.log_std.assign(d, std::log(1.0)); // N(0, I)

    DiagGaussian Q; Q.mu.assign(d, 0.0); Q.log_std.assign(d, std::log(0.2));
    // Set Q's mean with a simple class-mean-difference heuristic:
    // w = sum_i (2*y_i - 1) * x_i, then normalize.
    for (size_t i = 0; i < n; ++i) {
        double t = (data.y[i] ? 1.0 : -1.0);
        for (size_t j = 0; j < d; ++j) Q.mu[j] += t * data.X[i][j];
    }
    double norm = std::sqrt(dot(Q.mu, Q.mu) + 1e-12);
    for (size_t j = 0; j < d; ++j) Q.mu[j] /= norm;

    // 3) Estimate E_Q[\hat R] by Monte Carlo
    size_t M = 200; // number of posterior samples
    double hat_R_Q = monte_carlo_empirical_risk(Q, data, M);

    // 4) Compute KL(Q||P)
    double KL_QP = kl_diag_gaussians(Q, P);

    // 5) Compute PAC-Bayes bounds
    double delta = 0.05; // 95% confidence
    double penalty = (KL_QP + std::log(2.0 * std::sqrt(static_cast<double>(n)) / delta)) / static_cast<double>(n);

    // Square-root explicit bound (slightly looser)
    double bound_sqrt = hat_R_Q + std::sqrt(0.5 * penalty);

    // Tight kl-inverse bound: largest R with kl(hat_R_Q || R) <= penalty
    double bound_klinverse = inverse_kl_upper(hat_R_Q, penalty);

    // 6) Report
    std::cout << "n = " << n << ", d = " << d << ", M = " << M << "\n";
    std::cout << "Empirical Gibbs risk E_Q[hat R] = " << hat_R_Q << "\n";
    std::cout << "KL(Q||P) [nats] = " << KL_QP << "\n";
    std::cout << "delta = " << delta << ", penalty = " << penalty << "\n";
    std::cout << "PAC-Bayes sqrt bound: R_Q <= " << bound_sqrt << "\n";
    std::cout << "PAC-Bayes kl-inverse bound: R_Q <= " << bound_klinverse << "\n";

    return 0;
}

This program builds a synthetic binary dataset, defines a prior P and posterior Q as diagonal Gaussians over a linear classifier's weights, and estimates the empirical Gibbs risk E_Q[\hat R] by sampling M weight vectors from Q. It computes the closed-form KL(Q||P) between diagonal Gaussians, then evaluates two PAC-Bayes bounds: a convenient square-root form and a tighter kl-inverse bound obtained by finding, via bisection, the largest R_Q consistent with kl(\hat R_Q || R_Q) ≤ (KL(Q||P) + ln(2√n/δ))/n. This directly demonstrates how small empirical error and small KL lead to small generalization error under the randomized (Gibbs) classifier.

Time: O(M n d) to estimate E_Q[\hat R], plus O(d) to compute KL and O(1) for the kl-inverse bisection.
Space: O(n d) to store the data matrix and O(d) for parameters.
Compression-to-PAC-Bayes: turn code length into a generalization bound
#include <iostream>
#include <cmath>
#include <algorithm>

// Given: sample size n, empirical risk hat_R in [0,1], total code length L_bits, and confidence delta.
// Returns: square-root PAC-Bayes bound using KL = L_bits * ln 2 (nats) for a point-mass posterior.
double pac_bayes_compression_bound(int n, double hat_R, long long L_bits, double delta) {
    const double LN2 = std::log(2.0);
    double KL = static_cast<double>(L_bits) * LN2; // convert bits to nats
    double penalty = (KL + std::log(2.0 * std::sqrt(static_cast<double>(n)) / delta)) / static_cast<double>(n);
    double bound = hat_R + std::sqrt(0.5 * penalty);
    return std::min(1.0, std::max(0.0, bound)); // clamp to [0,1]
}

int main() {
    int n = 50000;                // number of training examples
    double hat_R = 0.02;          // 2% training error of the compressed model
    int d = 1'000'000;            // number of weights
    int bits_per_weight = 4;      // quantization to 16 levels
    long long header_bits = 1024; // overhead for codebook, shapes, etc.

    long long L_bits = static_cast<long long>(d) * bits_per_weight + header_bits;
    double delta = 0.05; // 95% confidence

    double bound = pac_bayes_compression_bound(n, hat_R, L_bits, delta);

    std::cout << "Compression-based PAC-Bayes bound (point posterior):\n";
    std::cout << "n = " << n << ", hat_R = " << hat_R << ", L_bits = " << L_bits << ", delta = " << delta << "\n";
    std::cout << "Generalization certificate: R <= " << bound << "\n";
    return 0;
}

This example shows the MDL connection: a deterministic (compressed) model corresponds to a point-mass posterior Q=δ_h. If the coding scheme assigns probability P(h)=2^{-L} to a model with code length L bits, then KL(Q||P)=ln(1/P(h))=L ln 2 nats. Plugging this into the PAC-Bayes square-root bound yields a generalization guarantee purely from training error, sample size, confidence, and code length. Note that with the illustrative numbers above (about 4 × 10^6 bits against n = 5 × 10^4 examples) the certificate clamps to 1; nontrivial guarantees require the code length in nats to be small relative to n, which is why aggressive quantization and pruning matter when certifying large networks this way.

Time: O(1) to compute the bound (not counting the cost to train/compress the model).
Space: O(1).