Information-Theoretic Lower Bounds
Key Points
- Information-theoretic lower bounds tell you the best possible performance any learning algorithm can achieve, regardless of cleverness or compute.
- Fano’s inequality links prediction error to mutual information, proving that too little information in the data forces high error.
- Le Cam’s method and Assouad’s lemma turn hard two-point or hypercube constructions into minimax lower bounds on estimation risk.
- Any comparison-based sorting algorithm needs at least Ω(n log n) comparisons because it must resolve among n! possible orders.
- Communication complexity and the statistical query model show oracle-based limits when access to data is restricted or noisy.
- Differential privacy imposes accuracy costs; for example, private mean estimation must incur error at least Ω(1/(nε)).
- Cryptographic assumptions imply hardness of some learning tasks even with abundant data, limiting polynomial-time learnability.
- These bounds guide ML practice by clarifying when more data, more features, or weaker privacy are necessary for target accuracy.
Prerequisites
- Probability Theory Basics — Entropy, mutual information, and KL divergence are defined over random variables and distributions.
- Asymptotic Analysis — Understanding big-O, Stirling’s approximation, and scaling with n and d is essential for interpreting bounds.
- Hypothesis Testing — Le Cam’s method reduces estimation to testing; total variation and error probabilities come from testing theory.
- Information Theory — Core concepts like entropy, mutual information, and data processing underpin Fano and related inequalities.
- Optimization and Convexity — Minimax definitions and some proofs use convexity, Jensen’s inequality, and duality ideas.
- Algorithmic Models — Decision trees, oracles, and communication protocols define the resources being lower bounded.
- Differential Privacy Fundamentals — To understand privacy–accuracy trade-offs and sensitivity calibration for mechanisms like Laplace.
- Discrete Mathematics — Combinatorics over permutations (n!), hypercube constructions, and counting arguments appear frequently.
Detailed Explanation
01 Overview
Information-theoretic lower bounds are rigorous statements about what no algorithm can do, regardless of strategy, runtime, or computational power (unless otherwise restricted). They quantify minimal sample sizes, unavoidable errors, or number of queries required to solve learning and decision problems. The core idea is to measure the information the observed data carries about the unknown quantity (labels, parameters, or target function). If the information is too small, then any method must fail with some probability or suffer a certain level of error. Classic tools include Fano’s inequality, which lower bounds classification error with mutual information; Le Cam’s two-point method, which converts the difficulty of distinguishing two close distributions into estimation lower bounds; and Assouad’s lemma, which leverages a hypercube of hard instances to scale lower bounds with dimension. Outside pure statistics, decision-tree and communication complexity prove inherent limits when only comparisons or limited messages are allowed. In privacy-constrained learning, differential privacy creates necessary noise that limits accuracy. Together, these results define what is fundamentally achievable in AI/ML, independent of any particular algorithm, and help practitioners set realistic targets for data collection, model design, and privacy choices.
02 Intuition & Analogies
Imagine you’re trying to guess a hidden word with yes/no questions. If there are a million possibilities, you need about 20 questions (since 2^{20} ≈ 10^6). With fewer questions, many words look indistinguishable—no strategy beats that. Learning works similarly: your data provides a finite amount of information (like answers to a limited set of questions). If the task requires resolving among many possibilities, and the data doesn’t carry enough information, then error is unavoidable. Fano’s inequality formalizes this: if the mutual information between features X and label Y is small, even the best classifier must mislabel often. Le Cam’s method is like testing whether a coin is slightly biased toward heads or toward tails; if the bias is tiny and you flip only a few times, the two worlds look almost the same, so any decision will be wrong with substantial probability—hence any estimator of the bias must have error at least that large. Assouad’s lemma scales this idea: instead of two worlds, consider a d-dimensional switchboard with many on/off settings; if each coordinate is hard to detect, then the total difficulty multiplies by d. In comparison-based sorting, you learn the order only through pairwise comparisons—equivalently, binary questions—so resolving one of n! permutations needs about log2(n!) answers. Privacy is another intuition pump: to hide any one person’s data, you must inject noise. That noise erases some information, so accuracy must degrade, especially when you want strong privacy (small ε) or you don’t have much data (small n).
03 Formal Definition
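The standard minimax formulation underlying the methods in this section (notation is ours, not taken verbatim from this text) can be written as:

```latex
% Minimax risk over a model class \mathcal{P}, with loss \ell and estimator \hat{\theta}
% built from n i.i.d. observations X^n = (X_1, \dots, X_n):
R_n^* \;=\; \inf_{\hat{\theta}} \; \sup_{P \in \mathcal{P}} \;
  \mathbb{E}_{X^n \sim P^n}\Big[\ell\big(\hat{\theta}(X^n),\, \theta(P)\big)\Big].

% An information-theoretic lower bound is a statement R_n^* \ge c\,\psi(n, d)
% that holds for every algorithm. It is typically proved by reducing estimation
% to an M-ary testing problem over a packing \{\theta_1, \dots, \theta_M\}
% and applying Fano's inequality to the uniform index V:
P_e \;\ge\; 1 - \frac{I(V;\, X^n) + \log 2}{\log M},
\qquad V \sim \mathrm{Unif}\{1, \dots, M\}.
```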
04 When to Use
Use information-theoretic lower bounds when you want to prove that no algorithm can achieve a certain accuracy with limited data or restricted access. Examples include: (1) Sample complexity planning: before training a model, estimate how many labeled examples are required to beat chance, using Fano’s inequality with an approximate mutual information budget. (2) Estimator optimality: after designing an algorithm, compare its risk to a Le Cam or Assouad lower bound to argue it is minimax-optimal (up to constants). (3) Oracle-restricted settings: if your algorithm only uses comparisons (sorting, selection) or noisy statistical queries (large-scale privacy-preserving learning), rely on decision-tree or SQ lower bounds. (4) Communication-limited distributed learning: bound the required communication to reach a target error; if your budget is less than the lower bound, rethink the architecture. (5) Privacy-constrained analytics: when deploying \(\varepsilon\)-DP, use known \(\Omega(1/(n\varepsilon))\) accuracy limits to choose \(n\) or \(\varepsilon\) that achieve acceptable utility. (6) Hardness from cryptography: if a problem is believed hard under standard cryptographic assumptions, treat that as a computational lower bound and avoid seeking polynomial-time exact learners. In short, turn to these tools before investing in algorithmic tweaks, to understand whether limitations come from insufficient information, too strong constraints, or the problem’s inherent hardness.
⚠️Common Mistakes
- Mixing up bases of logarithms: Fano’s inequality and entropy must use the same log base; switching between nats and bits without adjusting constants leads to wrong numeric bounds.
- Ignoring model assumptions: Le Cam and Assouad require carefully chosen pairs or hypercubes and a metric consistent with your loss; picking pairs that are too easy to distinguish yields vacuous bounds.
- Conflating worst-case minimax with average-case: lower bounds may be worst-case over parameters, while your application is average-case; interpret appropriately.
- Overlooking adaptivity/interaction: some bounds assume non-adaptive queries; adaptive algorithms can change the information budget (though data processing still holds).
- Treating upper bounds as lower bounds: sample complexity upper bounds from VC or Rademacher theory do not imply impossibility; you need explicit lower-bound constructions.
- Misapplying data processing: DPI says post-processing cannot increase mutual information about the target, but it doesn’t quantify how much it decreases in your specific pipeline.
- Privacy constants and composition: DP lower bounds depend on the definition (pure vs approximate DP) and composition; ignoring these can under- or over-estimate limits.
- Decision-tree model mismatch: using the \(\Omega(n\log n)\) sorting lower bound to claim impossibility for non-comparison models (e.g., radix sort on integers with bounded word size) is incorrect; the model matters.
Key Formulas
Entropy
\(H(X) = -\sum_{x} p(x)\,\log p(x)\)
Explanation: Entropy measures uncertainty in X. Use base-2 logs to get bits, or natural logs for nats; be consistent across a bound.
Mutual Information
\(I(X;Y) = H(Y) - H(Y \mid X) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\)
Explanation: Mutual information is the expected information that X reveals about Y. It’s central in Fano’s inequality and data processing.
Fano's Inequality (Classification Error)
\(P_e \;\ge\; 1 - \dfrac{I(X;Y) + 1}{\log_2 M}\)
Explanation: Any classifier on an M-class problem must have error at least this large (logs in bits; with natural logs, replace the 1 by \(\log 2\)). If the data carries little information about labels, error is necessarily high.
Fano (Conditional Entropy Form)
\(H(Y \mid \hat{Y}) \;\le\; h_b(P_e) + P_e \log_2(M-1)\)
Explanation: An equivalent form bounding conditional entropy via error probability, where \(h_b\) is the binary entropy function. Rearranging gives implicit lower bounds on \(P_e\).
Data Processing Inequality
\(I\big(Y;\, g(X)\big) \;\le\; I(Y;\, X)\)
Explanation: Post-processing X with any function g cannot increase information about Y. Feature maps cannot create predictive information.
Le Cam Two-Point Lower Bound
\(\inf_{\hat{\theta}} \max_{\theta \in \{\theta_0, \theta_1\}} \mathbb{E}_\theta\big[d(\hat{\theta}, \theta)\big] \;\ge\; \frac{\Delta}{4}\big(1 - \mathrm{TV}(P_0^{\,n}, P_1^{\,n})\big), \qquad \Delta = d(\theta_0, \theta_1)\)
Explanation: If two parameters are \(\Delta\) apart but their n-sample distributions are hard to distinguish, any estimator must incur at least this much risk.
Pinsker's Inequality
\(\mathrm{TV}(P, Q) \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)}\)
Explanation: Relates total variation to KL divergence. Useful to turn KL computations into testing or risk lower bounds.
KL for Equal-Variance Gaussians
\(\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma^2) \,\|\, \mathcal{N}(\mu_2, \sigma^2)\big) = \dfrac{(\mu_1 - \mu_2)^2}{2\sigma^2}\)
Explanation: A closed-form KL divergence that scales quadratically with mean separation. For n samples, KL scales linearly with n.
Assouad's Lemma (Schematic)
\(\inf_{\hat{\theta}} \max_{v \in \{0,1\}^d} \mathbb{E}_v\big[d(\hat{\theta}, \theta_v)\big] \;\ge\; \frac{d\,\delta}{2} \min_{\mathrm{Ham}(v, v') = 1}\big(1 - \mathrm{TV}(P_v, P_{v'})\big)\)
Explanation: Builds a d-dimensional hypercube of hard instances, with per-coordinate separation \(\delta\). If each coordinate is hard to distinguish, the total risk scales with d.
Comparison Sorting Lower Bound
\(\text{depth} \;\ge\; \lceil \log_2(n!) \rceil = \Omega(n \log n)\)
Explanation: Any decision tree distinguishing among n! permutations must have this depth. Hence \(\Omega(n \log n)\) comparisons are necessary.
Stirling-Type Lower Bound
\(\log_2(n!) \;\ge\; n \log_2 n - n \log_2 e\)
Explanation: A convenient bound for log-factorials used to simplify \(\log_2(n!)\) into \(\Omega(n \log n)\).
Fooling-Set Communication Lower Bound
\(D(f) \;\ge\; \log_2 |S|\)
Explanation: If a function f has a large fooling set S, any deterministic protocol needs at least this many bits of communication.
Statistical Query Lower Bound (Schematic)
\(q \;\ge\; \Omega\big(\mathrm{SQDIM}(\mathcal{C}) \cdot \tau^2\big)\)
Explanation: The number of SQ queries q with tolerance \(\tau\) must scale with the SQ dimension. Noisy oracles limit learnability.
DP Mean Estimation Lower Bound
\(\mathbb{E}\big[\,|\hat{\mu} - \mu|\,\big] \;\ge\; \dfrac{c}{n\varepsilon}\)
Explanation: Under \(\varepsilon\)-differential privacy, any estimator of the mean of bounded data must incur at least this much error for a universal constant c.
Mutual Information Chain Rule
\(I(\theta;\, X_1, \dots, X_n) = \sum_{i=1}^{n} I(\theta;\, X_i \mid X_1, \dots, X_{i-1})\)
Explanation: Decomposes total information across n observations. Useful when observations contribute limited incremental information.
Complexity Analysis
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Comparator that counts how many times it is invoked
struct CountingLess {
    size_t* counter; // pointer to shared counter
    bool operator()(const int& a, const int& b) const {
        (*counter)++;  // count this comparison
        return a < b;  // standard ascending order
    }
};

// Compute ceil(log2(n!)) using lgamma for stability: log(n!) = lgamma(n+1)
long long lower_bound_log2_factorial(int n) {
    long double ln_fact = lgammal(n + 1.0L); // natural log of n!
    long double lb = ln_fact / log(2.0L);    // convert to log2
    return (long long)ceill(lb);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    vector<int> ns = {1000, 2000, 5000, 10000};
    std::mt19937 rng(12345);

    cout << fixed << setprecision(2);
    for (int n : ns) {
        vector<int> a(n);
        iota(a.begin(), a.end(), 0);
        shuffle(a.begin(), a.end(), rng);

        size_t comparisons = 0;
        CountingLess cmp{&comparisons};

        // Sort with counting comparator
        sort(a.begin(), a.end(), cmp);

        long long lb = lower_bound_log2_factorial(n);
        cout << "n=" << n
             << ", comparisons=" << comparisons
             << ", ceil(log2(n!))=" << lb
             << ", ratio(comparisons / (n log2 n))="
             << (comparisons / (n * log2((long double)n)))
             << "\n";

        // Sanity check: sorted?
        if (!is_sorted(a.begin(), a.end())) {
            cerr << "Error: array not sorted!\n";
            return 1;
        }
    }
    return 0;
}
```
This program instruments std::sort with a counting comparator to empirically show that the number of comparisons scales like n log2 n, matching the decision-tree lower bound ceil(log2(n!)). We compute log2(n!) via lgamma to avoid overflow. The ratio of comparisons to n log2 n stabilizes near a constant, illustrating the Ω(n log n) necessity for any comparison-based sort.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Counts {
    vector<long long> cX, cY;      // marginals
    vector<vector<long long>> cXY; // joint counts [x][y]
    long long N = 0;
};

// Generate N samples from a K-ary symmetric channel with error prob p.
// Y ~ Uniform{0..K-1}; with prob (1-p) set X=Y; else X is uniform over the other K-1 labels.
Counts simulate_channel(int K, double p, long long N, std::mt19937& rng) {
    Counts cnt{vector<long long>(K, 0), vector<long long>(K, 0),
               vector<vector<long long>>(K, vector<long long>(K, 0)), 0};
    uniform_int_distribution<int> uniY(0, K - 1);
    uniform_int_distribution<int> uniOther(0, K - 2);
    uniform_real_distribution<double> uni01(0.0, 1.0);
    for (long long i = 0; i < N; ++i) {
        int y = uniY(rng);
        int x = y;
        if (uni01(rng) < p) {
            // pick uniformly from the other K-1 labels
            int r = uniOther(rng);
            if (r >= y) ++r; // skip y so the choice is exactly uniform over labels != y
            x = r;
        }
        cnt.cX[x]++;
        cnt.cY[y]++;
        cnt.cXY[x][y]++;
        cnt.N++;
    }
    return cnt;
}

// Entropy in bits from counts
double entropy_bits(const vector<long long>& c) {
    long long N = accumulate(c.begin(), c.end(), 0LL);
    double H = 0.0;
    for (auto v : c) if (v > 0) {
        double p = (double)v / (double)N;
        H -= p * log2(p);
    }
    return H;
}

// Conditional entropy H(Y|X) in bits using joint counts cXY[x][y]
double conditional_entropy_bits(const vector<vector<long long>>& cXY) {
    int Kx = (int)cXY.size();
    int Ky = (int)cXY[0].size();
    vector<long long> cX(Kx, 0);
    long long N = 0;
    for (int x = 0; x < Kx; ++x)
        for (int y = 0; y < Ky; ++y) {
            cX[x] += cXY[x][y];
            N += cXY[x][y];
        }
    double H = 0.0;
    for (int x = 0; x < Kx; ++x) if (cX[x] > 0) {
        double px = (double)cX[x] / (double)N;
        double H_y_given_x = 0.0;
        for (int y = 0; y < Ky; ++y) if (cXY[x][y] > 0) {
            double p_y_given_x = (double)cXY[x][y] / (double)cX[x];
            H_y_given_x -= p_y_given_x * log2(p_y_given_x);
        }
        H += px * H_y_given_x;
    }
    return H;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int K = 10;              // number of classes
    double p = 0.3;          // channel error probability
    long long N = 2'000'000; // number of samples
    std::mt19937 rng(7);

    auto cnt = simulate_channel(K, p, N, rng);

    double H_Y = entropy_bits(cnt.cY); // should be ~log2 K
    double H_Y_given_X = conditional_entropy_bits(cnt.cXY);
    double I_XY = H_Y - H_Y_given_X;   // mutual information in bits

    // Fano's inequality in base-2 logs: Pe >= 1 - (I + 1) / log2(K)
    double log2K = log2((double)K);
    double fano_lb = max(0.0, 1.0 - (I_XY + 1.0) / log2K);

    // Best classifier for this symmetric channel is to predict X; its error ~= p
    long long mismatches = 0;
    for (int x = 0; x < K; ++x)
        for (int y = 0; y < K; ++y)
            if (x != y) mismatches += cnt.cXY[x][y];
    double empirical_error = (double)mismatches / (double)cnt.N;

    cout << fixed << setprecision(4);
    cout << "K=" << K << ", p=" << p << ", N=" << N << "\n";
    cout << "H(Y) bits = " << H_Y << ", H(Y|X) bits = " << H_Y_given_X << "\n";
    cout << "I(X;Y) bits = " << I_XY << "\n";
    cout << "Fano lower bound on Pe = " << fano_lb << "\n";
    cout << "Empirical error of predictor g(X)=X = " << empirical_error << "\n";
    return 0;
}
```
We simulate a K-ary symmetric channel and estimate H(Y), H(Y|X), and I(X;Y) in bits from counts. Fano’s inequality gives a lower bound on any classifier’s error. For this channel, predicting X minimizes error and achieves error ≈ p. The bound may be loose but never violated, illustrating how information content caps achievable accuracy.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Sample Laplace(0, b) via inverse-CDF
double laplace(double b, std::mt19937& rng) {
    uniform_real_distribution<double> U(0.0, 1.0);
    double u = U(rng);
    if (u < 0.5) return b * log(2.0 * u);
    else return -b * log(2.0 * (1.0 - u));
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Underlying distribution: Bernoulli(p0), true mean = p0
    double p0 = 0.5;
    vector<int> ns = {200, 500, 1000, 2000};
    vector<double> eps = {0.2, 0.5, 1.0};
    int trials = 300; // increase for smoother curves
    std::mt19937 rng(2024);
    bernoulli_distribution Ber(p0);

    cout << fixed << setprecision(6);
    for (int n : ns) {
        // Non-private baseline
        double mae_np = 0.0;
        for (int t = 0; t < trials; ++t) {
            int s = 0;
            for (int i = 0; i < n; ++i) s += (int)Ber(rng);
            double mean = (double)s / n;
            mae_np += fabs(mean - p0);
        }
        mae_np /= trials;
        cout << "n=" << n << ", non-private MAE≈" << mae_np << "\n";

        for (double e : eps) {
            double b = 1.0 / (n * e); // Laplace scale for ε-DP mean over [0,1]
            double mae_dp = 0.0;
            for (int t = 0; t < trials; ++t) {
                int s = 0;
                for (int i = 0; i < n; ++i) s += (int)Ber(rng);
                double mean = (double)s / n;
                double priv_mean = mean + laplace(b, rng);
                // Clip to [0,1] to respect range; clipping does not break DP and helps utility metrics
                priv_mean = min(1.0, max(0.0, priv_mean));
                mae_dp += fabs(priv_mean - p0);
            }
            mae_dp /= trials;
            cout << "  ε=" << e << ": private MAE≈" << mae_dp
                 << ", theory ~ c/(nε)=" << (1.0 / (n * e)) << " (up to constants)\n";
        }
        cout << "\n";
    }
    return 0;
}
```
We estimate the mean of a Bernoulli(0.5) distribution privately using the Laplace mechanism with scale 1/(nε), which guarantees ε-DP for means of [0,1]-bounded data. The measured mean absolute error (MAE) decays like 1/√n for non-private estimation, while the private estimator pays an additional error term on the order of 1/(nε), matching the differential privacy lower bound up to constants.