Differential Privacy Theory
Key Points
- Differential privacy (DP) guarantees that the output distribution of a randomized algorithm changes only slightly when one person’s data is added or removed.
- The (ε, δ) definition bounds how much the probability of any output can change between adjacent datasets; smaller ε and δ mean stronger privacy.
- Laplace and Gaussian mechanisms add carefully calibrated noise based on a query’s sensitivity to protect privacy.
- The exponential mechanism privately selects a discrete output using a utility score, balancing usefulness and privacy.
- Privacy costs add up under composition; advanced composition shows roughly O(√k) growth in ε for k queries.
- Post-processing cannot hurt privacy: any function applied to a DP output remains DP with the same parameters.
- DP-SGD trains models privately by clipping per-example gradients and adding Gaussian noise, with privacy amplified by subsampling minibatches.
- DP is essential in AI/ML for safe data analysis, legal compliance, and building trust with users and stakeholders.
Prerequisites
- Probability and random variables — DP is defined via probabilities over randomized mechanisms and requires understanding distributions and independence.
- Calculus and linear algebra (vectors, norms) — Sensitivity and DP-SGD clipping use L1/L2 norms; gradient-based learning requires basic calculus and vector algebra.
- Algorithm analysis (Big-O) — To understand performance trade-offs of DP mechanisms and their feasibility on large datasets.
- Statistics (means, variance) — Noise calibration and interpretation of noisy statistics rely on variance and concentration ideas.
- Optimization and SGD — DP-SGD modifies standard SGD with clipping and noise; knowing baseline SGD helps.
- Measure/continuous probability (light) — Formal DP definitions reference measurable sets and densities; a light touch helps with rigor.
Detailed Explanation
01 Overview
Differential Privacy (DP) is a rigorous framework for designing algorithms that learn from data while protecting each individual’s information. Imagine answering questions about a dataset—counts, averages, or even training a model—without revealing whether any specific person participated. DP achieves this by injecting randomness (noise) into outputs so that the presence or absence of a single individual changes the output distribution only slightly. A mechanism is (ε, δ)-differentially private if, for any two adjacent datasets (differing by one person) and any set of possible outputs, the probabilities of producing outputs from these datasets are nearly the same—bounded by ε and δ. Smaller parameters mean stronger privacy.
The core design pattern is to measure how sensitive a statistic is to one person’s data (its sensitivity) and then add calibrated noise that obscures any individual’s effect. The Laplace mechanism is used for L1 sensitivity; the Gaussian mechanism is used for L2 sensitivity in the (ε, δ) setting. For selecting among discrete options (like picking a “best” item), the exponential mechanism uses a utility score to bias noisy selection toward high-utility outputs.
DP supports critical principles: composition (privacy degrades with multiple queries), post-processing invariance (any transformation of a DP output remains DP), and privacy amplification (subsampling can improve effective privacy). In machine learning, DP-SGD integrates these ideas by clipping gradients, adding noise, and tracking the cumulative privacy loss. The result is a mathematically provable guarantee that training doesn’t leak too much about any one person—vital for trustworthy analytics and compliant AI systems.
02 Intuition & Analogies
Picture a stadium where each spectator can clap once (participate) or not at all (not participate). If a reporter wants to know how exciting the game is without learning about any single person, the stadium pipes in a small amount of artificial background noise. Now, whether one spectator claps or stays silent hardly shifts the overall volume, so the reporter cannot tell for sure whether that person participated. That’s differential privacy: we add the right kind of noise so that any single person’s effect is washed out in the crowd.
Another analogy: think of a photo filter that lightly blurs a group picture. With the blur tuned just right, you can still see the group’s overall mood and composition (the signal), but it’s hard to recognize any single face (the individual). The blur strength corresponds to ε and δ: a smaller ε is like a stronger blur. Sensitivity measures how much the picture would change if you remove one face; a more sensitive statistic needs a stronger blur (more noise) to hide individuals.
Choosing among options privately is like running a lottery where better options get more tickets, but you still randomize the winner so that any one person’s data can’t deterministically force the outcome. That’s the exponential mechanism: it skews randomness toward good choices using a utility score, yet it preserves plausible deniability for each participant.
In training neural networks, imagine each example whispering a suggestion (its gradient). You cap each whisper’s volume (clip gradients) so no single voice can shout. Then you add a gentle hiss (Gaussian noise) to the combined whisper before updating the model. Even if someone eavesdrops on the updates, they can’t confidently tell if a particular example was in the training set. Repeating this process many times adds up privacy costs, but careful accounting and subsampling help keep the total risk low.
03 Formal Definition
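Concretely, and consistent with the overview above, the standard definition reads: a randomized mechanism M is (ε, δ)-differentially private if, for all adjacent datasets D and D′ (differing in one person’s record) and all measurable output sets S,

\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta.

Setting δ = 0 gives pure ε-DP. The parameter ε bounds the multiplicative change in output probabilities, while δ permits a small additive slack; the Key Formulas section below restates this together with the mechanisms that achieve it.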
04 When to Use
Use differential privacy whenever releasing statistics, training models, or sharing insights from sensitive datasets where individual-level confidentiality matters. Common scenarios include publishing counts or averages (e.g., number of users in a region), releasing histograms, answering interactive queries, or deploying ML models trained on user data. If stakeholders demand formal guarantees beyond “we anonymized the data,” DP provides a quantifiable, legally relevant standard.
Choose the Laplace mechanism for numeric queries with known L1 sensitivity (e.g., counts); prefer the Gaussian mechanism for vector-valued or high-dimensional outputs under (ε, δ)-DP using L2 sensitivity. When selecting a discrete item (like the best hyperparameter or the top category) from a set, the exponential mechanism balances utility and privacy using a score function with bounded sensitivity. For model training, adopt DP-SGD: clip per-example gradients, add Gaussian noise, and track privacy with an accountant.
DP is also useful in data exploration workflows: if analysts iteratively query data, composition theorems keep track of the total privacy budget. Subsampling (minibatching) can amplify privacy, making iterative training or analysis more efficient. Finally, DP is valuable even for internal analytics where insider threats or accidental disclosures are concerns—post-processing invariance lets you safely compute downstream metrics from DP outputs.
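Below is a minimal sketch of such budget tracking under basic composition (the PrivacyBudget struct and its method names are illustrative, not from a specific library): each query requests an (ε, δ) cost, and the tracker refuses queries once the fixed total budget would be exceeded.

```cpp
#include <iostream>

// Minimal privacy-budget tracker using basic composition: costs add up
// linearly, and a query is refused once the total budget is exhausted.
struct PrivacyBudget {
    double eps_total, delta_total;           // overall (ε, δ) budget
    double eps_spent = 0.0, delta_spent = 0.0;

    PrivacyBudget(double eps, double delta) : eps_total(eps), delta_total(delta) {}

    // Returns true if the query may run; records the spend if so.
    bool try_spend(double eps, double delta) {
        if (eps_spent + eps > eps_total || delta_spent + delta > delta_total)
            return false;                    // budget exceeded: refuse the query
        eps_spent += eps;
        delta_spent += delta;
        return true;
    }
};

int main() {
    PrivacyBudget budget(1.0, 1e-6);          // total (ε, δ) budget
    for (int i = 0; i < 5; ++i) {
        bool ok = budget.try_spend(0.3, 0.0); // each query costs ε = 0.3
        std::cout << "query " << i << (ok ? " answered" : " refused")
                  << ", ε spent so far = " << budget.eps_spent << "\n";
    }
    return 0;
}
```

With a total budget of ε = 1.0, the first three queries at ε = 0.3 each are answered and later ones are refused; for many small queries, an advanced-composition or Rényi accountant would give a noticeably tighter total.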
⚠️ Common Mistakes
- Underestimating sensitivity: Failing to bound or clip inputs leads to unbounded sensitivity, requiring infinite noise for DP. Always enforce data bounds (e.g., truncate to [a, b]) or clip per-example gradients in DP-SGD; see the sketch after this list.
- Misinterpreting ε and δ: Choosing ε too large (e.g., ε ≫ 10) or δ comparable to 1/n (or worse) weakens privacy greatly. A rule of thumb is δ ≪ 1/n and ε in a modest range (e.g., 0.1–5), depending on context and policy.
- Ignoring composition: Answering many queries with the same ε without accounting leads to much weaker overall privacy. Use basic or advanced composition (or Rényi/moments accountants) to track the cumulative privacy loss.
- Data-dependent parameter tuning without privacy: Calibrating noise using the private data itself (e.g., using the empirical max to set bounds) leaks information. Either use public side information, privacy-preserving estimation, or reserve part of the budget for parameter selection via the exponential mechanism.
- Assuming post-processing can fix non-DP outputs: Only outputs computed by a DP mechanism are protected; transforming a non-DP result doesn’t retroactively make it DP.
- Using poor randomness: In practice, use a high-quality RNG; avoid patterns like seeding with timestamps that can make outputs more predictable. Consider secure RNGs in adversarial settings.
- Floating-point and implementation leaks: Branches or exceptions conditioned on data can leak information; stick to numerically stable, constant-time-ish implementations when feasible, and document assumptions.
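To make the first pitfall concrete, here is a minimal sketch of bounding sensitivity by truncation (the income values and the range [a, b] below are invented for illustration): clamping every record to a publicly known range [a, b] bounds the L1 sensitivity of the sum by b − a, so Laplace noise with scale (b − a)/ε gives ε-DP for the sum.

```cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Truncate each value to an assumed public range [a, b] before summing.
// The truncated sum has L1 sensitivity (b - a), so Laplace noise with
// scale (b - a) / ε yields ε-DP. Values and bounds here are illustrative.
int main() {
    std::vector<double> incomes = {42000, 51000, 38000, 1e9 /* outlier */, 47000};
    double a = 0.0, b = 200000.0;   // assumed public bounds, not derived from the data
    double epsilon = 1.0;

    double total = 0.0;
    for (double x : incomes) total += std::clamp(x, a, b);  // enforce bounds

    // Laplace noise via the inverse-CDF branch form
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_real_distribution<double> U(0.0, 1.0);
    double u = U(rng);
    double scale = (b - a) / epsilon;
    double noise = (u < 0.5) ? scale * std::log(2.0 * u)
                             : -scale * std::log(2.0 * (1.0 - u));

    std::cout << "Noisy truncated sum: " << total + noise << "\n";
    return 0;
}
```

The key point is that the bounds must come from public knowledge or a privacy-preserving estimate, not from inspecting the private data (see the parameter-tuning pitfall above).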
Key Formulas
(ε, δ)-DP Definition
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
Explanation: A mechanism M is (ε, δ)-differentially private if, for any adjacent datasets D and D' and any event S, the probability of S under D is at most e^ε times that under D' plus δ. This bounds how much any one person can influence the output distribution.
L1 Sensitivity
\Delta_1 f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_1
Explanation: L1 sensitivity is the maximum L1 distance between the outputs on adjacent datasets (differing in one record). It determines the scale of Laplace noise needed for pure DP on scalar/vector queries.
L2 Sensitivity
\Delta_2 f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_2
Explanation: L2 sensitivity is the maximum L2 distance between the outputs when a single record changes. It is used to calibrate Gaussian noise under (ε, δ)-DP.
Laplace Mechanism
M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta_1 f}{\varepsilon}\right)
Explanation: Adding Laplace noise with scale b equal to the sensitivity divided by ε yields ε-DP for real-valued queries. Larger sensitivity or smaller ε requires more noise.
Gaussian Mechanism Calibration
\sigma \;\ge\; \frac{\Delta_2 f\,\sqrt{2 \ln(1.25/\delta)}}{\varepsilon}
Explanation: For ε ∈ (0, 1) and 0 < δ < 0.5, adding Gaussian noise with standard deviation σ at least this large ensures (ε, δ)-DP. This is a commonly used sufficient condition.
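A small sketch of this calibration (the sensitivity and privacy parameters below are illustrative): compute σ from Δ2, ε, and δ, then add independent Gaussian noise to each coordinate of a vector-valued answer such as a small histogram.

```cpp
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Calibrate Gaussian noise from L2 sensitivity, ε, and δ using
// σ = Δ2 * sqrt(2 ln(1.25/δ)) / ε, then perturb each coordinate.
int main() {
    double delta2 = 1.0;              // L2 sensitivity of the query (assumed known)
    double epsilon = 0.9, delta = 1e-6;

    double sigma = delta2 * std::sqrt(2.0 * std::log(1.25 / delta)) / epsilon;

    std::vector<double> true_answer = {12.0, 7.0, 3.0};   // e.g., a small histogram
    std::mt19937_64 rng(std::random_device{}());
    std::normal_distribution<double> N(0.0, sigma);

    std::cout << "sigma = " << sigma << "\nnoisy answer:";
    for (double v : true_answer) std::cout << " " << v + N(rng);
    std::cout << "\n";
    return 0;
}
```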
Exponential Mechanism
\Pr[M(D) = r] \;\propto\; \exp\!\left(\frac{\varepsilon\, u(D, r)}{2\,\Delta u}\right)
Explanation: The probability of selecting result r grows exponentially with its utility u, scaled by ε and the utility’s sensitivity Δu. This enables private selection among discrete outcomes.
Basic Composition
\varepsilon_{\text{total}} = \sum_{i=1}^{k} \varepsilon_i, \qquad \delta_{\text{total}} = \sum_{i=1}^{k} \delta_i
Explanation: Running k mechanisms, each (ε, δ)-DP, results in an overall (kε, kδ)-DP guarantee. This linear bound is simple but not always tight.
Advanced Composition (One Form)
\varepsilon_{\text{total}} = \sqrt{2k \ln(1/\delta')}\;\varepsilon + k\varepsilon\,(e^{\varepsilon} - 1), \qquad \delta_{\text{total}} = k\delta + \delta'
Explanation: A tighter composition bound introduces an extra small parameter δ′. The total privacy loss grows roughly like √k · ε, plus a term that behaves like kε² for small ε.
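As a quick numeric illustration (parameter values chosen arbitrarily), the sketch below evaluates this bound for k = 100 queries at ε = 0.1 and compares it to the basic bound kε.

```cpp
#include <cmath>
#include <iostream>

// Compare basic composition (k * ε) with one form of advanced composition:
// ε' = sqrt(2k ln(1/δ')) * ε + k * ε * (e^ε − 1).
int main() {
    double eps = 0.1;
    double delta_prime = 1e-6;
    int k = 100;

    double basic = k * eps;
    double advanced = std::sqrt(2.0 * k * std::log(1.0 / delta_prime)) * eps
                      + k * eps * (std::exp(eps) - 1.0);

    std::cout << "basic composition:    eps_total = " << basic << "\n";
    std::cout << "advanced composition: eps_total ~= " << advanced
              << " (with extra delta' = " << delta_prime << ")\n";
    return 0;
}
```

Here advanced composition gives roughly ε ≈ 6.3 versus 10 for basic composition, and the advantage grows with k.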
Post-processing Invariance
M \text{ is } (\varepsilon, \delta)\text{-DP} \;\Longrightarrow\; g \circ M \text{ is } (\varepsilon, \delta)\text{-DP for any (randomized) } g
Explanation: Applying any further processing to a DP output cannot worsen its privacy parameters. This allows safe downstream analytics on DP outputs.
Privacy Amplification by Poisson Subsampling
\varepsilon_{\text{sub}} = \ln\!\bigl(1 + q\,(e^{\varepsilon} - 1)\bigr), \qquad \delta_{\text{sub}} = q\,\delta
Explanation: When each record is included independently with probability q, the effective privacy parameters improve. This helps in minibatch training such as DP-SGD.
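The improvement is easy to compute directly (values below are illustrative): for an ε-DP mechanism run on a Poisson subsample with rate q, the amplified parameter is ln(1 + q(e^ε − 1)).

```cpp
#include <cmath>
#include <iostream>

// Amplified privacy parameter under Poisson subsampling with rate q:
// ε_amplified = ln(1 + q * (e^ε − 1)); δ scales to q * δ.
int main() {
    double eps = 1.0;
    double q = 0.01;   // each record included independently with probability q
    double eps_amp = std::log(1.0 + q * (std::exp(eps) - 1.0));
    std::cout << "eps = " << eps << ", q = " << q
              << " -> amplified eps ~= " << eps_amp << "\n";  // about 0.017
    return 0;
}
```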
DP-SGD Noisy Gradient
\tilde{g} = \frac{1}{|B|}\left(\sum_{i \in B} \bar{g}_i + \mathcal{N}(0,\ \sigma^2 C^2 I)\right), \qquad \bar{g}_i = g_i \cdot \min\!\left(1,\ \frac{C}{\lVert g_i \rVert_2}\right)
Explanation: Per-example gradients are clipped to L2 norm C, summed, and Gaussian noise with standard deviation σC is added before averaging. This is the core DP-SGD update used in private deep learning.
Complexity Analysis
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Sample from Laplace(0, b) using the inverse CDF:
// if U ~ Uniform(0,1), then X = -b * sign(U - 0.5) * ln(1 - 2|U - 0.5|).
// We implement the equivalent branch form for numerical stability.
double sample_laplace(double b, mt19937_64 &rng) {
    uniform_real_distribution<double> U(0.0, 1.0);
    double u = U(rng);
    if (u < 0.5) {
        return b * log(2.0 * u);          // negative value
    } else {
        return -b * log(2.0 * (1.0 - u)); // positive value
    }
}

// Private count query with Laplace noise. Sensitivity Δ1 = 1 for counts.
int dp_count_with_laplace(const vector<int> &binary_data, double epsilon, mt19937_64 &rng) {
    // Enforce bounds: treat any nonzero as 1. This keeps sensitivity at 1.
    int true_count = 0;
    for (int x : binary_data) true_count += (x != 0);

    double b = 1.0 / max(epsilon, 1e-12); // scale = Δ1/ε, with guard
    double noise = sample_laplace(b, rng);
    // Post-process: round to nearest int and clamp to the valid range if desired.
    int noisy = (int) llround((double)true_count + noise);
    noisy = max(0, min((int)binary_data.size(), noisy));
    return noisy;
}

int main() {
    random_device rd; mt19937_64 rng(rd());

    // Example dataset of 0/1 participation flags
    vector<int> data = {1,0,1,1,1,0,0,1,1,0,1,0,1};

    double epsilon = 0.5; // stronger privacy when smaller
    int noisy_count = dp_count_with_laplace(data, epsilon, rng);

    cout << "Noisy count (ε=" << epsilon << ") = " << noisy_count << "\n";
    return 0;
}
```
This program releases a differentially private count using the Laplace mechanism. Counts have L1 sensitivity 1, so adding Laplace noise with scale 1/ε gives ε-DP (δ=0). The code enforces bounds by mapping inputs to {0,1}, samples Laplace noise via inverse CDF, adds it to the true count, and clamps to a plausible range in post-processing (which does not affect privacy).
```cpp
#include <bits/stdc++.h>
using namespace std;

// Numerically stable sampling from probabilities proportional to exp(alpha * u_i)
int sample_exponential_mechanism(const vector<double> &utilities, double alpha, mt19937_64 &rng) {
    // alpha = ε / (2Δu). Ensure Δu is known and bounded by design.
    double m = *max_element(utilities.begin(), utilities.end());
    vector<double> weights;
    weights.reserve(utilities.size());
    for (double u : utilities) weights.push_back(exp(alpha * (u - m))); // log-sum-exp trick
    double Z = accumulate(weights.begin(), weights.end(), 0.0);
    uniform_real_distribution<double> U(0.0, Z);
    double r = U(rng);
    double c = 0.0;
    for (int i = 0; i < (int)weights.size(); ++i) {
        c += weights[i];
        if (r <= c) return i;
    }
    return (int)weights.size() - 1; // fallback for floating-point edge cases
}

int main() {
    random_device rd; mt19937_64 rng(rd());

    // Suppose we want to privately choose the most popular category among m categories.
    // Utility u(D, r) is the count of category r in the dataset (Δu = 1).
    vector<int> data = {0,1,2,2,2,1,0,2,0,1,1,2,0,1,2};
    int m = 3; // categories 0, 1, 2

    vector<double> counts(m, 0.0);
    for (int x : data) {
        if (0 <= x && x < m) counts[x] += 1.0; // enforce bounds
    }

    double epsilon = 0.8;  // privacy budget for this selection
    double delta_u = 1.0;  // sensitivity of the count utility
    double alpha = epsilon / (2.0 * delta_u);

    int chosen = sample_exponential_mechanism(counts, alpha, rng);

    cout << "Privately chosen category (ε=" << epsilon << ") = " << chosen << "\n";
    return 0;
}
```
We privately select a category with probability proportional to exp(ε u / (2Δu)), where u is the count of that category and Δu=1. The log-sum-exp trick ensures numerical stability. This is useful for choosing top items, hyperparameters, or actions while preserving privacy.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct Example { double x, y; };

// Clip a 2D gradient [gw, gb] to L2 norm C
array<double,2> clip_grad(array<double,2> g, double C) {
    double norm = sqrt(g[0]*g[0] + g[1]*g[1]);
    if (norm <= C || norm == 0.0) return g;
    double scale = C / norm;
    return {g[0]*scale, g[1]*scale};
}

int main() {
    random_device rd; mt19937_64 rng(rd());

    // Tiny synthetic dataset for y ≈ w*x + b
    vector<Example> data = {{-2, -3.8}, {-1, -2.0}, {0, 0.2}, {1, 2.1}, {2, 3.9}};

    // Model parameters
    double w = 0.0, b = 0.0;
    double lr = 0.1;               // learning rate
    double C = 1.0;                // clipping norm
    double noise_multiplier = 1.0; // z in σ = z * C

    // Build a minibatch by subsampling each example with probability q
    double q = 0.6; // subsampling rate (amplifies privacy)
    bernoulli_distribution take(q);

    vector<array<double,2>> per_example_grads;
    for (auto &ex : data) {
        if (!take(rng)) continue;
        // Per-example gradient of squared error L = (w*x + b - y)^2
        double err = (w * ex.x + b) - ex.y;
        array<double,2> g = {2.0 * err * ex.x, 2.0 * err}; // [dL/dw, dL/db]
        per_example_grads.push_back(clip_grad(g, C));
    }

    if (per_example_grads.empty()) {
        cout << "Empty minibatch; no update this step.\n";
        return 0;
    }

    // Aggregate clipped gradients
    array<double,2> sumg = {0.0, 0.0};
    for (auto &g : per_example_grads) { sumg[0] += g[0]; sumg[1] += g[1]; }

    // Add Gaussian noise with stddev σ*C to each coordinate (σ = noise_multiplier)
    normal_distribution<double> N(0.0, noise_multiplier * C);
    sumg[0] += N(rng);
    sumg[1] += N(rng);

    // Average by batch size
    double B = (double)per_example_grads.size();
    array<double,2> avg = {sumg[0] / B, sumg[1] / B};

    // Parameter update
    w -= lr * avg[0];
    b -= lr * avg[1];

    cout << fixed << setprecision(4);
    cout << "After one DP-SGD step: w=" << w << ", b=" << b << " (B=" << B << ")\n";
    cout << "Note: Computing final (ε, δ) requires a privacy accountant (e.g., RDP).\n";
    return 0;
}
```
This code performs one DP-SGD step for 1D linear regression. It builds a subsampled minibatch, computes per-example gradients, clips them to L2 norm C, adds Gaussian noise with stddev σ·C (σ=noise multiplier), averages, and updates parameters. In practice, an RDP/moments accountant is used over many steps to compute the final (ε, δ), leveraging subsampling amplification.