Bayesian Inference
Key Points
- Bayesian inference updates prior beliefs with observed data to produce a posterior distribution P(\theta \mid D).
- The key rule is Bayes’ theorem: the posterior is proportional to likelihood times prior, normalized by the evidence.
- Conjugate priors (e.g., Beta-Binomial, Normal-Normal) yield closed-form posteriors that are fast to compute.
- Posterior predictive distributions integrate over parameter uncertainty to make future predictions with calibrated uncertainty.
- Model comparison in Bayesian inference uses the evidence (marginal likelihood), which penalizes overly complex models.
- When no closed form exists, Markov chain Monte Carlo (MCMC) and other sampling methods approximate the posterior.
- Good priors and numerical stability (e.g., using log-likelihoods) are crucial for reliable results.
- C++ implementations typically maintain sufficient statistics for O(n) updates or run MCMC with O(T) per-iteration cost.
Prerequisites
- Basic Probability and Random Variables — Understanding distributions, expectations, and conditional probability is essential for Bayes’ theorem and likelihoods.
- Bayes’ Rule — Bayesian inference is a systematic application of Bayes’ rule to parameter estimation and prediction.
- Calculus and Integration — Posteriors and predictive distributions involve integrals over parameter spaces.
- Likelihood and Estimation — Separating likelihood from prior is crucial to formulating the posterior correctly.
- C++ Programming (I/O, math, STL) — Implementing Bayesian updates and samplers requires numeric functions, containers, and random number generation.
- Numerical Stability (log-space arithmetic) — Probabilities can underflow; using logs and lgamma avoids numerical errors.
- Normal Distribution Properties — Closed-form conjugate updates and predictive distributions rely on Normal identities.
- Monte Carlo Methods — When closed forms are unavailable, sampling approximations are used to compute expectations.
Detailed Explanation
01 Overview
Bayesian inference is a framework for learning from data that treats unknown quantities (parameters, hypotheses) as random variables. Instead of producing a single best estimate, it computes a full posterior distribution P(\theta\mid D) that quantifies all plausible values of \theta given observed data D. The update is governed by Bayes’ theorem: new evidence reshapes prior beliefs proportionally to how likely the evidence would be if a parameter value were true. This yields not only point summaries (mean, median, MAP) but also credible intervals and predictive distributions that naturally capture uncertainty. In practice, Bayesian inference often uses conjugate priors to obtain closed-form updates, enabling efficient online or streaming updates. For complex models where integrals are intractable, numerical approximations such as MCMC or variational inference are employed. The approach is particularly powerful in settings with limited data, hierarchical structures, and when principled uncertainty quantification is required. By integrating prior knowledge with observed data and propagating uncertainty to predictions, Bayesian inference provides transparent and robust conclusions that go beyond single-number estimates.
02 Intuition & Analogies
Imagine you’re guessing the bias of a coin. Before tossing it, you might believe it’s roughly fair but not exactly 50-50. That belief is your prior: a distribution over possible biases. Now you flip the coin 10 times and see 7 heads. If a coin were truly fair, 7 heads is possible but not very typical; however, a coin with a stronger bias toward heads would make the data more likely. Bayesian inference formalizes this intuition: it increases the plausibility of coin biases that could have more easily produced the observed data, and decreases the plausibility of those that could not. The result is a posterior distribution—a full picture of your updated belief over all possible biases. Think of it like adjusting a spotlight: the light (belief) starts broadly on 0 to 1, and after seeing data, it brightens around the parameter values that better explain the evidence, while still illuminating nearby regions to reflect uncertainty. When you want to predict the next few coin flips, you don’t pick a single bias; instead, you average predictions over all plausible biases, weighted by how credible each is given the data. This yields predictive probabilities that are more cautious when data are scarce and more decisive as evidence accumulates, much like how a detective forms a more precise theory as clues pile up, yet always retains some doubt until overwhelming evidence arrives.
03 Formal Definition
Let \theta denote the unknown parameter(s) and D the observed data. Bayesian inference computes the posterior distribution
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)},
where P(\theta) is the prior, P(D \mid \theta) is the likelihood, and P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta is the evidence (marginal likelihood) that normalizes the posterior. Point summaries such as the posterior mean \mathbb{E}[\theta \mid D] or the MAP estimate \arg\max_\theta P(\theta \mid D) follow from the posterior, and predictions for new data \tilde{x} use the posterior predictive P(\tilde{x} \mid D) = \int P(\tilde{x} \mid \theta)\, P(\theta \mid D)\, d\theta.
04 When to Use
Use Bayesian inference when you need principled uncertainty quantification, especially with small or noisy datasets. If you possess prior domain knowledge (e.g., physical limits, historical rates), encode it as a prior to stabilize learning and reduce overfitting. In online and streaming scenarios, conjugate Bayesian models allow O(1) or O(n) updates via sufficient statistics, making them ideal for A/B testing, click-through modeling, reliability estimation, and sensor fusion. Hierarchical (multilevel) Bayesian models are powerful when data are grouped (e.g., students within schools, ads within campaigns), enabling partial pooling to share strength across related groups. For predictive tasks where decision costs matter (e.g., medical diagnostics), posterior predictive distributions provide calibrated probabilities that can be plugged into decision-theoretic rules. When models are complex or likelihoods are not analytically tractable, sampling methods like MCMC or approximate methods like variational inference allow you to still leverage the Bayesian machinery. Bayesian model comparison via the evidence (or Bayes factors) is useful when selecting among competing hypotheses while naturally penalizing complexity.
⚠️Common Mistakes
Common pitfalls include confusing the likelihood P(D\mid \theta) with the posterior P(\theta\mid D); the former scores parameters given data, while the latter is the updated belief distribution. Choosing an overconfident prior can drown out the data, whereas using an improper or too-diffuse prior without care can lead to non-identifiable or improper posteriors. Ignoring the evidence term when comparing models leads to overfitting; always use marginal likelihoods or approximations when doing model selection. Numerically, computing products of probabilities can underflow; work in log-space and use \log \Gamma (lgamma) for factorials and Beta functions. In MCMC, poor proposal scales yield low acceptance or slow mixing; always diagnose convergence with multiple chains, R-hat, effective sample size, and trace plots. Double-counting data by using it both to set the prior and as the likelihood invalidates uncertainty estimates; separate prior elicitation from data fitting. Finally, interpreting credible intervals as frequentist confidence intervals is incorrect; a 95% credible interval means that, under the posterior, the parameter lies in the interval with 95% probability, which is a different statement from frequentist coverage.
Key Formulas
Bayes' Theorem
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
Explanation: The posterior equals likelihood times prior divided by the evidence. The evidence ensures the posterior integrates to 1.
Evidence (Marginal Likelihood)
P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta
Explanation: This integral normalizes the posterior and is used for model comparison. It averages the likelihood under the prior.
Posterior Predictive
P(\tilde{x} \mid D) = \int P(\tilde{x} \mid \theta)\, P(\theta \mid D)\, d\theta
Explanation: Predictions weight each parameter value by its posterior plausibility. This integrates parameter uncertainty into forecasts.
Beta-Binomial Posterior Hyperparameters
\alpha' = \alpha + k, \qquad \beta' = \beta + (n - k)
Explanation: After observing k heads in n trials with a Beta(\alpha, \beta) prior, the posterior is Beta(\alpha', \beta'). This update adds data counts to prior pseudo-counts.
Beta-Binomial Predictive
P(K = j \mid D) = \binom{m}{j} \frac{B(\alpha' + j,\ \beta' + m - j)}{B(\alpha', \beta')}, \qquad j = 0, \dots, m
Explanation: The number of future successes in m trials has a Beta-Binomial distribution. The Beta function B(\alpha', \beta') arises from integrating out \theta.
Normal-Normal Posterior (Known Variance)
\tau_n^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}, \qquad \mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma^2} \right)
Explanation: With a Normal prior and known observation variance, the posterior is Normal. The posterior mean is a precision-weighted average of the prior mean and the sample mean.
Normal-Normal Predictive
\tilde{x} \mid D \sim \mathcal{N}\left( \mu_n,\ \sigma^2 + \tau_n^2 \right)
Explanation: The predictive variance is the sum of the noise variance and the parameter uncertainty. As data grow, \tau_n^2 shrinks and predictions become sharper.
Monte Carlo Estimator
\mathbb{E}[f(\theta) \mid D] \approx \frac{1}{M} \sum_{m=1}^{M} f(\theta^{(m)}), \qquad \theta^{(m)} \sim P(\theta \mid D)
Explanation: Expectations under the posterior can be approximated by sample averages from MCMC draws. The approximation improves as M increases.
Law of Total Variance
\operatorname{Var}(\tilde{x} \mid D) = \mathbb{E}\left[ \operatorname{Var}(\tilde{x} \mid \theta) \mid D \right] + \operatorname{Var}\left( \mathbb{E}[\tilde{x} \mid \theta] \mid D \right)
Explanation: Predictive variance decomposes into expected noise plus variance due to parameter uncertainty. This explains why Bayesian predictions are often wider than plug-in estimates.
MAP vs. MLE
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right], \qquad \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \log P(D \mid \theta)
Explanation: MAP maximizes the posterior, which is the sum of log-likelihood and log-prior. If the prior is flat, MAP reduces to MLE.
Log-Beta via Gamma
\log B(a, b) = \log \Gamma(a) + \log \Gamma(b) - \log \Gamma(a + b)
Explanation: Computing Beta functions in log-space improves numerical stability. The Gamma function generalizes factorials.
Complexity Analysis
For conjugate models, updating the posterior means updating sufficient statistics: O(1) per observation and O(n) for a batch of n data points, with O(1) memory. The Metropolis-Hastings sampler below costs O(1) per iteration for this one-dimensional model, so T iterations cost O(T), plus O(T/thin) memory for the retained samples.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute log Beta function using lgamma for numerical stability
double log_beta(double a, double b) {
    return lgamma(a) + lgamma(b) - lgamma(a + b);
}

// Compute log binomial coefficient C(n,k) using lgamma
double log_choose(int n, int k) {
    if (k < 0 || k > n) return -INFINITY;
    return lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Prior: Beta(alpha, beta)
    double alpha = 2.0;  // prior pseudo-heads
    double beta = 2.0;   // prior pseudo-tails

    // Observed data: n trials, k successes (heads)
    int n = 20;
    int k = 14;

    // Posterior hyperparameters
    double alpha_post = alpha + k;
    double beta_post = beta + (n - k);

    // Posterior summaries
    double mean_post = alpha_post / (alpha_post + beta_post);
    double var_post = (alpha_post * beta_post) /
        ((alpha_post + beta_post) * (alpha_post + beta_post) * (alpha_post + beta_post + 1.0));

    // MAP exists only if alpha_post > 1 and beta_post > 1
    double map_post;
    if (alpha_post > 1.0 && beta_post > 1.0) {
        map_post = (alpha_post - 1.0) / (alpha_post + beta_post - 2.0);
    } else {
        map_post = mean_post; // fallback
    }

    // 95% normal-approx credible interval for theta (approximation)
    double sd_post = sqrt(var_post);
    double ci_low = max(0.0, mean_post - 1.96 * sd_post);
    double ci_high = min(1.0, mean_post + 1.96 * sd_post);

    cout << fixed << setprecision(6);
    cout << "Posterior Beta(" << alpha_post << ", " << beta_post << ")\n";
    cout << "Posterior mean: " << mean_post << "\n";
    cout << "Posterior MAP: " << map_post << "\n";
    cout << "Approx 95% CI: [" << ci_low << ", " << ci_high << "]\n\n";

    // Posterior predictive for m future trials has a Beta-Binomial distribution
    int m = 5; // number of future trials
    cout << "Predictive P(K=k for m=" << m << ")\n";
    for (int kfuture = 0; kfuture <= m; ++kfuture) {
        // log p = log C(m,k) + log B(alpha'+k, beta'+m-k) - log B(alpha', beta')
        double logp = log_choose(m, kfuture)
                    + log_beta(alpha_post + kfuture, beta_post + (m - kfuture))
                    - log_beta(alpha_post, beta_post);
        cout << "k=" << kfuture << ": " << exp(logp) << "\n";
    }

    return 0;
}
```
This program performs Bayesian updating for a coin-bias problem with a Beta prior and Binomial likelihood. It computes the posterior Beta hyperparameters, the posterior mean and MAP, and an approximate 95% credible interval using the normal approximation of a Beta distribution. It also computes the Beta-Binomial predictive probabilities for k future successes in m trials using log-space computations with lgamma for numerical stability.
```cpp
#include <bits/stdc++.h>
using namespace std;

struct PosteriorNormal {
    double mu_n;   // posterior mean
    double tau2_n; // posterior variance
};

PosteriorNormal update_normal_normal(double mu0, double tau2_0, double sigma2,
                                     const vector<double>& x){
    int n = (int)x.size();
    double sumx = accumulate(x.begin(), x.end(), 0.0);
    double prec0 = 1.0 / tau2_0;  // prior precision
    double precx = n / sigma2;    // data precision
    double precn = prec0 + precx; // posterior precision
    double mu_n = (prec0 * mu0 + (sumx / sigma2)) / precn;
    double tau2_n = 1.0 / precn;
    return {mu_n, tau2_n};
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Prior: theta ~ N(mu0, tau0^2)
    double mu0 = 0.0;
    double tau2_0 = 4.0; // prior variance (tau0^2)

    // Known observation variance sigma^2
    double sigma2 = 1.0;

    // Observed data
    vector<double> x = {0.2, -0.1, 0.4, 0.3, -0.2, 0.1};

    // Posterior update
    PosteriorNormal post = update_normal_normal(mu0, tau2_0, sigma2, x);

    // 95% credible interval (Normal posterior)
    double sd_post = sqrt(post.tau2_n);
    double ci_low = post.mu_n - 1.96 * sd_post;
    double ci_high = post.mu_n + 1.96 * sd_post;

    // Posterior predictive for a new x_new: Normal(mu_n, sigma2 + tau2_n)
    double pred_mean = post.mu_n;
    double pred_var = sigma2 + post.tau2_n;
    double pred_sd = sqrt(pred_var);

    cout << fixed << setprecision(6);
    cout << "Posterior mean: " << post.mu_n << "\n";
    cout << "Posterior var: " << post.tau2_n << "\n";
    cout << "95% CI: [" << ci_low << ", " << ci_high << "]\n";
    cout << "Predictive mean: " << pred_mean << "; sd = " << pred_sd << "\n";

    return 0;
}
```
This code updates a Normal prior with Normal likelihood (known observation variance) to obtain a Normal posterior. It computes the posterior mean and variance as precision-weighted combinations of the prior and data, gives a 95% credible interval, and reports the predictive distribution parameters for a future observation.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Log posterior for Beta(alpha', beta') (up to an additive constant)
double log_posterior_beta(double theta, double a, double b) {
    if (theta <= 0.0 || theta >= 1.0) return -INFINITY;
    return (a - 1.0) * log(theta) + (b - 1.0) * log1p(-theta); // log1p(-theta) = log(1 - theta)
}

// Reflect a point into the (0,1) interval for random-walk proposals
double reflect01(double x){
    // Handle multiple crossings if the step is large
    while (x < 0.0 || x > 1.0) {
        if (x < 0.0) x = -x;      // reflect below 0
        if (x > 1.0) x = 2.0 - x; // reflect above 1
    }
    // Avoid exactly 0 or 1, which are invalid for log
    const double eps = 1e-12;
    if (x <= 0.0) x = eps;
    if (x >= 1.0) x = 1.0 - eps;
    return x;
}

int main(){
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Data: k successes out of n trials with a Beta(alpha, beta) prior
    int n = 50, k = 32;
    double alpha = 2.0, beta = 2.0; // prior
    double a_post = alpha + k;       // posterior alpha'
    double b_post = beta + (n - k);  // posterior beta'

    // MCMC settings
    int burn_in = 5000;
    int iters = 50000;
    int thin = 10;
    double step_sd = 0.05; // proposal std dev for random walk in theta-space

    mt19937_64 rng(random_device{}());
    normal_distribution<double> norm(0.0, step_sd);
    uniform_real_distribution<double> unif(0.0, 1.0);

    // Initialize at the posterior mean
    double theta = a_post / (a_post + b_post);
    vector<double> samples;
    samples.reserve((iters - burn_in) / max(1, thin));

    int accepted = 0;
    for (int t = 0; t < iters; ++t) {
        double prop = reflect01(theta + norm(rng));
        double logp_cur = log_posterior_beta(theta, a_post, b_post);
        double logp_prop = log_posterior_beta(prop, a_post, b_post);
        double loga = logp_prop - logp_cur; // symmetric proposal -> Hastings ratio cancels
        if (loga >= 0.0 || log(unif(rng)) < loga) {
            theta = prop;
            if (t >= burn_in) ++accepted;
        }
        if (t >= burn_in && ((t - burn_in) % thin == 0)) {
            samples.push_back(theta);
        }
    }

    // Posterior mean estimate from the samples
    double mean_est = accumulate(samples.begin(), samples.end(), 0.0) / samples.size();

    cout << fixed << setprecision(6);
    cout << "Samples kept: " << samples.size() << "\n";
    cout << "Acceptance rate (post-burn-in): "
         << (double)accepted / max(1, iters - burn_in) << "\n";
    cout << "Posterior mean (MC): " << mean_est << "\n";
    cout << "Posterior mean (analytic): " << a_post / (a_post + b_post) << "\n";

    return 0;
}
```
This example samples from a univariate Beta posterior using a Metropolis-Hastings random-walk proposal. It demonstrates how to construct a log-posterior, handle parameter constraints with reflection into (0,1), run burn-in and thinning, and estimate posterior means from samples. The Monte Carlo estimate is compared with the analytic posterior mean for validation.