How I Study AI - Learn AI Papers & Lectures the Easy Way
📚 Theory · Intermediate

Bayesian Inference

Key Points

  • Bayesian inference updates prior beliefs with observed data to produce a posterior distribution P(θ∣D).
  • The key rule is Bayes’ theorem: the posterior is proportional to likelihood times prior, normalized by the evidence.
  • Conjugate priors (e.g., Beta-Binomial, Normal-Normal) yield closed-form posteriors that are fast to compute.
  • Posterior predictive distributions integrate over parameter uncertainty to make future predictions with calibrated uncertainty.
  • Model comparison in Bayesian inference uses the evidence (marginal likelihood), which penalizes overly complex models.
  • When no closed form exists, Markov chain Monte Carlo (MCMC) and other sampling methods approximate the posterior.
  • Good priors and numerical stability (e.g., using log-likelihoods) are crucial for reliable results.
  • C++ implementations typically maintain sufficient statistics for O(n) updates or run MCMC with O(T) per-iteration cost.

Prerequisites

  • Basic Probability and Random Variables — Understanding distributions, expectations, and conditional probability is essential for Bayes’ theorem and likelihoods.
  • Bayes’ Rule — Bayesian inference is a systematic application of Bayes’ rule to parameter estimation and prediction.
  • Calculus and Integration — Posteriors and predictive distributions involve integrals over parameter spaces.
  • Likelihood and Estimation — Separating likelihood from prior is crucial to formulating the posterior correctly.
  • C++ Programming (I/O, math, STL) — Implementing Bayesian updates and samplers requires numeric functions, containers, and random number generation.
  • Numerical Stability (log-space arithmetic) — Probabilities can underflow; using logs and lgamma avoids numerical errors.
  • Normal Distribution Properties — Closed-form conjugate updates and predictive distributions rely on Normal identities.
  • Monte Carlo Methods — When closed forms are unavailable, sampling approximations are used to compute expectations.

Detailed Explanation


01 Overview

Bayesian inference is a framework for learning from data that treats unknown quantities (parameters, hypotheses) as random variables. Instead of producing a single best estimate, it computes a full posterior distribution P(θ∣D) that quantifies all plausible values of θ given observed data D. The update is governed by Bayes’ theorem: new evidence reshapes prior beliefs proportionally to how likely the evidence would be if a parameter value were true. This yields not only point summaries (mean, median, MAP) but also credible intervals and predictive distributions that naturally capture uncertainty. In practice, Bayesian inference often uses conjugate priors to obtain closed-form updates, enabling efficient online or streaming updates. For complex models where integrals are intractable, numerical approximations such as MCMC or variational inference are employed. The approach is particularly powerful in settings with limited data, hierarchical structures, and when principled uncertainty quantification is required. By integrating prior knowledge with observed data and propagating uncertainty to predictions, Bayesian inference provides transparent and robust conclusions that go beyond single-number estimates.

02 Intuition & Analogies

Imagine you’re guessing the bias of a coin. Before tossing it, you might believe it’s roughly fair but not exactly 50-50. That belief is your prior: a distribution over possible biases. Now you flip the coin 10 times and see 7 heads. If a coin were truly fair, 7 heads is possible but not very typical; however, a coin with a stronger bias toward heads would make the data more likely. Bayesian inference formalizes this intuition: it increases the plausibility of coin biases that could have more easily produced the observed data, and decreases the plausibility of those that could not. The result is a posterior distribution—a full picture of your updated belief over all possible biases. Think of it like adjusting a spotlight: the light (belief) starts broadly on 0 to 1, and after seeing data, it brightens around the parameter values that better explain the evidence, while still illuminating nearby regions to reflect uncertainty. When you want to predict the next few coin flips, you don’t pick a single bias; instead, you average predictions over all plausible biases, weighted by how credible each is given the data. This yields predictive probabilities that are more cautious when data are scarce and more decisive as evidence accumulates, much like how a detective forms a more precise theory as clues pile up, yet always retains some doubt until overwhelming evidence arrives.

03 Formal Definition

Given data D and parameter θ, Bayesian inference computes the posterior distribution via Bayes’ theorem: P(θ∣D) = P(D∣θ)P(θ) / P(D), where P(D) = ∫ P(D∣θ)P(θ) dθ is the evidence (marginal likelihood) ensuring normalization. The likelihood P(D∣θ) specifies how probable the data are under different parameter values, and the prior P(θ) encodes beliefs before seeing data. The posterior predictive distribution for a new observation x_new integrates parameter uncertainty: P(x_new∣D) = ∫ P(x_new∣θ)P(θ∣D) dθ. For conjugate prior–likelihood pairs (common in exponential families), the posterior belongs to the same family as the prior, characterized by updated hyperparameters that combine prior strength and sufficient statistics of the data. Point summaries include the posterior mean E[θ∣D], the median, and the maximum a posteriori (MAP) estimate argmax_θ P(θ∣D), with MAP equivalently maximizing log P(D∣θ) + log P(θ). When integrals are intractable, we approximate expectations E[f(θ)∣D] using sampling-based estimators (e.g., MCMC): (1/M) ∑_{m=1}^{M} f(θ^(m)), where {θ^(m)} are draws from P(θ∣D).

04 When to Use

Use Bayesian inference when you need principled uncertainty quantification, especially with small or noisy datasets. If you possess prior domain knowledge (e.g., physical limits, historical rates), encode it as a prior to stabilize learning and reduce overfitting. In online and streaming scenarios, conjugate Bayesian models allow O(1) or O(n) updates via sufficient statistics, making them ideal for A/B testing, click-through modeling, reliability estimation, and sensor fusion. Hierarchical (multilevel) Bayesian models are powerful when data are grouped (e.g., students within schools, ads within campaigns), enabling partial pooling to share strength across related groups. For predictive tasks where decision costs matter (e.g., medical diagnostics), posterior predictive distributions provide calibrated probabilities that can be plugged into decision-theoretic rules. When models are complex or likelihoods are not analytically tractable, sampling methods like MCMC or approximate methods like variational inference allow you to still leverage the Bayesian machinery. Bayesian model comparison via the evidence (or Bayes factors) is useful when selecting among competing hypotheses while naturally penalizing complexity.

⚠️Common Mistakes

Common pitfalls include confusing the likelihood P(D∣θ) with the posterior P(θ∣D); the former scores parameters given data, while the latter is the updated belief distribution. Choosing an overconfident prior can drown out the data, whereas using an improper or too-diffuse prior without care can lead to non-identifiable or improper posteriors. Ignoring the evidence term when comparing models leads to overfitting; always use marginal likelihoods or approximations when doing model selection. Numerically, computing products of probabilities can underflow; work in log-space and use log Γ (lgamma) for factorials and Beta functions. In MCMC, poor proposal scales yield low acceptance or slow mixing; always diagnose convergence with multiple chains, R-hat, effective sample size, and trace plots. Double-counting data by using it both to set the prior and as the likelihood invalidates uncertainty estimates; separate prior elicitation from data fitting. Finally, interpreting credible intervals as frequentist confidence intervals is incorrect; a 95% credible interval means that, under the posterior, the parameter lies in the interval with 95% probability, which is a different statement from frequentist coverage.

Key Formulas

Bayes' Theorem

P(θ∣D) = P(D∣θ)P(θ) / P(D)

Explanation: The posterior equals likelihood times prior divided by the evidence. The evidence ensures the posterior integrates to 1.

Evidence (Marginal Likelihood)

P(D) = ∫ P(D∣θ) P(θ) dθ

Explanation: This integral normalizes the posterior and is used for model comparison. It averages the likelihood under the prior.

Posterior Predictive

P(x_new∣D) = ∫ P(x_new∣θ) P(θ∣D) dθ

Explanation: Predictions weight each parameter value by its posterior plausibility. This integrates parameter uncertainty into forecasts.

Beta-Binomial Posterior Hyperparameters

α′ = α + k,  β′ = β + (n − k)

Explanation: After observing k heads in n trials with Beta(α,β) prior, the posterior is Beta(α',β'). This update adds data counts to prior pseudo-counts.

Beta-Binomial Predictive

P(K=k∣m,D) = C(m,k) · B(α′+k, β′+m−k) / B(α′, β′)

Explanation: The number of future successes in m trials has a Beta-Binomial distribution. The Beta function B(⋅,⋅) arises from integrating out θ.

Normal-Normal Posterior (Known Variance)

μ_n = (μ0/τ0² + n·x̄/σ²) / (1/τ0² + n/σ²),  τ_n² = 1 / (1/τ0² + n/σ²)

Explanation: With a Normal prior and known observation variance, the posterior is Normal. The posterior mean is a precision-weighted average of prior mean and sample mean.

Normal-Normal Predictive

x_new∣D ∼ N(μ_n, σ² + τ_n²)

Explanation: The predictive variance is the sum of the noise variance and parameter uncertainty. As data grow, τ_n² shrinks and predictions become sharper.

Monte Carlo Estimator

E[θ∣D] = ∫ θ P(θ∣D) dθ ≈ (1/M) ∑_{m=1}^{M} θ^(m)

Explanation: Expectations under the posterior can be approximated by sample averages from MCMC draws. The approximation improves as M increases.

Law of Total Variance

Var(x_new∣D) = E[Var(x_new∣θ)] + Var(E[x_new∣θ])

Explanation: Predictive variance decomposes into expected noise plus variance due to parameter uncertainty. This explains why Bayesian predictions are often wider than plug-in estimates.

MAP vs. MLE

θ_MAP = argmax_θ (log P(D∣θ) + log P(θ))

Explanation: MAP maximizes the posterior, which is the sum of log-likelihood and log-prior. If the prior is flat, MAP reduces to MLE.

Log-Beta via Gamma

log B(a,b) = log Γ(a) + log Γ(b) − log Γ(a+b)

Explanation: Computing Beta functions in log-space improves numerical stability. The Gamma function generalizes factorials.

Complexity Analysis

Closed-form Bayesian updates with conjugate priors typically require O(n) time to process n data points and O(1) additional work to produce the final posterior once sufficient statistics are known. For example, the Beta-Binomial model only needs counts of successes and failures, and the Normal-Normal model with known variance only needs the sample mean (or sum) and count. Space complexity is O(1) because you store only hyperparameters and sufficient statistics rather than the raw data. Posterior predictive evaluations can be O(m) for m candidate outcomes (e.g., summing probabilities over k = 0..m) but remain independent of n once the posterior is formed.

In contrast, sampling-based methods like Metropolis-Hastings MCMC have complexity proportional to the number of iterations T. Each iteration often takes O(1) for univariate models or O(p)–O(p²) for p-dimensional parameters, depending on how the likelihood is computed and whether gradients/Hessians are used. The total cost to achieve a desired Monte Carlo standard error depends on mixing; poorly tuned proposals can require far more iterations (larger T) to obtain effectively independent samples. Space for MCMC is O(T) if you store all draws, or O(1) if you stream and keep only running summaries. Numerically stable implementations should compute in log-space to avoid underflow, with constant-time lgamma evaluations aiding Beta/Gamma-based models. Overall, conjugate models are extremely efficient; nonconjugate or high-dimensional models often rely on MCMC or variational methods with higher compute and memory demands.

Code Examples

Beta-Binomial posterior update and predictive distribution
#include <bits/stdc++.h>
using namespace std;

// Compute log Beta function using lgamma for numerical stability
double log_beta(double a, double b) {
    return lgamma(a) + lgamma(b) - lgamma(a + b);
}

// Compute log binomial coefficient C(n,k) using lgamma
double log_choose(int n, int k) {
    if (k < 0 || k > n) return -INFINITY;
    return lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Prior: Beta(alpha, beta)
    double alpha = 2.0; // prior pseudo-heads
    double beta = 2.0;  // prior pseudo-tails

    // Observed data: n trials, k successes (heads)
    int n = 20;
    int k = 14;

    // Posterior hyperparameters
    double alpha_post = alpha + k;
    double beta_post = beta + (n - k);

    // Posterior summaries
    double mean_post = alpha_post / (alpha_post + beta_post);
    double var_post = (alpha_post * beta_post) /
        ((alpha_post + beta_post) * (alpha_post + beta_post) * (alpha_post + beta_post + 1.0));

    // MAP exists only if alpha_post > 1 and beta_post > 1
    double map_post;
    if (alpha_post > 1.0 && beta_post > 1.0) {
        map_post = (alpha_post - 1.0) / (alpha_post + beta_post - 2.0);
    } else {
        map_post = mean_post; // fallback
    }

    // 95% normal-approximation credible interval for theta
    double sd_post = sqrt(var_post);
    double ci_low = max(0.0, mean_post - 1.96 * sd_post);
    double ci_high = min(1.0, mean_post + 1.96 * sd_post);

    cout << fixed << setprecision(6);
    cout << "Posterior Beta(" << alpha_post << ", " << beta_post << ")\n";
    cout << "Posterior mean: " << mean_post << "\n";
    cout << "Posterior MAP: " << map_post << "\n";
    cout << "Approx 95% CI: [" << ci_low << ", " << ci_high << "]\n\n";

    // Posterior predictive for m future trials: Beta-Binomial distribution
    int m = 5; // number of future trials
    cout << "Predictive P(K=k) for m=" << m << "\n";
    for (int kfuture = 0; kfuture <= m; ++kfuture) {
        // log p = log C(m,k) + log B(alpha'+k, beta'+m-k) - log B(alpha', beta')
        double logp = log_choose(m, kfuture)
                    + log_beta(alpha_post + kfuture, beta_post + (m - kfuture))
                    - log_beta(alpha_post, beta_post);
        cout << "k=" << kfuture << ": " << exp(logp) << "\n";
    }

    return 0;
}

This program performs Bayesian updating for a coin-bias problem with a Beta prior and Binomial likelihood. It computes the posterior Beta hyperparameters, the posterior mean and MAP, and an approximate 95% credible interval using the normal approximation of a Beta distribution. It also computes the Beta-Binomial predictive probabilities for k future successes in m trials using log-space computations with lgamma for numerical stability.

Time: O(m) after observing data (updates are O(1); the predictive enumerates k = 0..m). Space: O(1).
Normal-Normal posterior and predictive (known observation variance)
#include <bits/stdc++.h>
using namespace std;

struct PosteriorNormal {
    double mu_n;   // posterior mean
    double tau2_n; // posterior variance
};

PosteriorNormal update_normal_normal(double mu0, double tau2_0, double sigma2, const vector<double>& x) {
    int n = (int)x.size();
    double sumx = accumulate(x.begin(), x.end(), 0.0);
    double prec0 = 1.0 / tau2_0;  // prior precision
    double precx = n / sigma2;    // data precision
    double precn = prec0 + precx; // posterior precision
    double mu_n = (prec0 * mu0 + (sumx / sigma2)) / precn;
    double tau2_n = 1.0 / precn;
    return {mu_n, tau2_n};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Prior: theta ~ N(mu0, tau0^2)
    double mu0 = 0.0;
    double tau2_0 = 4.0; // prior variance (tau0^2)

    // Known observation variance sigma^2
    double sigma2 = 1.0;

    // Observed data
    vector<double> x = {0.2, -0.1, 0.4, 0.3, -0.2, 0.1};

    // Posterior update
    PosteriorNormal post = update_normal_normal(mu0, tau2_0, sigma2, x);

    // 95% credible interval (Normal posterior)
    double sd_post = sqrt(post.tau2_n);
    double ci_low = post.mu_n - 1.96 * sd_post;
    double ci_high = post.mu_n + 1.96 * sd_post;

    // Posterior predictive for a new x_new: Normal(mu_n, sigma2 + tau2_n)
    double pred_mean = post.mu_n;
    double pred_var = sigma2 + post.tau2_n;
    double pred_sd = sqrt(pred_var);

    cout << fixed << setprecision(6);
    cout << "Posterior mean: " << post.mu_n << "\n";
    cout << "Posterior var: " << post.tau2_n << "\n";
    cout << "95% CI: [" << ci_low << ", " << ci_high << "]\n";
    cout << "Predictive mean: " << pred_mean << "; sd = " << pred_sd << "\n";

    return 0;
}

This code updates a Normal prior with Normal likelihood (known observation variance) to obtain a Normal posterior. It computes the posterior mean and variance as precision-weighted combinations of the prior and data, gives a 95% credible interval, and reports the predictive distribution parameters for a future observation.

Time: O(n) to compute the sufficient statistic (sum of x). Space: O(1) additional space beyond the input data.
Metropolis-Hastings sampling for a Beta posterior (univariate MCMC)
#include <bits/stdc++.h>
using namespace std;

// Log posterior for Beta(alpha', beta') (up to an additive constant)
double log_posterior_beta(double theta, double a, double b) {
    if (theta <= 0.0 || theta >= 1.0) return -INFINITY;
    return (a - 1.0) * log(theta) + (b - 1.0) * log1p(-theta); // log1p(-theta) = log(1 - theta)
}

// Reflect a point into the (0,1) interval for random-walk proposals
double reflect01(double x) {
    // Handle multiple crossings if the step is large
    while (x < 0.0 || x > 1.0) {
        if (x < 0.0) x = -x;      // reflect below 0
        if (x > 1.0) x = 2.0 - x; // reflect above 1
    }
    // Avoid exactly 0 or 1, which are invalid for log
    const double eps = 1e-12;
    if (x <= 0.0) x = eps;
    if (x >= 1.0) x = 1.0 - eps;
    return x;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Data: k successes out of n trials with Beta(alpha, beta) prior
    int n = 50, k = 32;
    double alpha = 2.0, beta = 2.0; // prior
    double a_post = alpha + k;      // posterior alpha'
    double b_post = beta + (n - k); // posterior beta'

    // MCMC settings
    int burn_in = 5000;
    int iters = 50000;
    int thin = 10;
    double step_sd = 0.05; // proposal std dev for random walk in theta-space

    mt19937_64 rng(random_device{}());
    normal_distribution<double> norm(0.0, step_sd);
    uniform_real_distribution<double> unif(0.0, 1.0);

    // Initialize at the posterior mean
    double theta = a_post / (a_post + b_post);
    vector<double> samples;
    samples.reserve((iters - burn_in) / max(1, thin));

    int accepted = 0;
    for (int t = 0; t < iters; ++t) {
        double prop = reflect01(theta + norm(rng));
        double logp_cur = log_posterior_beta(theta, a_post, b_post);
        double logp_prop = log_posterior_beta(prop, a_post, b_post);
        double loga = logp_prop - logp_cur; // symmetric proposal -> Hastings ratio cancels
        if (loga >= 0.0 || log(unif(rng)) < loga) {
            theta = prop;
            if (t >= burn_in) ++accepted;
        }
        if (t >= burn_in && ((t - burn_in) % thin == 0)) {
            samples.push_back(theta);
        }
    }

    // Posterior mean estimate from samples
    double mean_est = accumulate(samples.begin(), samples.end(), 0.0) / samples.size();

    cout << fixed << setprecision(6);
    cout << "Samples kept: " << samples.size() << "\n";
    cout << "Acceptance rate (post-burn-in): "
         << (double)accepted / max(1, iters - burn_in) << "\n";
    cout << "Posterior mean (MC): " << mean_est << "\n";
    cout << "Posterior mean (analytic): " << a_post / (a_post + b_post) << "\n";

    return 0;
}

This example samples from a univariate Beta posterior using a Metropolis-Hastings random-walk proposal. It demonstrates how to construct a log-posterior, handle parameter constraints with reflection into (0,1), run burn-in and thinning, and estimate posterior means from samples. The Monte Carlo estimate is compared with the analytic posterior mean for validation.

Time: O(T) for T iterations; each step is O(1). Space: O(T/thin) to store samples, or O(1) if only streaming summaries are kept.
#bayesian inference#posterior#prior#likelihood#evidence#conjugate prior#posterior predictive#mcmc#metropolis-hastings#map estimate#beta-binomial#normal-normal#credible interval#marginal likelihood#monte carlo