Importance Sampling
Key Points
- Importance sampling rewrites an expectation under a hard-to-sample distribution p as an expectation under an easier distribution q, multiplied by a weight w = p/q.
- It is most useful when estimating rare events or handling unnormalized targets, where plain Monte Carlo would be very inefficient.
- Good proposal design is critical: q must cover the support of p and place mass where f(x)p(x) is large to reduce variance.
- Self-normalized importance sampling handles unnormalized targets by normalizing the weights and yields a slightly biased but consistent estimator.
- Weights can be extremely variable; log-weights and numerical stabilization (e.g., log-sum-exp) are essential in C++ implementations.
- Effective Sample Size (ESS) diagnoses weight degeneracy; resampling can convert weighted samples into approximately unweighted ones.
- Time complexity is O(N) per estimate with N samples, but statistical efficiency (variance) depends heavily on the choice of q.
- A poor proposal can give infinite-variance estimators, producing wildly unstable results even with many samples.
Prerequisites
- Probability density and expectation — Importance sampling rewrites expectations under one distribution as weighted expectations under another.
- Monte Carlo estimation and Law of Large Numbers — IS builds on Monte Carlo averages and their convergence properties.
- Variance and Central Limit Theorem — Understanding estimator variance and asymptotic normality is key for accuracy and confidence intervals.
- Common distributions (Normal, Uniform) and their PDFs — Implementations require evaluating p(x) and q(x) accurately.
- Log-space computations and log-sum-exp — Weights can underflow/overflow; stable log computations are essential in C++.
- C++ random number generation — Drawing independent samples from proposal distributions is the backbone of IS.
- Mixture distributions — Many practical proposals and targets are mixtures; understanding them aids design.
- KL divergence and proposal tuning — Designing good proposals often involves minimizing divergence from an ideal target.
Detailed Explanation
01 Overview
Importance sampling is a Monte Carlo technique that estimates expectations with respect to a difficult target distribution p by sampling from a more convenient proposal distribution q and correcting with importance weights. If we want E_p[f(X)], but sampling from p is hard or the event of interest is rare, we instead draw samples X_i from q and compute a weighted average using weights w(X_i) = p(X_i)/q(X_i). This turns the original expectation into E_q[f(X)w(X)], which we can approximate with Monte Carlo. The method is simple, flexible, and broadly applicable: it works for continuous or discrete distributions, normalized or unnormalized targets (via self-normalized importance sampling), and in high-dimensional problems when carefully designed proposals are used. Its power lies in variance reduction: by choosing q to sample more often where f(x)p(x) is large, we get accurate estimates with far fewer samples than crude Monte Carlo. However, importance sampling is not magic—bad proposal choices can lead to enormous variance or even divergence. Practical implementations therefore use diagnostic tools (e.g., ESS), numerically stable log-weight computations, and, in sequential settings, resampling to combat weight degeneracy.
02 Intuition & Analogies
Imagine you are trying to estimate how many rare birds are in a huge forest. If you wander randomly (crude Monte Carlo), you will almost never see one, and your estimate will be noisy. A smarter strategy is to spend more time in places where birds are likely to be—near rivers and fruit trees—and then correct for your biased search by down-weighting common sightings and up-weighting rare ones. Importance sampling does exactly that for probabilities and expectations: it searches more in the "interesting" parts of the space while keeping the results unbiased (or nearly unbiased) through weights. Another everyday analogy: suppose you want to know the average price of a specific rare collectible across all shops in a city. Visiting random shops is inefficient because most don’t carry it. Instead, you visit specialty stores (proposal q) more often, record prices, and then account for the fact that you oversampled specialty stores by weighting each observation inversely by how likely your strategy would pick that shop relative to the city-wide distribution (target p). The hopeful payoff is lower variability: you replace many useless samples with fewer, more informative ones. But there’s a catch: if you only visit one neighborhood (q has limited support), you may miss important shops entirely (regions where p > 0 but q = 0), making your estimate invalid. Even if you visit the right areas, if you visit them with wildly wrong frequencies, your weights can explode, and your average becomes dominated by a few samples, making it noisy. Good strategy (proposal design) is everything.
03 Formal Definition
Let p be the target density and q a proposal density with q(x) > 0 whenever f(x)p(x) != 0 (absolute continuity). Then E_p[f(X)] = integral of f(x)p(x) dx = integral of f(x) (p(x)/q(x)) q(x) dx = E_q[f(X)w(X)], where w(x) = p(x)/q(x). Given i.i.d. draws X_1, ..., X_N from q, the importance sampling estimator is mu_IS = (1/N) Σ f(X_i) w(X_i), which is unbiased and converges almost surely by the law of large numbers. When p is known only up to a normalizing constant, p = p~/Z with Z unknown, the self-normalized estimator Σ w~(X_i) f(X_i) / Σ w~(X_i) with w~ = p~/q remains consistent, because the unknown Z cancels in the ratio.
04 When to Use
Use importance sampling when direct sampling from p is hard, naïve Monte Carlo is inefficient, or the target is known only up to a constant. Typical scenarios include: (1) Rare-event probability estimation (e.g., P(X > a) for large a in a normal tail), where shifting or tilting q places more mass in the tail. (2) Bayesian inference with intractable normalizing constants, where the posterior is known only up to proportionality; self-normalized importance sampling approximates posterior expectations. (3) Option pricing and risk estimation in quantitative finance, where payoffs concentrate in regions poorly explored by standard draws. (4) Off-policy evaluation in reinforcement learning, where data come from a behavior policy q but we need expectations under a target policy p. (5) Rendering in computer graphics (light transport), using multiple importance sampling to reduce noise by combining several proposals. (6) Particle filters in robotics and tracking, where importance weights update a set of particles as new observations arrive, often with resampling to combat degeneracy. It is less suitable when high-dimensional targets are sharply concentrated along complex manifolds unless q is very well matched; in such cases, MCMC or specialized proposals may be preferable.
⚠️ Common Mistakes
- Support mismatch: using a proposal q that is zero where p is positive (or where f(x)p(x) is nonzero). This violates absolute continuity and makes the estimator undefined; always ensure q(x) > 0 whenever f(x)p(x) > 0.
- Exploding variance: choosing q far from the optimal shape causes a few samples to carry huge weights, leading to unstable results. Monitor weight variance and Effective Sample Size (ESS), and redesign q (e.g., shift, scale, or use mixtures).
- Ignoring normalization: when p is known only up to a constant, using the plain (non-normalized) estimator is wrong; use self-normalized importance sampling.
- Numerical underflow/overflow: directly computing p(x)/q(x) can underflow for tails; compute log-weights and apply the log-sum-exp trick before exponentiating.
- Misinterpreting bias: the self-normalized estimator is slightly biased (O(1/N)), so standard error formulas for unbiased estimators do not apply directly; use appropriate variance estimators or the delta method.
- Too few samples for rare events: even with a shifted proposal, very small probabilities may still need many samples; cross-check with variance estimates.
- Overfitting the proposal on the same samples (double dipping): design or tune q using separate pilot runs or adaptive schemes with care to preserve validity.
- Forgetting to check ESS and not resampling in sequential settings, letting a single particle dominate; use resampling when ESS is low.
Key Formulas
Importance Sampling Identity
E_p[f(X)] = E_q[f(X) w(X)], where w(x) = p(x)/q(x)
Explanation: This is the core equality enabling importance sampling. It lets us replace sampling from p with sampling from q and reweighting by p/q, provided q(x) > 0 wherever f(x)p(x) != 0.
IS Monte Carlo Estimator
mu_IS = (1/N) Σ_{i=1}^{N} f(X_i) w(X_i), with X_1, ..., X_N drawn i.i.d. from q
Explanation: The sample average of f(X) times the weight w(X) estimates the desired expectation. It is unbiased whenever the expectation exists; a finite second moment under q additionally guarantees finite variance.
IS Variance
Var(mu_IS) = (1/N) ( E_q[(f(X) w(X))^2] - (E_p[f(X)])^2 )
Explanation: The variance of the IS estimator decreases like 1/N, but the constant depends on how well q matches |f|p. Poor matches inflate this variance.
Self-Normalized IS
mu_SNIS = ( Σ_i w(X_i) f(X_i) ) / ( Σ_i w(X_i) ), where the weights may use an unnormalized target p~
Explanation: Used when p is known only up to a constant. The weights are normalized by their sum, yielding a consistent but slightly biased estimator.
Optimal Proposal
q*(x) = |f(x)| p(x) / ∫ |f(t)| p(t) dt
Explanation: When f does not change sign, sampling from this distribution gives a zero-variance estimator; in general it minimizes the variance. Practically, it guides proposal design toward regions where |f|p is large.
Effective Sample Size
ESS = ( Σ_i w_i )^2 / Σ_i w_i^2
Explanation: ESS estimates how many equally weighted samples the weighted set is worth. Small ESS indicates severe weight degeneracy and high variance.
Log-Sum-Exp Trick
log Σ_i exp(a_i) = m + log Σ_i exp(a_i - m), with m = max_i a_i
Explanation: This identity prevents numerical overflow/underflow when summing exponentials. It is crucial for stable log-weight computations.
CLT for IS
sqrt(N) (mu_IS - E_p[f(X)]) converges in distribution to N(0, sigma^2), where sigma^2 = Var_q(f(X) w(X))
Explanation: Under finite variance, the IS estimator is asymptotically normal, enabling confidence intervals based on estimated variance.
Log-Weight Stabilization
w_i = exp(log w_i - max_j log w_j), up to a common factor that cancels after normalization
Explanation: Computing weights in log-space avoids catastrophic underflow in tails. Subtracting a common constant from log-weights further stabilizes exponentiation.
Self-Normalized Bias
bias(mu_SNIS) = O(1/N)
Explanation: The self-normalized estimator has a small bias decreasing as 1/N, while remaining consistent. This matters for precise uncertainty quantification.
Complexity Analysis
With N samples, one pass draws all samples, evaluates the densities, and accumulates the estimate in O(N) time; storing samples and weights for ESS or resampling takes O(N) memory, while a pure running-average estimator needs only O(1). Systematic resampling adds a further O(N) pass. The dominant cost in practice is statistical rather than computational: the variance constant depends on how well q matches |f|p, so a poor proposal can demand vastly more samples despite identical per-sample cost.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Log-density of Normal(mu, sigma)
double log_normal_pdf(double x, double mu, double sigma) {
    static const double LOG_SQRT_2PI = 0.5 * log(2.0 * M_PI);
    double z = (x - mu) / sigma;
    return -LOG_SQRT_2PI - log(sigma) - 0.5 * z * z;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    const int N = 200000; // number of importance samples

    // Target: p = N(0,1); Proposal: q = N(5,1) (shifted to the right tail)
    double p_mu = 0.0, p_sigma = 1.0;
    double q_mu = 5.0, q_sigma = 1.0;

    std::mt19937 rng((std::random_device())());
    std::normal_distribution<double> q_dist(q_mu, q_sigma);

    // IS estimator for P(Z > 5) with Z ~ N(0,1)
    // f(x) = 1_{x > 5}
    double sum_fw = 0.0;
    vector<double> contrib; contrib.reserve(N);

    for (int i = 0; i < N; ++i) {
        double x = q_dist(rng); // sample from proposal q
        double logp = log_normal_pdf(x, p_mu, p_sigma);
        double logq = log_normal_pdf(x, q_mu, q_sigma);
        double w = exp(logp - logq); // importance weight (stable via logs)
        double f = (x > 5.0) ? 1.0 : 0.0;
        double term = f * w;
        sum_fw += term;
        contrib.push_back(term);
    }

    double is_estimate = sum_fw / N;

    // Estimate standard error using sample variance of contributions
    double mean = is_estimate;
    double sq_sum = 0.0;
    for (double t : contrib) sq_sum += (t - mean) * (t - mean);
    double sample_var = sq_sum / (N - 1);
    double std_err = sqrt(sample_var / N); // "stderr" would clash with the C stdio macro

    cout << scientific << setprecision(6);
    cout << "IS estimate P(Z>5): " << is_estimate << "\n";
    cout << "Std. error (approx): " << std_err << "\n";

    // For reference: true probability ~ 2.867e-7, computed via the
    // complementary error function
    double true_prob = 0.5 * erfc(5.0 / sqrt(2.0));
    cout << "True P(Z>5): " << true_prob << "\n";

    return 0;
}
We estimate a tiny Gaussian tail probability by sampling from a shifted normal proposal that places most samples in the tail. Each sample is reweighted by p(x)/q(x), computed in log-space for numerical stability. The estimator averages f(x)w(x), where f is the indicator of x > 5, and we report an approximate standard error.
#include <bits/stdc++.h>
using namespace std;

// Log of unnormalized target: \tilde{p}(x) = exp(-x^2/2) + exp(-(x-4)^2/2)
// Uses log-sum-exp for stability.
double log_unnorm_p(double x) {
    double a = -0.5 * x * x;                 // log term 1
    double b = -0.5 * (x - 4.0) * (x - 4.0); // log term 2
    double m = max(a, b);
    return m + log(exp(a - m) + exp(b - m));
}

// Log-density of Normal(mu, sigma)
double log_normal_pdf(double x, double mu, double sigma) {
    static const double LOG_SQRT_2PI = 0.5 * log(2.0 * M_PI);
    double z = (x - mu) / sigma;
    return -LOG_SQRT_2PI - log(sigma) - 0.5 * z * z;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    const int N = 150000; // number of samples

    // Proposal q = Normal(mean=2, sd=3) to cover both modes
    double q_mu = 2.0, q_sigma = 3.0;
    std::mt19937 rng((std::random_device())());
    std::normal_distribution<double> q_dist(q_mu, q_sigma);

    vector<double> xs;   xs.reserve(N);
    vector<double> logw; logw.reserve(N);

    // Draw samples and compute log-weights: logw_i = log \tilde{p}(x_i) - log q(x_i)
    for (int i = 0; i < N; ++i) {
        double x = q_dist(rng);
        double lw = log_unnorm_p(x) - log_normal_pdf(x, q_mu, q_sigma);
        xs.push_back(x);
        logw.push_back(lw);
    }

    // Stabilize weights: subtract max log-weight before exponentiation
    double maxlw = *max_element(logw.begin(), logw.end());
    vector<double> w(N);
    double sumw = 0.0;
    for (int i = 0; i < N; ++i) {
        w[i] = exp(logw[i] - maxlw);
        sumw += w[i];
    }

    // Self-normalized estimator of E[X]
    double num = 0.0;
    for (int i = 0; i < N; ++i) num += w[i] * xs[i];
    double est_EX = num / sumw;

    // Effective Sample Size (ESS)
    double sumw2 = 0.0;
    for (double wi : w) sumw2 += wi * wi;
    double ESS = (sumw * sumw) / sumw2;

    cout << fixed << setprecision(6);
    cout << "Self-normalized IS estimate of E[X]: " << est_EX << "\n";
    cout << "Sum of weights (unnormalized): " << sumw << "\n";
    cout << "Effective Sample Size (ESS): " << ESS << " / " << N << "\n";

    return 0;
}
We cannot easily sample from the bimodal target, known only up to proportionality. We draw samples from a wide normal proposal, compute self-normalized importance weights in log-space, and estimate E[X]. We also report ESS to diagnose weight concentration.
#include <bits/stdc++.h>
using namespace std;

// Systematic resampling: input normalized weights (sum to 1); returns N indices
vector<int> systematic_resample(const vector<double>& w, std::mt19937& rng) {
    int N = (int)w.size();
    vector<int> idx(N);
    vector<double> cdf(N);
    partial_sum(w.begin(), w.end(), cdf.begin());
    std::uniform_real_distribution<double> U(0.0, 1.0 / N);
    double u0 = U(rng);
    int i = 0;
    for (int j = 0; j < N; ++j) {
        double uj = u0 + (double)j / N;
        while (i < N - 1 && uj > cdf[i]) ++i;
        idx[j] = i;
    }
    return idx;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    const int N = 50000;

    // Target: Beta(3,2) on [0,1]; up to constant, \tilde{p}(x) = x^{2} (1-x)^{1}
    // Proposal: Uniform(0,1); q(x) = 1 on [0,1]
    std::mt19937 rng((std::random_device())());
    std::uniform_real_distribution<double> U01(0.0, 1.0);

    vector<double> x(N), w(N);
    for (int i = 0; i < N; ++i) {
        x[i] = U01(rng);
        // Unnormalized weight proportional to target / proposal = x^2 (1-x)
        w[i] = x[i] * x[i] * (1.0 - x[i]);
    }

    // Normalize weights
    double sumw = accumulate(w.begin(), w.end(), 0.0);
    for (double &wi : w) wi /= sumw;

    // Weighted estimator of E[X]
    double EX_weighted = 0.0;
    for (int i = 0; i < N; ++i) EX_weighted += w[i] * x[i];

    // ESS
    double sumw2 = 0.0;
    for (double wi : w) sumw2 += wi * wi;
    double ESS = 1.0 / sumw2;

    // Resample to get approximately unweighted samples from target
    vector<int> idx = systematic_resample(w, rng);
    vector<double> x_resamp(N);
    for (int i = 0; i < N; ++i) x_resamp[i] = x[idx[i]];

    // Unweighted estimator from resampled particles
    double EX_resamp = accumulate(x_resamp.begin(), x_resamp.end(), 0.0) / N;

    // True E[X] for Beta(3,2) is alpha/(alpha+beta) = 3/5 = 0.6
    cout << fixed << setprecision(6);
    cout << "Weighted E[X]: " << EX_weighted << "\n";
    cout << "Resampled E[X]: " << EX_resamp << "\n";
    cout << "ESS (out of N): " << ESS << " / " << N << "\n";
    cout << "True E[X]: 0.600000\n";

    return 0;
}
We target Beta(3,2) using a Uniform proposal on [0,1]. We compute normalized importance weights proportional to x^2(1-x), estimate E[X] with a weighted average, and then perform systematic resampling to obtain an unweighted sample set. Both estimates should be close to the true mean 0.6, and ESS quantifies weight concentration.