How I Study AI - Learn AI Papers & Lectures the Easy Way
⚙️ Algorithm · Intermediate

Importance Sampling

Key Points

  • Importance sampling rewrites an expectation under a hard-to-sample distribution p as an expectation under an easier distribution q, multiplied by a weight w = p/q.
  • It is most useful when estimating rare events or handling unnormalized targets, where plain Monte Carlo would be very inefficient.
  • Good proposal design is critical: q must cover the support of p and place mass where f(x)p(x) is large to reduce variance.
  • Self-normalized importance sampling handles unnormalized targets by normalizing the weights and yields a slightly biased but consistent estimator.
  • Weights can be extremely variable; log-weights and numerical stabilization (e.g., log-sum-exp) are essential in C++ implementations.
  • Effective Sample Size (ESS) diagnoses weight degeneracy; resampling can convert weighted samples into approximately unweighted ones.
  • Time complexity is O(N) per estimate with N samples, but statistical efficiency (variance) depends heavily on the choice of q.
  • A poor proposal can give infinite-variance estimators, producing wildly unstable results even with many samples.

Prerequisites

  • Probability density and expectation — Importance sampling rewrites expectations under one distribution as weighted expectations under another.
  • Monte Carlo estimation and Law of Large Numbers — IS builds on Monte Carlo averages and their convergence properties.
  • Variance and Central Limit Theorem — Understanding estimator variance and asymptotic normality is key for accuracy and confidence intervals.
  • Common distributions (Normal, Uniform) and their PDFs — Implementations require evaluating p(x) and q(x) accurately.
  • Log-space computations and log-sum-exp — Weights can underflow/overflow; stable log computations are essential in C++.
  • C++ random number generation — Drawing independent samples from proposal distributions is the backbone of IS.
  • Mixture distributions — Many practical proposals and targets are mixtures; understanding them aids design.
  • KL divergence and proposal tuning — Designing good proposals often involves minimizing divergence from an ideal target.

Detailed Explanation


01 Overview

Importance sampling is a Monte Carlo technique that estimates expectations with respect to a difficult target distribution p by sampling from a more convenient proposal distribution q and correcting with importance weights. If we want E_p[f(X)], but sampling from p is hard or the event of interest is rare, we instead draw samples X_i from q and compute a weighted average using weights w(X_i) = p(X_i)/q(X_i). This turns the original expectation into E_q[f(X)w(X)], which we can approximate with Monte Carlo. The method is simple, flexible, and broadly applicable: it works for continuous or discrete distributions, normalized or unnormalized targets (via self-normalized importance sampling), and in high-dimensional problems when carefully designed proposals are used. Its power lies in variance reduction: by choosing q to sample more often where f(x)p(x) is large, we get accurate estimates with far fewer samples than crude Monte Carlo. However, importance sampling is not magic—bad proposal choices can lead to enormous variance or even divergence. Practical implementations therefore use diagnostic tools (e.g., ESS), numerically stable log-weight computations, and, in sequential settings, resampling to combat weight degeneracy.

02 Intuition & Analogies

Imagine you are trying to estimate how many rare birds are in a huge forest. If you wander randomly (crude Monte Carlo), you will almost never see one, and your estimate will be noisy. A smarter strategy is to spend more time in places where birds are likely to be—near rivers and fruit trees—and then correct for your biased search by down-weighting common sightings and up-weighting rare ones. Importance sampling does exactly that for probabilities and expectations: it searches more in the "interesting" parts of the space while keeping the results unbiased (or nearly unbiased) through weights. Another everyday analogy: suppose you want to know the average price of a specific rare collectible across all shops in a city. Visiting random shops is inefficient because most don’t carry it. Instead, you visit specialty stores (proposal q) more often, record prices, and then account for the fact that you oversampled specialty stores by weighting each observation inversely by how likely your strategy would pick that shop relative to the city-wide distribution (target p). The hopeful payoff is lower variability: you replace many useless samples with fewer, more informative ones. But there’s a catch: if you only visit one neighborhood (q has limited support), you may miss important shops entirely (regions where p > 0 but q = 0), making your estimate invalid. Even if you visit the right areas, if you visit them with wildly wrong frequencies, your weights can explode, and your average becomes dominated by a few samples, making it noisy. Good strategy (proposal design) is everything.

03 Formal Definition

Let p be a target distribution on a measurable space with density p(x) (or probability mass function), and let f be an integrable function. For any proposal distribution q with density q(x) satisfying absolute continuity (i.e., p(x) > 0 implies q(x) > 0), we have the identity E_p[f(X)] = E_q[f(X) w(X)], where w(x) = p(x)/q(x). The Monte Carlo estimator with N i.i.d. draws X_1, ..., X_N ~ q is μ̂_IS = (1/N) Σ_{i=1}^N f(X_i) w(X_i), which is unbiased when E_q[|f(X) w(X)|] < ∞. Its variance is Var(μ̂_IS) = (1/N) Var_q(f(X) w(X)). When p is known only up to a normalizing constant, p(x) = p̃(x)/Z, we use self-normalized importance sampling: μ̂_SN = [Σ_{i=1}^N f(X_i) w̃_i] / [Σ_{i=1}^N w̃_i], with w̃_i = p̃(X_i)/q(X_i). This estimator is consistent but slightly biased, with bias O(1/N) under regularity conditions. An optimal proposal that minimizes variance for a fixed N (when it exists) is q*(x) ∝ |f(x)| p(x), yielding zero variance if we could sample from it exactly. In practice, q is chosen from a tractable family (e.g., Gaussian, mixture) to approximate q* while maintaining support coverage and computational efficiency.

04 When to Use

Use importance sampling when direct sampling from p is hard, naïve Monte Carlo is inefficient, or the target is known only up to a constant. Typical scenarios include: (1) Rare-event probability estimation (e.g., P(X > a) for large a in a normal tail), where shifting or tilting q places more mass in the tail. (2) Bayesian inference with intractable normalizing constants, where the posterior is known only up to proportionality; self-normalized importance sampling approximates posterior expectations. (3) Option pricing and risk estimation in quantitative finance, where payoffs concentrate in regions poorly explored by standard draws. (4) Off-policy evaluation in reinforcement learning, where data come from a behavior policy q but we need expectations under a target policy p. (5) Rendering in computer graphics (light transport), using multiple importance sampling to reduce noise by combining several proposals. (6) Particle filters in robotics and tracking, where importance weights update a set of particles as new observations arrive, often with resampling to combat degeneracy. It is less suitable when high-dimensional targets are sharply concentrated along complex manifolds unless q is very well matched; in such cases, MCMC or specialized proposals may be preferable.

⚠️Common Mistakes

• Support mismatch: using a proposal q that is zero where p is positive (or where f(x)p(x) is nonzero). This violates absolute continuity and makes the estimator undefined; always ensure q(x) > 0 whenever f(x)p(x) > 0.
• Exploding variance: choosing q far from the optimal shape causes a few samples to carry huge weights, leading to unstable results. Monitor weight variance and Effective Sample Size (ESS), and redesign q (e.g., shift, scale, or use mixtures).
• Ignoring normalization: when p is known only up to a constant, using the plain (non-normalized) estimator is wrong; use self-normalized importance sampling.
• Numerical underflow/overflow: directly computing p(x)/q(x) can underflow in the tails; compute log-weights and apply the log-sum-exp trick before exponentiating.
• Misinterpreting bias: the self-normalized estimator is slightly biased (O(1/N)), so standard error formulas for unbiased estimators do not apply directly; use appropriate variance estimators or the delta method.
• Too few samples for rare events: even with a shifted proposal, very small probabilities may still need many samples; cross-check with variance estimates.
• Overfitting the proposal on the same samples (double dipping): design or tune q using separate pilot runs, or use adaptive schemes with care to preserve validity.
• Forgetting to check ESS and not resampling in sequential settings lets a single particle dominate; resample when ESS is low.

Key Formulas

Importance Sampling Identity

E_p[f(X)] = E_q[f(X) w(X)],  w(x) = p(x)/q(x)

Explanation: This is the core equality enabling importance sampling. It lets us replace sampling from p with sampling from q and reweighting by p/q.

IS Monte Carlo Estimator

μ̂_IS = (1/N) Σ_{i=1}^N f(X_i) w(X_i),  X_i ~ iid q

Explanation: The sample average of f(X) times the weight w(X) estimates the desired expectation. It is unbiased if the second moment under q is finite.

IS Variance

Var(μ̂_IS) = (1/N) Var_q(f(X) w(X))

Explanation: The variance of the IS estimator decreases like 1/N, but the constant depends on how well q matches |f|p. Poor matches inflate this variance.

Self-Normalized IS

μ̂_SN = [Σ_{i=1}^N f(X_i) w̃_i] / [Σ_{i=1}^N w̃_i],  w̃_i = p̃(X_i)/q(X_i)

Explanation: Used when p is known only up to a constant. The weights are normalized by their sum, yielding a consistent but slightly biased estimator.

Optimal Proposal

q*(x) ∝ |f(x)| p(x)

Explanation: In theory, sampling from this distribution gives zero-variance estimates. Practically, it guides proposal design toward regions where |f|p is large.

Effective Sample Size

N_eff = (Σ_{i=1}^N w_i)² / (Σ_{i=1}^N w_i²) = 1 / Σ_{i=1}^N w̄_i²

Explanation: ESS estimates how many equally weighted samples the weighted set is worth. Small ESS indicates severe weight degeneracy and high variance.

Log-Sum-Exp Trick

log(Σ_{i=1}^k e^{a_i}) = m + log(Σ_{i=1}^k e^{a_i − m}),  m = max_i a_i

Explanation: This identity prevents numerical overflow/underflow when summing exponentials. It is crucial for stable log-weight computations.

CLT for IS

√N (μ̂_IS − μ) →_d N(0, Var_q(f(X) w(X)))

Explanation: Under finite variance, the IS estimator is asymptotically normal, enabling confidence intervals based on estimated variance.

Log-Weight Stabilization

w(x) = exp(log p(x) − log q(x))

Explanation: Computing weights in log-space avoids catastrophic underflow in tails. Subtracting a common constant from log-weights further stabilizes exponentiation.

Self-Normalized Bias

Bias(μ̂_SN) = O(1/N)

Explanation: The self-normalized estimator has a small bias decreasing as 1/N, while remaining consistent. This matters for precise uncertainty quantification.

Complexity Analysis

Let N be the number of samples drawn from the proposal q. The core importance sampling loop evaluates f(x_i), p(x_i), and q(x_i) for each sample x_i, then accumulates a weighted sum. Assuming O(1) time to generate a random sample and O(1) time to evaluate densities and f, the total time is O(N). In practice, the dominant costs are random number generation and log-density evaluations; if f or the densities are expensive (e.g., involving linear algebra in high dimensions), per-sample cost may be higher but still scales linearly with N.

Memory usage can be O(1) if we stream through samples and only maintain running sums, which is typical for the unbiased IS estimator. For self-normalized IS or diagnostics (e.g., ESS, resampling), we often store all N weights and samples, raising memory to O(N). Computing ESS requires O(N) additional time for sums of weights and squares. If we perform resampling (e.g., systematic resampling), it is also O(N) time and O(N) space.

The algorithmic complexity does not reflect statistical efficiency. A poor proposal can inflate the variance Var_q(f(X)w(X)) so much that many more samples are needed to achieve a desired error tolerance. Conversely, a well-designed proposal can reduce variance dramatically, lowering the necessary N and thus the actual runtime to reach a target accuracy. Numerically stable implementations that use log-weights add small constant overhead (a few arithmetic and transcendental operations) but are essential to prevent underflow in tail regions. Parallelization is straightforward because samples are independent; multi-threading or SIMD can reduce wall-clock time without changing asymptotic complexity.

Code Examples

Rare-event probability via importance sampling: P(Z > 5) for Z ~ N(0,1)
#include <bits/stdc++.h>
using namespace std;

// Log-density of Normal(mu, sigma)
double log_normal_pdf(double x, double mu, double sigma) {
    static const double LOG_SQRT_2PI = 0.5 * log(2.0 * M_PI);
    double z = (x - mu) / sigma;
    return -LOG_SQRT_2PI - log(sigma) - 0.5 * z * z;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    const int N = 200000; // number of importance samples

    // Target: p = N(0,1); Proposal: q = N(5,1) (shifted to the right tail)
    double p_mu = 0.0, p_sigma = 1.0;
    double q_mu = 5.0, q_sigma = 1.0;

    std::mt19937 rng((std::random_device())());
    std::normal_distribution<double> q_dist(q_mu, q_sigma);

    // IS estimator for P(Z > 5) with Z ~ N(0,1); f(x) = 1_{x > 5}
    double sum_fw = 0.0;
    vector<double> contrib; contrib.reserve(N);

    for (int i = 0; i < N; ++i) {
        double x = q_dist(rng); // sample from proposal q
        double logp = log_normal_pdf(x, p_mu, p_sigma);
        double logq = log_normal_pdf(x, q_mu, q_sigma);
        double w = exp(logp - logq); // importance weight (stable via logs)
        double f = (x > 5.0) ? 1.0 : 0.0;
        double term = f * w;
        sum_fw += term;
        contrib.push_back(term);
    }

    double is_estimate = sum_fw / N;

    // Estimate standard error using sample variance of contributions.
    // (Named std_err because `stderr` is a macro in <cstdio> and cannot
    // be used as an identifier.)
    double mean = is_estimate;
    double sq_sum = 0.0;
    for (double t : contrib) sq_sum += (t - mean) * (t - mean);
    double sample_var = sq_sum / (N - 1);
    double std_err = sqrt(sample_var / N);

    cout.setf(std::ios::scientific); cout << setprecision(6);
    cout << "IS estimate P(Z>5): " << is_estimate << "\n";
    cout << "Std. error (approx): " << std_err << "\n";

    // For reference: true probability ~ 2.867e-7, computed to high
    // precision via the complementary error function.
    double true_prob = 0.5 * erfc(5.0 / sqrt(2.0));
    cout << "True P(Z>5): " << true_prob << "\n";

    return 0;
}

We estimate a tiny Gaussian tail probability by sampling from a shifted normal proposal that places most samples in the tail. Each sample is reweighted by p(x)/q(x), computed in log-space for numerical stability. The estimator averages f(x)w(x), where f is the indicator of x > 5, and we report an approximate standard error.

Time: O(N) · Space: O(N) to store contributions for variance; O(1) if streaming without error bars
Self-normalized importance sampling for an unnormalized bimodal target: estimate E[X]
#include <bits/stdc++.h>
using namespace std;

// Log of unnormalized target: \tilde{p}(x) = exp(-x^2/2) + exp(-(x-4)^2/2)
// Uses log-sum-exp for stability.
double log_unnorm_p(double x) {
    double a = -0.5 * x * x;                 // log term 1
    double b = -0.5 * (x - 4.0) * (x - 4.0); // log term 2
    double m = max(a, b);
    return m + log(exp(a - m) + exp(b - m));
}

// Log-density of Normal(mu, sigma)
double log_normal_pdf(double x, double mu, double sigma) {
    static const double LOG_SQRT_2PI = 0.5 * log(2.0 * M_PI);
    double z = (x - mu) / sigma;
    return -LOG_SQRT_2PI - log(sigma) - 0.5 * z * z;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    const int N = 150000; // number of samples

    // Proposal q = Normal(mean=2, sd=3) to cover both modes
    double q_mu = 2.0, q_sigma = 3.0;
    std::mt19937 rng((std::random_device())());
    std::normal_distribution<double> q_dist(q_mu, q_sigma);

    vector<double> xs; xs.reserve(N);
    vector<double> logw; logw.reserve(N);

    // Draw samples and compute log-weights: logw_i = log \tilde{p}(x_i) - log q(x_i)
    for (int i = 0; i < N; ++i) {
        double x = q_dist(rng);
        double lw = log_unnorm_p(x) - log_normal_pdf(x, q_mu, q_sigma);
        xs.push_back(x);
        logw.push_back(lw);
    }

    // Stabilize weights: subtract max log-weight before exponentiation
    double maxlw = *max_element(logw.begin(), logw.end());
    vector<double> w(N);
    double sumw = 0.0;
    for (int i = 0; i < N; ++i) {
        w[i] = exp(logw[i] - maxlw);
        sumw += w[i];
    }

    // Self-normalized estimator of E[X]
    double num = 0.0;
    for (int i = 0; i < N; ++i) num += w[i] * xs[i];
    double est_EX = num / sumw;

    // Effective Sample Size (ESS)
    double sumw2 = 0.0; for (double wi : w) sumw2 += wi * wi;
    double ESS = (sumw * sumw) / sumw2;

    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Self-normalized IS estimate of E[X]: " << est_EX << "\n";
    cout << "Sum of weights (unnormalized): " << sumw << "\n";
    cout << "Effective Sample Size (ESS): " << ESS << " / " << N << "\n";

    return 0;
}

We cannot easily sample from the bimodal target, known only up to proportionality. We draw samples from a wide normal proposal, compute self-normalized importance weights in log-space, and estimate E[X]. We also report ESS to diagnose weight concentration.

Time: O(N) · Space: O(N) for storing samples and weights
From weighted to unweighted samples: resampling for Beta(3,2) target using Uniform proposal
#include <bits/stdc++.h>
using namespace std;

// Systematic resampling: input normalized weights (sum to 1); returns N indices
vector<int> systematic_resample(const vector<double>& w, std::mt19937& rng) {
    int N = (int)w.size();
    vector<int> idx(N);
    vector<double> cdf(N);
    partial_sum(w.begin(), w.end(), cdf.begin());
    std::uniform_real_distribution<double> U(0.0, 1.0 / N);
    double u0 = U(rng);
    int i = 0;
    for (int j = 0; j < N; ++j) {
        double uj = u0 + (double)j / N;
        while (i < N - 1 && uj > cdf[i]) ++i;
        idx[j] = i;
    }
    return idx;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    const int N = 50000;

    // Target: Beta(3,2) on [0,1]; up to a constant, \tilde{p}(x) = x^2 (1-x)
    // Proposal: Uniform(0,1); q(x) = 1 on [0,1]
    std::mt19937 rng((std::random_device())());
    std::uniform_real_distribution<double> U01(0.0, 1.0);

    vector<double> x(N), w(N);
    for (int i = 0; i < N; ++i) {
        x[i] = U01(rng);
        // Unnormalized weight proportional to target / proposal = x^2 (1-x)
        w[i] = x[i] * x[i] * (1.0 - x[i]);
    }

    // Normalize weights
    double sumw = accumulate(w.begin(), w.end(), 0.0);
    for (double &wi : w) wi /= sumw;

    // Weighted estimator of E[X]
    double EX_weighted = 0.0;
    for (int i = 0; i < N; ++i) EX_weighted += w[i] * x[i];

    // ESS
    double sumw2 = 0.0; for (double wi : w) sumw2 += wi * wi;
    double ESS = 1.0 / sumw2;

    // Resample to get approximately unweighted samples from target
    vector<int> idx = systematic_resample(w, rng);
    vector<double> x_resamp(N);
    for (int i = 0; i < N; ++i) x_resamp[i] = x[idx[i]];

    // Unweighted estimator from resampled particles
    double EX_resamp = accumulate(x_resamp.begin(), x_resamp.end(), 0.0) / N;

    // True E[X] for Beta(3,2) is alpha/(alpha+beta) = 3/5 = 0.6
    cout.setf(std::ios::fixed); cout << setprecision(6);
    cout << "Weighted E[X]: " << EX_weighted << "\n";
    cout << "Resampled E[X]: " << EX_resamp << "\n";
    cout << "ESS (out of N): " << ESS << " / " << N << "\n";
    cout << "True E[X]: 0.600000\n";

    return 0;
}

We target Beta(3,2) using a Uniform proposal on [0,1]. We compute normalized importance weights proportional to x^2(1-x), estimate E[X] with a weighted average, and then perform systematic resampling to obtain an unweighted sample set. Both estimates should be close to the true mean 0.6, and ESS quantifies weight concentration.

Time: O(N) · Space: O(N)
Tags: importance sampling · proposal distribution · self-normalized · effective sample size · variance reduction · rare event · log-sum-exp · resampling · monte carlo · bayesian inference · particle filter · optimal proposal · weight degeneracy · gaussian tail · c++ implementation