
Energy-Based Models (EBM)

Key Points

  • Energy-Based Models (EBMs) define probabilities through an energy landscape: low energy means high probability, with p(x) = exp(-E(x)) / Z.
  • The partition function Z is a normalizing constant obtained by integrating exp(-E(x)) over all x, and it is usually intractable to compute exactly in high dimensions.
  • Training EBMs by maximum likelihood leads to a positive phase (lowering energy on data) and a negative phase (raising energy on model samples).
  • Sampling from EBMs typically requires Markov Chain Monte Carlo (MCMC) methods such as Langevin dynamics or Metropolis–Hastings.
  • The score āˆ‡_x log p(x) equals āˆ’āˆ‡_x E(x), which is useful for sampling and score-matching training objectives.
  • Numerical normalization and exact sampling are feasible in 1D but become impractical in higher dimensions, where MCMC is preferred.
  • Properly shaping the energy (e.g., ensuring it grows to +āˆž) is critical so that Z is finite and the model defines a valid probability distribution.
  • In C++, you can implement toy EBMs, train them with contrastive divergence, and sample using short-run Langevin dynamics to visualize learned distributions.

Prerequisites

  • Probability density and distributions — Understanding densities, normalization, and expectations is essential for interpreting p(x) and Z.
  • Multivariable calculus — Gradients with respect to parameters and inputs (āˆ‡_Īø E and āˆ‡_x E) are central to training and sampling.
  • Markov Chain Monte Carlo (MCMC) basics — Sampling from EBMs typically uses Langevin dynamics or Metropolis–Hastings.
  • Optimization and stochastic gradient descent — Maximum likelihood training of EBMs relies on iterative gradient-based updates.
  • Numerical integration — Approximating Z in low dimensions uses quadrature such as Simpson's rule.
  • C++ programming and random number generation — Implementing EBMs requires careful use of RNGs, loops, and numeric stability.
  • Linear algebra (optional for higher dimensions) — Vector gradients and energies like quadratic forms appear in multivariate EBMs.
  • Basic statistics — Evaluating sample quality with means, variances, and histograms helps diagnose models.

Detailed Explanation


01 Overview

Energy-Based Models (EBMs) are probabilistic models that assign an energy score to each configuration x. Instead of directly modeling the probability density p(x), EBMs define an unnormalized density using an energy function E(x; Īø) so that states with low energy are more likely. The normalized probability is obtained by exponentiating the negative energy and dividing by a normalizing constant, the partition function Z(Īø), which sums or integrates over all possible states. This framework unifies many models from statistical physics and machine learning: the lower the energy, the more probable the state. EBMs are appealing because you can specify the structure of the energy function without needing to express Z explicitly. However, computing Z or its gradient is generally intractable in high dimensions, so EBMs rely on approximate training and sampling techniques such as MCMC (e.g., Langevin dynamics) and contrastive divergence. EBMs can represent complicated, multi-modal distributions and can be extended to conditional settings (energies over pairs (x, y)). They are used for generative modeling, out-of-distribution detection, and structured prediction. The core trade-off is flexibility versus computational cost: you gain expressive power at the price of more challenging normalization and sampling.

02 Intuition & Analogies

Imagine a mountain landscape where each location x corresponds to a possible data point. The elevation at x is the energy E(x): deep valleys correspond to likely data (low energy), and steep mountains correspond to unlikely data (high energy). If you dropped a marble at a random spot and let gravity plus a little random shaking move it around, it would spend most of its time in valleys. That behavior mirrors sampling from p(x) āˆ exp(āˆ’E(x)): low energy regions get visited more because they have higher probability. The partition function Z is like the total ā€œvolumeā€ of all the valleys after exponentiating the negative elevations; it ensures probabilities sum (or integrate) to 1. In low dimensions, you could carefully measure this total volume by gridding the terrain, but in high dimensions the terrain is too vast to scan exactly. Instead, you explore it with guided random walks (MCMC), nudging the marble downhill (following āˆ’āˆ‡E) while adding some noise so it doesn’t get stuck. Training an EBM is like reshaping the terrain: you press down to deepen valleys where your real data lives (positive phase) and push up to raise terrain where your model tends to wander but where real data is scarce (negative phase). If you press uniformly everywhere, probabilities don’t change because Z adjusts. If the terrain doesn’t rise to infinity as you move far away, the total volume becomes infinite and probabilities can’t be normalized—so a good E(x) must go to +āˆž at the edges. EBMs thus boil down to designing a meaningful landscape and learning to sculpt it using samples.

03 Formal Definition

An Energy-Based Model specifies a non-negative density on a measurable space X through an energy function E_Īø: X → R. The model distribution is defined as p_Īø(x) = exp(āˆ’E_Īø(x)) / Z(Īø), where the partition function is Z(Īø) = ∫_X exp(āˆ’E_Īø(x)) dx for continuous X, or Z(Īø) = Ī£_{x∈X} exp(āˆ’E_Īø(x)) for discrete X. Maximum likelihood estimation (MLE) maximizes the log-likelihood ā„“(Īø) = Ī£_{i=1}^n log p_Īø(x_i) = āˆ’Ī£_{i=1}^n E_Īø(x_i) āˆ’ n log Z(Īø). Its gradient decomposes into a positive phase and a negative phase: āˆ‡_Īø ā„“(Īø) = āˆ’E_data[āˆ‡_Īø E_Īø(x)] + E_{p_Īø}[āˆ‡_Īø E_Īø(x)]. Sampling from p_Īø often uses Markov chains with stationary distribution p_Īø, such as overdamped Langevin dynamics: x_{t+1} = x_t āˆ’ ε āˆ‡_x E_Īø(x_t) + √(2ε) ξ_t, where ξ_t ∼ N(0, I). Alternative training includes score matching, which leverages the identity āˆ‡_x log p_Īø(x) = āˆ’āˆ‡_x E_Īø(x) to bypass Z(Īø) in the objective. For validity, E_Īø(x) must ensure Z(Īø) < āˆž (e.g., coercive growth E_Īø(x) → +āˆž as ā€–xā€– → āˆž).

04 When to Use

Use EBMs when you want flexible generative models that can represent complex, multi-modal distributions without committing to a tractable normalizer. They are useful when scoring configurations is easier than normalizing, such as in image modeling where a neural network can produce an energy for a pixel array. EBMs are appropriate when you can afford approximate inference via MCMC and when obtaining exact likelihoods is less critical than generating plausible samples or ranking inputs by energy. They shine in out-of-distribution detection: energies can act as anomaly scores, with higher energies indicating atypical inputs. EBMs also work well for structured prediction by defining energies over pairs (x, y), letting you perform prediction by searching for y that minimizes E(x, y). In low dimensions (e.g., 1D), EBMs allow exact or high-accuracy normalization by numerical integration, which is educational for understanding Z. Choose EBMs if you can design an energy landscape that encodes prior knowledge and if your application can tolerate the computational cost of sampling during training and inference. Avoid EBMs when you need exact, fast likelihoods in high dimensions or when MCMC mixing is prohibitively slow.

āš ļøCommon Mistakes

  • Ignoring normalizability: choosing E(x) that does not grow to +āˆž makes Z = āˆž and the probabilities invalid. Include stabilizing terms (e.g., Ī»x^4) or constrain parameters so the energy rises at infinity.
  • Sign errors in gradients: the MLE gradient has positive and negative phases; confusing the signs pushes probability mass in the wrong direction. Double-check the NLL gradient: āˆ‡_Īø L = E_data[āˆ‡_Īø E] āˆ’ E_{p_Īø}[āˆ‡_Īø E].
  • Poor MCMC settings: step sizes that are too large explode; too small under-mix. Always include the noise term √(2ε) ξ in Langevin updates and use burn-in. Monitor acceptance rates if using Metropolis–Hastings.
  • Collapsing parameterizations: a free constant offset c in E cancels in p(x) and can drift during training. Fix the offset or regularize to prevent ill-conditioning.
  • Estimating Z naively in high dimensions: simple Monte Carlo with poor proposals has huge variance. Prefer annealed importance sampling, or avoid Z entirely via score matching or contrastive divergence.
  • Short-run CD misuse: very few MCMC steps can bias learning; use persistent chains or increase the number of steps gradually.
  • Evaluating only training loss: also check sample quality, mixing diagnostics, and energy histograms for data vs. model samples.
  • Overfitting energy wells: strongly overfitted energies create spiky distributions that are hard to sample. Use regularization and noise-robustness tests.

Key Formulas

EBM density

pθ​(x)=Z(Īø)exp(āˆ’Eθ​(x))​

Explanation: The probability of x is the exponential of negative energy divided by the partition function. Lower energy implies higher probability after normalization.

Partition function (continuous)

Z(Īø) = ∫_X exp(āˆ’E_Īø(x)) dx

Explanation: The partition function is the integral of the unnormalized density over all x. It ensures probabilities integrate to one.

Log-likelihood

ā„“(Īø) = Ī£_{i=1}^n log p_Īø(x_i) = āˆ’Ī£_{i=1}^n E_Īø(x_i) āˆ’ n log Z(Īø)

Explanation: Maximum likelihood increases probability on data by lowering their energies and accounting for the normalization through log Z.

Positive/negative phase gradient

āˆ‡_Īø ā„“(Īø) = āˆ’E_data[āˆ‡_Īø E_Īø(x)] + E_{p_Īø}[āˆ‡_Īø E_Īø(x)]

Explanation: The gradient splits into a term that encourages low energy on data and a term that discourages low energy where the model places mass.

NLL gradient

L(Īø) = āˆ’ā„“(Īø)  ⇒  āˆ‡_Īø L(Īø) = E_data[āˆ‡_Īø E_Īø(x)] āˆ’ E_{p_Īø}[āˆ‡_Īø E_Īø(x)]

Explanation: When minimizing negative log-likelihood, the sign flips: descend by reducing data energy and increasing model-sample energy.

Overdamped Langevin update

x_{t+1} = x_t āˆ’ ε āˆ‡_x E_Īø(x_t) + √(2ε) ξ_t,  ξ_t ∼ N(0, I)

Explanation: A sampling step that moves downhill in energy while adding Gaussian noise for exploration. The stationary distribution is the EBM density.

Metropolis–Hastings acceptance

α = min(1, [exp(āˆ’E(x′)) q(x | x′)] / [exp(āˆ’E(x)) q(x′ | x)])

Explanation: A new state x′ is accepted with this probability to maintain detailed balance. If proposals are symmetric, the q terms cancel.

Score of an EBM

āˆ‡_x log p_Īø(x) = āˆ’āˆ‡_x E_Īø(x)

Explanation: The gradient of log-density with respect to x is just negative energy gradient because Z does not depend on x.

Score matching objective

J(Īø) = E_{p_data}[½ ā€–āˆ‡_x log p_Īø(x)ā€–Ā² + Ī”_x log p_Īø(x)] = E_{p_data}[½ ā€–āˆ‡_x E_Īø(x)ā€–Ā² āˆ’ Ī”_x E_Īø(x)] + const

Explanation: Score matching fits the model’s score to the data’s score without computing Z. It replaces log-density derivatives with those of the energy.

Importance sampling for Z

Z(Īø) = E_q[exp(āˆ’E_Īø(X)) / q(X)]

Explanation: The partition function can be estimated by sampling from a proposal q(x) and averaging weighted terms. Good proposals are critical for low variance.

KL divergence

D_KL(p_data ā€– p_Īø) = E_{p_data}[log p_data(x) āˆ’ log p_Īø(x)]

Explanation: Maximum likelihood minimizes this divergence, aligning the model with the data distribution. The constant term E[log p_{data}] does not affect optimization.

Complexity Analysis

The main computational bottleneck of EBMs is approximating the negative phase expectation. For maximum likelihood with contrastive divergence or persistent CD, each training step typically requires K MCMC transitions per negative sample. If evaluating the energy and its gradient with respect to x takes O(C_E) time (e.g., O(d) for a simple polynomial in d dimensions, or proportional to neural network inference cost), then generating M negative samples for B batches per epoch costs O(epochs Ɨ B Ɨ M Ɨ K Ɨ C_E). For toy 1D models with small K (10–50), this is modest, but in high dimensions with neural energies, C_E can dominate. Memory usage during sampling is O(M Ɨ d) to store the current states of the chains plus O(P) for parameters, where P is the number of model parameters. Computing the partition function Z exactly requires integrating exp(āˆ’E(x)) over the entire domain. In 1D, numerical quadrature with N grid points costs O(N Ɨ C_E) time and O(N) memory, which is feasible for N in the thousands. In higher dimensions, naive quadrature grows exponentially with dimension (the curse of dimensionality), making it intractable. Importance sampling can estimate Z in O(M Ɨ C_E) but may have very high variance unless the proposal q is well matched to the target. Langevin dynamics costs O(K Ɨ C_E) per sample and mixes well near modes but may struggle to traverse widely separated modes without tempered transitions. Metropolis–Hastings adds an acceptance step with negligible overhead if proposal densities are easy to compute. In C++ implementations, the constant factors include random number generation and math library calls (exp, pow). Optimizations such as vectorization, caching gradients, and using persistent chains reduce overhead. Overall, EBMs trade normalization difficulty (Z) for flexible modeling, with time dominated by repeated energy/gradient evaluations during MCMC-based training and sampling.

Code Examples

1D EBM training with Contrastive Divergence and Langevin sampling
#include <bits/stdc++.h>
using namespace std;

// Toy 1D EBM with energy E(x) = a*x^2 + b*x + lambda*x^4, where lambda = exp(gamma) > 0
struct EBM1D {
    double a;     // quadratic coefficient
    double b;     // linear coefficient
    double gamma; // log of lambda to keep lambda > 0

    EBM1D(double a_=0.1, double b_=0.0, double gamma_=-1.0) : a(a_), b(b_), gamma(gamma_) {}

    inline double lambda() const { return exp(gamma); }

    // Energy function E(x)
    inline double energy(double x) const {
        double l = lambda();
        return a*x*x + b*x + l*pow(x, 4);
    }

    // dE/dx used in Langevin dynamics
    inline double dEdx(double x) const {
        double l = lambda();
        return 2.0*a*x + b + 4.0*l*pow(x, 3);
    }

    // Gradients of E wrt parameters at a single x
    // dE/da = x^2, dE/db = x, dE/dgamma = dE/dlambda * dlambda/dgamma = x^4 * lambda
    inline array<double,3> grad_params(double x) const {
        double l = lambda();
        return {x*x, x, l*pow(x,4)};
    }
};

// Gaussian RNG helper
struct RNG {
    mt19937_64 gen;
    normal_distribution<double> gauss;
    uniform_real_distribution<double> uni;
    RNG(uint64_t seed=42) : gen(seed), gauss(0.0,1.0), uni(0.0,1.0) {}
    inline double randn() { return gauss(gen); }
    inline double randu() { return uni(gen); }
};

// One step of overdamped Langevin dynamics: x <- x - eps * dE/dx + sqrt(2*eps) * N(0,1)
inline double langevin_step(const EBM1D& model, double x, double eps, RNG &rng) {
    double noise = sqrt(2.0*eps) * rng.randn();
    return x - eps * model.dEdx(x) + noise;
}

// Generate synthetic data: mixture of two Gaussians
vector<double> make_data(size_t N, RNG &rng) {
    vector<double> data;
    data.reserve(N);
    for (size_t i = 0; i < N; ++i) {
        // 50/50 mixture of N(-2, 0.5^2) and N(2, 0.5^2)
        bool left = (rng.randu() < 0.5);
        double mean = left ? -2.0 : 2.0;
        double stdv = 0.5;
        data.push_back(mean + stdv * rng.randn());
    }
    return data;
}

// Train with Persistent Contrastive Divergence (PCD)
void train_ebm(EBM1D &model, const vector<double> &data, int epochs, int batch_size,
               int K, double eps, double lr, RNG &rng) {
    // initialize a pool of negative samples (persistent chains)
    vector<double> neg_pool(batch_size);
    for (int i = 0; i < batch_size; ++i) neg_pool[i] = rng.randn();

    size_t N = data.size();
    vector<int> indices(N);
    iota(indices.begin(), indices.end(), 0);

    for (int epoch = 1; epoch <= epochs; ++epoch) {
        shuffle(indices.begin(), indices.end(), rng.gen);
        double loss_proxy = 0.0; // average E(data) - E(model_sample)

        for (size_t start = 0; start < N; start += batch_size) {
            size_t end = min(start + (size_t)batch_size, N);
            int bs = (int)(end - start);

            // Positive phase: gradients on data
            array<double,3> pos = {0.0, 0.0, 0.0};
            for (int i = 0; i < bs; ++i) {
                double x = data[indices[start + i]];
                auto g = model.grad_params(x);
                pos[0] += g[0]; pos[1] += g[1]; pos[2] += g[2];
                loss_proxy += model.energy(x);
            }
            for (int j = 0; j < 3; ++j) pos[j] /= bs;

            // Negative phase: update persistent chains and compute gradients
            array<double,3> neg = {0.0, 0.0, 0.0};
            for (int i = 0; i < bs; ++i) {
                double x = neg_pool[i];
                for (int t = 0; t < K; ++t) x = langevin_step(model, x, eps, rng);
                neg_pool[i] = x; // persist
                auto g = model.grad_params(x);
                neg[0] += g[0]; neg[1] += g[1]; neg[2] += g[2];
                loss_proxy -= model.energy(x);
            }
            for (int j = 0; j < 3; ++j) neg[j] /= bs;

            // Gradient of NLL: pos - neg
            model.a     -= lr * (pos[0] - neg[0]);
            model.b     -= lr * (pos[1] - neg[1]);
            model.gamma -= lr * (pos[2] - neg[2]); // keeps lambda=exp(gamma)>0
        }

        if (epoch % 10 == 0) {
            cout << "Epoch " << epoch
                 << " | a=" << model.a
                 << " b=" << model.b
                 << " lambda=" << model.lambda()
                 << " | loss_proxy=" << (loss_proxy / N) << "\n";
        }
    }
}

// Sample from the trained model via Langevin and report mean/variance
pair<double,double> sample_stats(const EBM1D &model, int num_samples, int burn_in, int thinning, RNG &rng) {
    double x = rng.randn();
    double eps = 0.01;
    // burn-in
    for (int i = 0; i < burn_in; ++i) x = langevin_step(model, x, eps, rng);

    double mean = 0.0, m2 = 0.0; int count = 0;
    for (int i = 0; i < num_samples * thinning; ++i) {
        x = langevin_step(model, x, eps, rng);
        if ((i + 1) % thinning == 0) {
            ++count;
            double delta = x - mean;
            mean += delta / count;
            m2 += delta * (x - mean);
        }
    }
    double var = (count > 1) ? (m2 / (count - 1)) : 0.0;
    return {mean, var};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    // 1) Create synthetic dataset
    auto data = make_data(5000, rng);

    // 2) Initialize model
    EBM1D model(0.05, 0.0, -0.5); // a, b, gamma (lambda=exp(gamma))

    // 3) Train with PCD + Langevin
    int epochs = 100;
    int batch_size = 256;
    int K = 30;        // Langevin steps per update
    double eps = 0.01; // Langevin step size
    double lr = 1e-3;  // learning rate

    train_ebm(model, data, epochs, batch_size, K, eps, lr, rng);

    // 4) Sample from trained model and print stats
    auto [mean, var] = sample_stats(model, 5000, /*burn_in=*/2000, /*thinning=*/5, rng);

    // Compute empirical data stats for comparison
    double dmean = 0.0, dvar = 0.0;
    for (double x : data) dmean += x;
    dmean /= data.size();
    for (double x : data) dvar += (x - dmean) * (x - dmean);
    dvar /= (double)(data.size() - 1);

    cout << fixed << setprecision(4);
    cout << "Trained params: a=" << model.a << ", b=" << model.b << ", lambda=" << model.lambda() << "\n";
    cout << "Data mean=" << dmean << ", var=" << dvar << "\n";
    cout << "Model samples mean=" << mean << ", var=" << var << "\n";
    return 0;
}

This program trains a simple 1D EBM with a stabilizing quartic term so that the model is normalizable. It uses Persistent Contrastive Divergence (PCD): a pool of negative samples is updated for K Langevin steps per iteration to approximate the model expectation. The gradient of the negative log-likelihood is computed as E_data[āˆ‚E/āˆ‚Īø] āˆ’ E_model[āˆ‚E/āˆ‚Īø], and parameters are updated with SGD. After training, we draw Langevin samples and report mean and variance to compare with the synthetic data. The code highlights the positive/negative phase split, Langevin sampling, and the importance of making the energy grow at infinity via Ī»x^4.

Time: O(epochs Ɨ (N/batch_size) Ɨ batch_size Ɨ K) ā‰ˆ O(epochs Ɨ N Ɨ K) for fixed evaluation cost per step
Space: O(batch_size) for persistent chains plus O(1) for parameters
1D EBM: Numerical partition function and exact sampling by inverse CDF on a grid
#include <bits/stdc++.h>
using namespace std;

// Define a normalizable 1D energy function, e.g., E(x) = 0.5*x^2 + 0.1*x^4
struct Energy1D {
    inline double operator()(double x) const {
        return 0.5*x*x + 0.1*pow(x, 4);
    }
};

// Simpson's rule integration over [-L, L] with N even intervals
// Returns approximation of integral f(x) dx for f = exp(-E(x))
double simpson_partition(const Energy1D &E, double L, int N) {
    if (N % 2 == 1) ++N; // make even
    double h = (2.0*L) / N;
    auto f = [&](double x){ return exp(-E(x)); };
    double sum = f(-L) + f(L);
    for (int i = 1; i < N; ++i) {
        double x = -L + i*h;
        sum += (i % 2 ? 4.0 : 2.0) * f(x);
    }
    return sum * h / 3.0;
}

// Build PDF and CDF on a grid and sample by inverse transform (binary search)
struct GridSampler {
    vector<double> xs, pdf, cdf;
    double L; int N; double Z;

    GridSampler(const Energy1D &E, double L_, int N_) : L(L_), N(N_) {
        if (N % 2 == 1) ++N; // even for Simpson
        xs.resize(N+1); pdf.resize(N+1); cdf.resize(N+1);
        double h = (2.0*L) / N;
        for (int i = 0; i <= N; ++i) xs[i] = -L + i*h;
        // approximate Z by Simpson weights on this grid
        auto f = [&](double x){ return exp(-E(x)); };
        double sum = f(xs[0]) + f(xs[N]);
        for (int i = 1; i < N; ++i) sum += (i % 2 ? 4.0 : 2.0) * f(xs[i]);
        Z = sum * h / 3.0;
        // fill normalized pdf and cdf (trapezoidal accumulation)
        for (int i = 0; i <= N; ++i) pdf[i] = f(xs[i]) / Z;
        cdf[0] = 0.0;
        for (int i = 1; i <= N; ++i) {
            // trapezoid between xs[i-1] and xs[i]
            cdf[i] = cdf[i-1] + 0.5*(pdf[i-1] + pdf[i])*(xs[i] - xs[i-1]);
        }
        // normalize any numerical drift
        for (int i = 0; i <= N; ++i) cdf[i] /= cdf[N];
    }

    template<class URNG>
    double sample(URNG &rng) const {
        uniform_real_distribution<double> U(0.0, 1.0);
        double u = U(rng);
        // binary search in cdf
        int lo = 0, hi = N;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        int i = max(1, lo);
        // linear interpolate within the bin
        double x0 = xs[i-1], x1 = xs[i];
        double c0 = cdf[i-1], c1 = cdf[i];
        double t = (u - c0) / max(1e-12, (c1 - c0));
        return x0 + t * (x1 - x0);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    Energy1D E;
    double L = 6.0; // integration domain [-L, L]
    int N = 4000;   // grid resolution (even)

    double Z = simpson_partition(E, L, N);
    cout << fixed << setprecision(6);
    cout << "Approximate partition function Z = " << Z << " on [-" << L << ", " << L << "] with N=" << N << "\n";

    // Build sampler and draw samples
    GridSampler sampler(E, L, N);
    mt19937_64 rng(123);
    int M = 100000;
    double mean = 0.0, m2 = 0.0; int count = 0;
    for (int i = 0; i < M; ++i) {
        double x = sampler.sample(rng);
        ++count;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }
    double var = m2 / (count - 1);
    cout << "Samples: mean=" << mean << ", var=" << var << " (exact normalization via grid)\n";

    // Print a few pdf values
    cout << "p(0) approx = " << exp(-E(0.0))/Z << "\n";
    cout << "p(1) approx = " << exp(-E(1.0))/Z << "\n";
    cout << "p(2) approx = " << exp(-E(2.0))/Z << "\n";

    return 0;
}

This example illustrates the partition function and exact sampling in 1D. We numerically integrate exp(āˆ’E(x)) with Simpson’s rule over [āˆ’L, L] to estimate Z and then construct a grid-based PDF/CDF. We sample by inverse transform (binary search + interpolation), which yields near-exact samples in 1D. This is educational for visualizing normalization and probabilities; it does not scale to high dimensions but reinforces the definition p(x) = exp(āˆ’E(x))/Z.

Time: O(N) to build the grid and compute Z; O(log N) per sample due to binary search
Space: O(N) to store grid, PDF, and CDF
#energy-based models#partition function#langevin dynamics#contrastive divergence#markov chain monte carlo#score matching#boltzmann distribution#unnormalized density#metropolis-hastings#numerical integration#normalizing constant#persistent cd#annealed importance sampling#log-likelihood#probabilistic modeling