Energy-Based Models (EBM)
Key Points
- Energy-Based Models (EBMs) define probabilities through an energy landscape: low energy means high probability, with p(x) = exp(-E(x)) / Z.
- The partition function Z is a normalizing constant obtained by integrating exp(-E(x)) over all x, and it is usually intractable to compute exactly in high dimensions.
- Training EBMs by maximum likelihood leads to a positive phase (lowering energy on data) and a negative phase (raising energy on model samples).
- Sampling from EBMs typically requires Markov Chain Monte Carlo (MCMC) methods such as Langevin dynamics or Metropolis–Hastings.
- The score ∇_x log p(x) equals -∇_x E(x), which is useful for sampling and score-matching training objectives.
- Numerical normalization and exact sampling are feasible in 1D but become impractical in higher dimensions, where MCMC is preferred.
- Properly shaping the energy (e.g., ensuring it grows to +∞) is critical so that Z is finite and the model defines a valid probability distribution.
- In C++, you can implement toy EBMs, train them with contrastive divergence, and sample using short-run Langevin dynamics to visualize learned distributions.
Prerequisites
- Probability density and distributions — Understanding densities, normalization, and expectations is essential for interpreting p(x) and Z.
- Multivariable calculus — Gradients with respect to parameters and inputs (∇_θE and ∇_xE) are central to training and sampling.
- Markov Chain Monte Carlo (MCMC) basics — Sampling from EBMs typically uses Langevin dynamics or Metropolis–Hastings.
- Optimization and stochastic gradient descent — Maximum likelihood training of EBMs relies on iterative gradient-based updates.
- Numerical integration — Approximating Z in low dimensions uses quadrature such as Simpson's rule.
- C++ programming and random number generation — Implementing EBMs requires careful use of RNGs, loops, and numeric stability.
- Linear algebra (optional for higher dimensions) — Vector gradients and energies like quadratic forms appear in multivariate EBMs.
- Basic statistics — Evaluating sample quality with means, variances, and histograms helps diagnose models.
Detailed Explanation
01 Overview
Energy-Based Models (EBMs) are probabilistic models that assign an energy score to each configuration x. Instead of directly modeling the probability density p(x), EBMs define an unnormalized density using an energy function E(x; θ) so that states with low energy are more likely. The normalized probability is obtained by exponentiating the negative energy and dividing by a normalizing constant, the partition function Z(θ), which sums or integrates over all possible states. This framework unifies many models from statistical physics and machine learning: the lower the energy, the more probable the state. EBMs are appealing because you can specify the structure of the energy function without needing to express Z explicitly. However, computing Z or its gradient is generally intractable in high dimensions, so EBMs rely on approximate training and sampling techniques such as MCMC (e.g., Langevin dynamics) and contrastive divergence. EBMs can represent complicated, multi-modal distributions and can be extended to conditional settings (energies over pairs (x, y)). They are used for generative modeling, out-of-distribution detection, and structured prediction. The core trade-off is flexibility versus computational cost: you gain expressive power at the price of more challenging normalization and sampling.
02 Intuition & Analogies
Imagine a mountain landscape where each location x corresponds to a possible data point. The elevation at x is the energy E(x): deep valleys correspond to likely data (low energy), and steep mountains correspond to unlikely data (high energy). If you dropped a marble at a random spot and let gravity plus a little random shaking move it around, it would spend most of its time in valleys. That behavior mirrors sampling from p(x) ∝ exp(-E(x)): low-energy regions get visited more because they have higher probability. The partition function Z is like the total "volume" of all the valleys after exponentiating the negative elevations; it ensures probabilities sum (or integrate) to 1. In low dimensions, you could carefully measure this total volume by gridding the terrain, but in high dimensions the terrain is too vast to scan exactly. Instead, you explore it with guided random walks (MCMC), nudging the marble downhill (following -∇E) while adding some noise so it doesn't get stuck. Training an EBM is like reshaping the terrain: you press down to deepen valleys where your real data lives (positive phase) and push up to raise terrain where your model tends to wander but where real data is scarce (negative phase). If you press uniformly everywhere, probabilities don't change because Z adjusts. If the terrain doesn't rise to infinity as you move far away, the total volume becomes infinite and probabilities can't be normalized, so a good E(x) must go to +∞ at the edges. EBMs thus boil down to designing a meaningful landscape and learning to sculpt it using samples.
03 Formal Definition
An EBM specifies a real-valued energy function E(x; θ) over states x with parameters θ and defines the density p(x; θ) = exp(-E(x; θ)) / Z(θ), where the partition function Z(θ) = ∫ exp(-E(x; θ)) dx (a sum in the discrete case) normalizes the density. The model is valid only when Z(θ) is finite, which requires exp(-E) to be integrable; in practice this means designing E to grow without bound as ‖x‖ → ∞. Maximum likelihood training maximizes log p(x; θ) = -E(x; θ) - log Z(θ) averaged over the data.
04 When to Use
Use EBMs when you want flexible generative models that can represent complex, multi-modal distributions without committing to a tractable normalizer. They are useful when scoring configurations is easier than normalizing, such as in image modeling where a neural network can produce an energy for a pixel array. EBMs are appropriate when you can afford approximate inference via MCMC and when obtaining exact likelihoods is less critical than generating plausible samples or ranking inputs by energy. They shine in out-of-distribution detection: energies can act as anomaly scores, with higher energies indicating atypical inputs. EBMs also work well for structured prediction by defining energies over pairs (x, y), letting you perform prediction by searching for y that minimizes E(x, y). In low dimensions (e.g., 1D), EBMs allow exact or high-accuracy normalization by numerical integration, which is educational for understanding Z. Choose EBMs if you can design an energy landscape that encodes prior knowledge and if your application can tolerate the computational cost of sampling during training and inference. Avoid EBMs when you need exact, fast likelihoods in high dimensions or when MCMC mixing is prohibitively slow.
⚠️ Common Mistakes
- Ignoring normalizability: Choosing E(x) that does not grow to +∞ causes Z = ∞ and invalid probabilities. Include stabilizing terms (e.g., λ|x|^4) or constrain parameters so energy rises at infinity.
- Sign errors in gradients: The MLE gradient has positive and negative phases. Confusing signs leads to pushing probability mass in the wrong direction. Double-check: ∇_θ NLL = E_data[∇_θE] - E_{p_θ}[∇_θE].
- Poor MCMC settings: Step sizes too large explode; too small under-mix. Always include the noise term in Langevin (√(2ε)·ξ) and use burn-in. Monitor acceptance if using Metropolis–Hastings.
- Collapsing parameterizations: Including a free constant offset c in E can drift because it cancels in p(x). Fix the offset or regularize to prevent ill-conditioning.
- Estimating Z naively in high dimensions: Simple Monte Carlo with poor proposals has huge variance. Prefer annealed importance sampling or avoid Z by score matching or contrastive divergence.
- Short-run CD misuse: Very few MCMC steps can bias learning; use persistent chains or increase steps gradually.
- Evaluating only training loss: Also check sample quality, mixing diagnostics, and energy histograms for data vs. model samples.
- Overfitting energy wells: Strongly overfitted energies create spiky distributions that are hard to sample. Use regularization and noise robustness tests.
Key Formulas
EBM density
p(x) = exp(-E(x)) / Z
Explanation: The probability of x is the exponential of negative energy divided by the partition function. Lower energy implies higher probability after normalization.
Partition function (continuous)
Z = ∫ exp(-E(x)) dx
Explanation: The partition function is the integral of the unnormalized density over all x. It ensures probabilities integrate to one.
Log-likelihood
log p(x; θ) = -E(x; θ) - log Z(θ)
Explanation: Maximum likelihood increases probability on data by lowering their energies and accounting for the normalization through log Z.
Positive/negative phase gradient
∇_θ log p(x) = -∇_θE(x) + E_{p_θ}[∇_θE(x)]
Explanation: The gradient splits into a term that encourages low energy on data and a term that discourages low energy where the model places mass.
NLL gradient
∇_θ NLL = E_data[∇_θE(x)] - E_{p_θ}[∇_θE(x)]
Explanation: When minimizing negative log-likelihood, the sign flips: descend by reducing data energy and increasing model-sample energy.
Overdamped Langevin update
x_{t+1} = x_t - ε∇_xE(x_t) + √(2ε) ξ_t, with ξ_t ~ N(0, I)
Explanation: A sampling step that moves downhill in energy while adding Gaussian noise for exploration. The stationary distribution is the EBM density.
Metropolis–Hastings acceptance
α(x → x') = min(1, [exp(-E(x')) q(x | x')] / [exp(-E(x)) q(x' | x)])
Explanation: A new state x' is accepted with this probability to maintain detailed balance. If proposals are symmetric, the q terms cancel.
Score of an EBM
∇_x log p(x) = -∇_xE(x)
Explanation: The gradient of log-density with respect to x is just the negative energy gradient because Z does not depend on x.
Score matching objective
J(θ) = E_{p_data}[ (1/2)‖∇_xE(x)‖² - tr(∇²_xE(x)) ] + const
Explanation: Score matching fits the model's score to the data's score without computing Z. It replaces log-density derivatives with those of the energy.
Importance sampling for Z
Z = E_{x~q}[exp(-E(x)) / q(x)] ≈ (1/M) Σ_{i=1}^{M} exp(-E(x_i)) / q(x_i), with x_i ~ q
Explanation: The partition function can be estimated by sampling from a proposal q(x) and averaging weighted terms. Good proposals are critical for low variance.
KL divergence
KL(p_data ‖ p_θ) = E_{p_data}[log p_data(x)] - E_{p_data}[log p_θ(x)]
Explanation: Maximum likelihood minimizes this divergence, aligning the model with the data distribution. The constant term E_{p_data}[log p_data(x)] does not affect optimization.
Complexity Analysis
With PCD training, each minibatch of size B costs O(B·K) energy and gradient evaluations for K Langevin steps per persistent chain, so one epoch over N data points costs O(N·K). Sampling after training costs O(T) gradient evaluations for a chain of length T (burn-in plus thinned draws). In 1D, Simpson's rule estimates Z with O(N_grid) energy evaluations and O(h^4) error, and grid-based inverse-transform sampling costs O(N_grid) setup plus O(log N_grid) per draw via binary search; this grid approach scales exponentially with dimension, which is why MCMC is preferred beyond low dimensions.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Toy 1D EBM with energy E(x) = a*x^2 + b*x + lambda*x^4, where lambda = exp(gamma) > 0
struct EBM1D {
    double a;      // quadratic coefficient
    double b;      // linear coefficient
    double gamma;  // log of lambda to keep lambda > 0

    EBM1D(double a_ = 0.1, double b_ = 0.0, double gamma_ = -1.0) : a(a_), b(b_), gamma(gamma_) {}

    inline double lambda() const { return exp(gamma); }

    // Energy function E(x)
    inline double energy(double x) const {
        double l = lambda();
        return a*x*x + b*x + l*pow(x, 4);
    }

    // dE/dx used in Langevin dynamics
    inline double dEdx(double x) const {
        double l = lambda();
        return 2.0*a*x + b + 4.0*l*pow(x, 3);
    }

    // Gradients of E wrt parameters at a single x
    // dE/da = x^2, dE/db = x, dE/dgamma = dE/dlambda * dlambda/dgamma = x^4 * lambda
    inline array<double,3> grad_params(double x) const {
        double l = lambda();
        return {x*x, x, l*pow(x, 4)};
    }
};

// Gaussian RNG helper
struct RNG {
    mt19937_64 gen;
    normal_distribution<double> gauss;
    uniform_real_distribution<double> uni;
    RNG(uint64_t seed = 42) : gen(seed), gauss(0.0, 1.0), uni(0.0, 1.0) {}
    inline double randn() { return gauss(gen); }
    inline double randu() { return uni(gen); }
};

// One step of overdamped Langevin dynamics: x <- x - eps * dE/dx + sqrt(2*eps) * N(0,1)
inline double langevin_step(const EBM1D& model, double x, double eps, RNG &rng) {
    double noise = sqrt(2.0*eps) * rng.randn();
    return x - eps * model.dEdx(x) + noise;
}

// Generate synthetic data: mixture of two Gaussians
vector<double> make_data(size_t N, RNG &rng) {
    vector<double> data;
    data.reserve(N);
    for (size_t i = 0; i < N; ++i) {
        // 50/50 mixture of N(-2, 0.5^2) and N(2, 0.5^2)
        bool left = (rng.randu() < 0.5);
        double mean = left ? -2.0 : 2.0;
        double stdv = 0.5;
        data.push_back(mean + stdv * rng.randn());
    }
    return data;
}

// Train with Persistent Contrastive Divergence (PCD)
void train_ebm(EBM1D &model, const vector<double> &data, int epochs, int batch_size,
               int K, double eps, double lr, RNG &rng) {
    // initialize a pool of negative samples (persistent chains)
    vector<double> neg_pool(batch_size);
    for (int i = 0; i < batch_size; ++i) neg_pool[i] = rng.randn();

    size_t N = data.size();
    vector<int> indices(N);
    iota(indices.begin(), indices.end(), 0);

    for (int epoch = 1; epoch <= epochs; ++epoch) {
        shuffle(indices.begin(), indices.end(), rng.gen);
        double loss_proxy = 0.0;  // average E(data) - E(model_sample)

        for (size_t start = 0; start < N; start += batch_size) {
            size_t end = min(start + (size_t)batch_size, N);
            int bs = (int)(end - start);

            // Positive phase: gradients on data
            array<double,3> pos = {0.0, 0.0, 0.0};
            for (int i = 0; i < bs; ++i) {
                double x = data[indices[start + i]];
                auto g = model.grad_params(x);
                pos[0] += g[0]; pos[1] += g[1]; pos[2] += g[2];
                loss_proxy += model.energy(x);
            }
            for (int j = 0; j < 3; ++j) pos[j] /= bs;

            // Negative phase: update persistent chains and compute gradients
            array<double,3> neg = {0.0, 0.0, 0.0};
            for (int i = 0; i < bs; ++i) {
                double x = neg_pool[i];
                for (int t = 0; t < K; ++t) x = langevin_step(model, x, eps, rng);
                neg_pool[i] = x;  // persist
                auto g = model.grad_params(x);
                neg[0] += g[0]; neg[1] += g[1]; neg[2] += g[2];
                loss_proxy -= model.energy(x);
            }
            for (int j = 0; j < 3; ++j) neg[j] /= bs;

            // Gradient of NLL: pos - neg
            model.a     -= lr * (pos[0] - neg[0]);
            model.b     -= lr * (pos[1] - neg[1]);
            model.gamma -= lr * (pos[2] - neg[2]);  // keeps lambda = exp(gamma) > 0
        }

        if (epoch % 10 == 0) {
            cout << "Epoch " << epoch
                 << " | a=" << model.a
                 << " b=" << model.b
                 << " lambda=" << model.lambda()
                 << " | loss_proxy=" << (loss_proxy / N) << "\n";
        }
    }
}

// Sample from the trained model via Langevin and report mean/variance
pair<double,double> sample_stats(const EBM1D &model, int num_samples, int burn_in, int thinning, RNG &rng) {
    double x = rng.randn();
    // burn-in
    double eps = 0.01;
    for (int i = 0; i < burn_in; ++i) x = langevin_step(model, x, eps, rng);

    double mean = 0.0, m2 = 0.0; int count = 0;
    for (int i = 0; i < num_samples * thinning; ++i) {
        x = langevin_step(model, x, eps, rng);
        if ((i + 1) % thinning == 0) {
            ++count;
            double delta = x - mean;
            mean += delta / count;
            m2 += delta * (x - mean);
        }
    }
    double var = (count > 1) ? (m2 / (count - 1)) : 0.0;
    return {mean, var};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    // 1) Create synthetic dataset
    auto data = make_data(5000, rng);

    // 2) Initialize model
    EBM1D model(0.05, 0.0, -0.5);  // a, b, gamma (lambda = exp(gamma))

    // 3) Train with PCD + Langevin
    int epochs = 100;
    int batch_size = 256;
    int K = 30;        // Langevin steps per update
    double eps = 0.01; // Langevin step size
    double lr = 1e-3;  // learning rate

    train_ebm(model, data, epochs, batch_size, K, eps, lr, rng);

    // 4) Sample from trained model and print stats
    auto [mean, var] = sample_stats(model, 5000, /*burn_in=*/2000, /*thinning=*/5, rng);

    // Compute empirical data stats for comparison
    double dmean = 0.0, dvar = 0.0;
    for (double x : data) dmean += x;
    dmean /= data.size();
    for (double x : data) dvar += (x - dmean) * (x - dmean);
    dvar /= (double)(data.size() - 1);

    cout << fixed << setprecision(4);
    cout << "Trained params: a=" << model.a << ", b=" << model.b << ", lambda=" << model.lambda() << "\n";
    cout << "Data mean=" << dmean << ", var=" << dvar << "\n";
    cout << "Model samples mean=" << mean << ", var=" << var << "\n";
    return 0;
}
```
This program trains a simple 1D EBM with a stabilizing quartic term so that the model is normalizable. It uses Persistent Contrastive Divergence (PCD): a pool of negative samples is updated for K Langevin steps per iteration to approximate the model expectation. The gradient of the negative log-likelihood is computed as E_data[∂E/∂θ] - E_model[∂E/∂θ], and parameters are updated with SGD. After training, we draw Langevin samples and report mean and variance to compare with the synthetic data. The code highlights the positive/negative phase split, Langevin sampling, and the importance of making the energy grow at infinity via λx^4.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Define a normalizable 1D energy function, e.g., E(x) = 0.5*x^2 + 0.1*x^4
struct Energy1D {
    inline double operator()(double x) const {
        return 0.5*x*x + 0.1*pow(x, 4);
    }
};

// Simpson's rule integration over [-L, L] with N even intervals
// Returns approximation of integral f(x) dx for f = exp(-E(x))
double simpson_partition(const Energy1D &E, double L, int N) {
    if (N % 2 == 1) ++N;  // make even
    double h = (2.0*L) / N;
    auto f = [&](double x){ return exp(-E(x)); };
    double sum = f(-L) + f(L);
    for (int i = 1; i < N; ++i) {
        double x = -L + i*h;
        sum += (i % 2 ? 4.0 : 2.0) * f(x);
    }
    return sum * h / 3.0;
}

// Build PDF and CDF on a grid and sample by inverse transform (binary search)
struct GridSampler {
    vector<double> xs, pdf, cdf;
    double L; int N; double Z;

    GridSampler(const Energy1D &E, double L_, int N_) : L(L_), N(N_) {
        if (N % 2 == 1) ++N;  // even for Simpson
        xs.resize(N+1); pdf.resize(N+1); cdf.resize(N+1);
        double h = (2.0*L) / N;
        for (int i = 0; i <= N; ++i) xs[i] = -L + i*h;
        // approximate Z by Simpson on this grid
        auto f = [&](double x){ return exp(-E(x)); };
        // Simpson weights
        double sum = f(xs[0]) + f(xs[N]);
        for (int i = 1; i < N; ++i) sum += (i % 2 ? 4.0 : 2.0) * f(xs[i]);
        Z = sum * h / 3.0;
        // fill normalized pdf and cdf (trapezoidal accumulation)
        for (int i = 0; i <= N; ++i) pdf[i] = f(xs[i]) / Z;
        cdf[0] = 0.0;
        for (int i = 1; i <= N; ++i) {
            // trapezoid between xs[i-1] and xs[i]
            cdf[i] = cdf[i-1] + 0.5*(pdf[i-1] + pdf[i])*(xs[i] - xs[i-1]);
        }
        // normalize away any numerical drift
        for (int i = 0; i <= N; ++i) cdf[i] /= cdf[N];
    }

    template<class URNG>
    double sample(URNG &rng) const {
        uniform_real_distribution<double> U(0.0, 1.0);
        double u = U(rng);
        // binary search in cdf
        int lo = 0, hi = N;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        int i = max(1, lo);
        // linear interpolation within the bin
        double x0 = xs[i-1], x1 = xs[i];
        double c0 = cdf[i-1], c1 = cdf[i];
        double t = (u - c0) / max(1e-12, (c1 - c0));
        return x0 + t * (x1 - x0);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    Energy1D E;
    double L = 6.0;  // integration domain [-L, L]
    int N = 4000;    // grid resolution (even)

    double Z = simpson_partition(E, L, N);
    cout << fixed << setprecision(6);
    cout << "Approximate partition function Z = " << Z << " on [-" << L << ", " << L << "] with N=" << N << "\n";

    // Build sampler and draw samples
    GridSampler sampler(E, L, N);
    mt19937_64 rng(123);
    int M = 100000;
    double mean = 0.0, m2 = 0.0; int count = 0;
    for (int i = 0; i < M; ++i) {
        double x = sampler.sample(rng);
        ++count;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }
    double var = m2 / (count - 1);
    cout << "Samples: mean=" << mean << ", var=" << var << " (exact normalization via grid)\n";

    // Print a few pdf values
    cout << "p(0) approx = " << exp(-E(0.0))/Z << "\n";
    cout << "p(1) approx = " << exp(-E(1.0))/Z << "\n";
    cout << "p(2) approx = " << exp(-E(2.0))/Z << "\n";

    return 0;
}
```
This example illustrates the partition function and exact sampling in 1D. We numerically integrate exp(-E(x)) with Simpson's rule over [-L, L] to estimate Z and then construct a grid-based PDF/CDF. We sample by inverse transform (binary search + interpolation), which yields near-exact samples in 1D. This is educational for visualizing normalization and probabilities; it does not scale to high dimensions but reinforces the definition p(x) = exp(-E(x))/Z.