Energy-Based Models (EBM)
Key Points
- Energy-Based Models (EBMs) define probabilities through an energy landscape: low energy means high probability, with p(x) = exp(-E(x)) / Z.
- The partition function Z is a normalizing constant obtained by integrating exp(-E(x)) over all x, and it is usually intractable to compute exactly in high dimensions.
- Training EBMs by maximum likelihood leads to a positive phase (lowering energy on data) and a negative phase (raising energy on model samples).
- Sampling from EBMs typically requires Markov Chain Monte Carlo (MCMC) methods such as Langevin dynamics or Metropolis–Hastings.
- The score ∇_x log p(x) equals -∇_x E(x), which is useful for sampling and score-matching training objectives.
- Numerical normalization and exact sampling are feasible in 1D but become impractical in higher dimensions, where MCMC is preferred.
- Properly shaping the energy (e.g., ensuring it grows to +∞) is critical so that Z is finite and the model defines a valid probability distribution.
- In C++, you can implement toy EBMs, train them with contrastive divergence, and sample using short-run Langevin dynamics to visualize learned distributions.
Prerequisites
- Probability density and distributions — Understanding densities, normalization, and expectations is essential for interpreting p(x) and Z.
- Multivariable calculus — Gradients with respect to parameters and inputs (∇_θE and ∇_xE) are central to training and sampling.
- Markov Chain Monte Carlo (MCMC) basics — Sampling from EBMs typically uses Langevin dynamics or Metropolis–Hastings.
- Optimization and stochastic gradient descent — Maximum likelihood training of EBMs relies on iterative gradient-based updates.
- Numerical integration — Approximating Z in low dimensions uses quadrature such as Simpson's rule.
- C++ programming and random number generation — Implementing EBMs requires careful use of RNGs, loops, and numeric stability.
- Linear algebra (optional for higher dimensions) — Vector gradients and energies like quadratic forms appear in multivariate EBMs.
- Basic statistics — Evaluating sample quality with means, variances, and histograms helps diagnose models.
Detailed Explanation
01 Overview
Energy-Based Models (EBMs) are probabilistic models that assign an energy score to each configuration x. Instead of directly modeling the probability density p(x), EBMs define an unnormalized density using an energy function E(x; θ) so that states with low energy are more likely. The normalized probability is obtained by exponentiating the negative energy and dividing by a normalizing constant, the partition function Z(θ), which sums or integrates over all possible states. This framework unifies many models from statistical physics and machine learning: the lower the energy, the more probable the state. EBMs are appealing because you can specify the structure of the energy function without needing to express Z explicitly. However, computing Z or its gradient is generally intractable in high dimensions, so EBMs rely on approximate training and sampling techniques such as MCMC (e.g., Langevin dynamics) and contrastive divergence. EBMs can represent complicated, multi-modal distributions and can be extended to conditional settings (energies over pairs (x, y)). They are used for generative modeling, out-of-distribution detection, and structured prediction. The core trade-off is flexibility versus computational cost: you gain expressive power at the price of more challenging normalization and sampling.
02 Intuition & Analogies
Imagine a mountain landscape where each location x corresponds to a possible data point. The elevation at x is the energy E(x): deep valleys correspond to likely data (low energy), and steep mountains correspond to unlikely data (high energy). If you dropped a marble at a random spot and let gravity plus a little random shaking move it around, it would spend most of its time in valleys. That behavior mirrors sampling from p(x) ∝ exp(-E(x)): low-energy regions get visited more because they have higher probability. The partition function Z is like the total "volume" of all the valleys after exponentiating the negative elevations; it ensures probabilities sum (or integrate) to 1. In low dimensions, you could carefully measure this total volume by gridding the terrain, but in high dimensions the terrain is too vast to scan exactly. Instead, you explore it with guided random walks (MCMC), nudging the marble downhill (following -∇E) while adding some noise so it doesn't get stuck. Training an EBM is like reshaping the terrain: you press down to deepen valleys where your real data lives (positive phase) and push up to raise terrain where your model tends to wander but where real data is scarce (negative phase). If you press uniformly everywhere, probabilities don't change because Z adjusts. If the terrain doesn't rise to infinity as you move far away, the total volume becomes infinite and probabilities can't be normalized, so a good E(x) must go to +∞ at the edges. EBMs thus boil down to designing a meaningful landscape and learning to sculpt it using samples.
03 Formal Definition
An EBM specifies a real-valued energy function E(x; θ) over states x with parameters θ and defines the density p(x; θ) = exp(-E(x; θ)) / Z(θ), where the partition function Z(θ) = ∫ exp(-E(x; θ)) dx (a sum in the discrete case) normalizes the density. The model is valid only when Z(θ) is finite, which requires exp(-E) to be integrable; in practice this means designing E to grow without bound as ‖x‖ → ∞. Maximum likelihood training maximizes log p(x; θ) = -E(x; θ) - log Z(θ) averaged over the data.
04 When to Use
Use EBMs when you want flexible generative models that can represent complex, multi-modal distributions without committing to a tractable normalizer. They are useful when scoring configurations is easier than normalizing, such as in image modeling where a neural network can produce an energy for a pixel array. EBMs are appropriate when you can afford approximate inference via MCMC and when obtaining exact likelihoods is less critical than generating plausible samples or ranking inputs by energy. They shine in out-of-distribution detection: energies can act as anomaly scores, with higher energies indicating atypical inputs. EBMs also work well for structured prediction by defining energies over pairs (x, y), letting you perform prediction by searching for y that minimizes E(x, y). In low dimensions (e.g., 1D), EBMs allow exact or high-accuracy normalization by numerical integration, which is educational for understanding Z. Choose EBMs if you can design an energy landscape that encodes prior knowledge and if your application can tolerate the computational cost of sampling during training and inference. Avoid EBMs when you need exact, fast likelihoods in high dimensions or when MCMC mixing is prohibitively slow.
⚠️ Common Mistakes
- Ignoring normalizability: Choosing E(x) that does not grow to +∞ causes Z = ∞ and invalid probabilities. Include stabilizing terms (e.g., λ|x|^4) or constrain parameters so energy rises at infinity.
- Sign errors in gradients: The MLE gradient has positive and negative phases. Confusing signs leads to pushing probability mass in the wrong direction. Double-check: ∇_θ NLL = E_data[∇_θE] - E_{p_θ}[∇_θE].
- Poor MCMC settings: Step sizes too large explode; too small under-mix. Always include the noise term in Langevin (√(2ε)·ξ) and use burn-in. Monitor acceptance if using Metropolis–Hastings.
- Collapsing parameterizations: Including a free constant offset c in E can drift because it cancels in p(x). Fix the offset or regularize to prevent ill-conditioning.
- Estimating Z naively in high dimensions: Simple Monte Carlo with poor proposals has huge variance. Prefer annealed importance sampling or avoid Z by score matching or contrastive divergence.
- Short-run CD misuse: Very few MCMC steps can bias learning; use persistent chains or increase steps gradually.
- Evaluating only training loss: Also check sample quality, mixing diagnostics, and energy histograms for data vs. model samples.
- Overfitting energy wells: Strongly overfitted energies create spiky distributions that are hard to sample. Use regularization and noise robustness tests.
Key Formulas
EBM density
p(x) = exp(-E(x)) / Z
Explanation: The probability of x is the exponential of negative energy divided by the partition function. Lower energy implies higher probability after normalization.
Partition function (continuous)
Z = ∫ exp(-E(x)) dx
Explanation: The partition function is the integral of the unnormalized density over all x. It ensures probabilities integrate to one.
Log-likelihood
log p(x; θ) = -E(x; θ) - log Z(θ)
Explanation: Maximum likelihood increases probability on data by lowering their energies and accounting for the normalization through log Z.
Positive/negative phase gradient
∇_θ log p(x) = -∇_θE(x) + E_{p_θ}[∇_θE(x)]
Explanation: The gradient splits into a term that encourages low energy on data and a term that discourages low energy where the model places mass.
NLL gradient
∇_θ NLL = E_data[∇_θE(x)] - E_{p_θ}[∇_θE(x)]
Explanation: When minimizing negative log-likelihood, the sign flips: descend by reducing data energy and increasing model-sample energy.
Overdamped Langevin update
x_{t+1} = x_t - ε∇_xE(x_t) + √(2ε) ξ_t, with ξ_t ~ N(0, I)
Explanation: A sampling step that moves downhill in energy while adding Gaussian noise for exploration. The stationary distribution is the EBM density.
Metropolis–Hastings acceptance
α(x → x') = min(1, [exp(-E(x')) q(x | x')] / [exp(-E(x)) q(x' | x)])
Explanation: A new state x' is accepted with this probability to maintain detailed balance. If proposals are symmetric, the q terms cancel.
Score of an EBM
∇_x log p(x) = -∇_xE(x)
Explanation: The gradient of log-density with respect to x is just the negative energy gradient because Z does not depend on x.
Score matching objective
J(θ) = E_{p_data}[ (1/2)‖∇_xE(x)‖² - tr(∇²_xE(x)) ] + const
Explanation: Score matching fits the model's score to the data's score without computing Z. It replaces log-density derivatives with those of the energy.
Importance sampling for Z
Z = E_{x~q}[exp(-E(x)) / q(x)] ≈ (1/M) Σ_{i=1}^{M} exp(-E(x_i)) / q(x_i), with x_i ~ q
Explanation: The partition function can be estimated by sampling from a proposal q(x) and averaging weighted terms. Good proposals are critical for low variance.
KL divergence
KL(p_data ‖ p_θ) = E_{p_data}[log p_data(x)] - E_{p_data}[log p_θ(x)]
Explanation: Maximum likelihood minimizes this divergence, aligning the model with the data distribution. The constant term E_{p_data}[log p_data(x)] does not affect optimization.
Complexity Analysis
With PCD training, each minibatch of size B costs O(B·K) energy and gradient evaluations for K Langevin steps per persistent chain, so one epoch over N data points costs O(N·K). Sampling after training costs O(T) gradient evaluations for a chain of length T (burn-in plus thinned draws). In 1D, Simpson's rule estimates Z with O(N_grid) energy evaluations and O(h^4) error, and grid-based inverse-transform sampling costs O(N_grid) setup plus O(log N_grid) per draw via binary search; this grid approach scales exponentially with dimension, which is why MCMC is preferred beyond low dimensions.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Toy 1D EBM with energy E(x) = a*x^2 + b*x + lambda*x^4, where lambda = exp(gamma) > 0
struct EBM1D {
    double a;      // quadratic coefficient
    double b;      // linear coefficient
    double gamma;  // log of lambda to keep lambda > 0

    EBM1D(double a_ = 0.1, double b_ = 0.0, double gamma_ = -1.0) : a(a_), b(b_), gamma(gamma_) {}

    inline double lambda() const { return exp(gamma); }

    // Energy function E(x)
    inline double energy(double x) const {
        double l = lambda();
        return a*x*x + b*x + l*pow(x, 4);
    }

    // dE/dx used in Langevin dynamics
    inline double dEdx(double x) const {
        double l = lambda();
        return 2.0*a*x + b + 4.0*l*pow(x, 3);
    }

    // Gradients of E wrt parameters at a single x
    // dE/da = x^2, dE/db = x, dE/dgamma = dE/dlambda * dlambda/dgamma = x^4 * lambda
    inline array<double,3> grad_params(double x) const {
        double l = lambda();
        return {x*x, x, l*pow(x, 4)};
    }
};

// Gaussian RNG helper
struct RNG {
    mt19937_64 gen;
    normal_distribution<double> gauss;
    uniform_real_distribution<double> uni;
    RNG(uint64_t seed = 42) : gen(seed), gauss(0.0, 1.0), uni(0.0, 1.0) {}
    inline double randn() { return gauss(gen); }
    inline double randu() { return uni(gen); }
};

// One step of overdamped Langevin dynamics: x <- x - eps * dE/dx + sqrt(2*eps) * N(0,1)
inline double langevin_step(const EBM1D& model, double x, double eps, RNG &rng) {
    double noise = sqrt(2.0*eps) * rng.randn();
    return x - eps * model.dEdx(x) + noise;
}

// Generate synthetic data: mixture of two Gaussians
vector<double> make_data(size_t N, RNG &rng) {
    vector<double> data;
    data.reserve(N);
    for (size_t i = 0; i < N; ++i) {
        // 50/50 mixture of N(-2, 0.5^2) and N(2, 0.5^2)
        bool left = (rng.randu() < 0.5);
        double mean = left ? -2.0 : 2.0;
        double stdv = 0.5;
        data.push_back(mean + stdv * rng.randn());
    }
    return data;
}

// Train with Persistent Contrastive Divergence (PCD)
void train_ebm(EBM1D &model, const vector<double> &data, int epochs, int batch_size,
               int K, double eps, double lr, RNG &rng) {
    // initialize a pool of negative samples (persistent chains)
    vector<double> neg_pool(batch_size);
    for (int i = 0; i < batch_size; ++i) neg_pool[i] = rng.randn();

    size_t N = data.size();
    vector<int> indices(N);
    iota(indices.begin(), indices.end(), 0);

    for (int epoch = 1; epoch <= epochs; ++epoch) {
        shuffle(indices.begin(), indices.end(), rng.gen);
        double loss_proxy = 0.0;  // average E(data) - E(model_sample)

        for (size_t start = 0; start < N; start += batch_size) {
            size_t end = min(start + (size_t)batch_size, N);
            int bs = (int)(end - start);

            // Positive phase: gradients on data
            array<double,3> pos = {0.0, 0.0, 0.0};
            for (int i = 0; i < bs; ++i) {
                double x = data[indices[start + i]];
                auto g = model.grad_params(x);
                pos[0] += g[0]; pos[1] += g[1]; pos[2] += g[2];
                loss_proxy += model.energy(x);
            }
            for (int j = 0; j < 3; ++j) pos[j] /= bs;

            // Negative phase: update persistent chains and compute gradients
            array<double,3> neg = {0.0, 0.0, 0.0};
            for (int i = 0; i < bs; ++i) {
                double x = neg_pool[i];
                for (int t = 0; t < K; ++t) x = langevin_step(model, x, eps, rng);
                neg_pool[i] = x;  // persist
                auto g = model.grad_params(x);
                neg[0] += g[0]; neg[1] += g[1]; neg[2] += g[2];
                loss_proxy -= model.energy(x);
            }
            for (int j = 0; j < 3; ++j) neg[j] /= bs;

            // Gradient of NLL: pos - neg
            model.a     -= lr * (pos[0] - neg[0]);
            model.b     -= lr * (pos[1] - neg[1]);
            model.gamma -= lr * (pos[2] - neg[2]);  // keeps lambda = exp(gamma) > 0
        }

        if (epoch % 10 == 0) {
            cout << "Epoch " << epoch
                 << " | a=" << model.a
                 << " b=" << model.b
                 << " lambda=" << model.lambda()
                 << " | loss_proxy=" << (loss_proxy / N) << "\n";
        }
    }
}

// Sample from the trained model via Langevin and report mean/variance
pair<double,double> sample_stats(const EBM1D &model, int num_samples, int burn_in, int thinning, RNG &rng) {
    double x = rng.randn();
    // burn-in
    double eps = 0.01;
    for (int i = 0; i < burn_in; ++i) x = langevin_step(model, x, eps, rng);

    double mean = 0.0, m2 = 0.0; int count = 0;
    for (int i = 0; i < num_samples * thinning; ++i) {
        x = langevin_step(model, x, eps, rng);
        if ((i + 1) % thinning == 0) {
            ++count;
            double delta = x - mean;
            mean += delta / count;
            m2 += delta * (x - mean);
        }
    }
    double var = (count > 1) ? (m2 / (count - 1)) : 0.0;
    return {mean, var};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    RNG rng(123);

    // 1) Create synthetic dataset
    auto data = make_data(5000, rng);

    // 2) Initialize model
    EBM1D model(0.05, 0.0, -0.5);  // a, b, gamma (lambda = exp(gamma))

    // 3) Train with PCD + Langevin
    int epochs = 100;
    int batch_size = 256;
    int K = 30;        // Langevin steps per update
    double eps = 0.01; // Langevin step size
    double lr = 1e-3;  // learning rate

    train_ebm(model, data, epochs, batch_size, K, eps, lr, rng);

    // 4) Sample from trained model and print stats
    auto [mean, var] = sample_stats(model, 5000, /*burn_in=*/2000, /*thinning=*/5, rng);

    // Compute empirical data stats for comparison
    double dmean = 0.0, dvar = 0.0;
    for (double x : data) dmean += x;
    dmean /= data.size();
    for (double x : data) dvar += (x - dmean) * (x - dmean);
    dvar /= (double)(data.size() - 1);

    cout << fixed << setprecision(4);
    cout << "Trained params: a=" << model.a << ", b=" << model.b << ", lambda=" << model.lambda() << "\n";
    cout << "Data mean=" << dmean << ", var=" << dvar << "\n";
    cout << "Model samples mean=" << mean << ", var=" << var << "\n";
    return 0;
}
```
This program trains a simple 1D EBM with a stabilizing quartic term so that the model is normalizable. It uses Persistent Contrastive Divergence (PCD): a pool of negative samples is updated for K Langevin steps per iteration to approximate the model expectation. The gradient of the negative log-likelihood is computed as E_data[∂E/∂θ] - E_model[∂E/∂θ], and parameters are updated with SGD. After training, we draw Langevin samples and report mean and variance to compare with the synthetic data. The code highlights the positive/negative phase split, Langevin sampling, and the importance of making the energy grow at infinity via λx^4.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Define a normalizable 1D energy function, e.g., E(x) = 0.5*x^2 + 0.1*x^4
struct Energy1D {
    inline double operator()(double x) const {
        return 0.5*x*x + 0.1*pow(x, 4);
    }
};

// Simpson's rule integration over [-L, L] with N even intervals
// Returns approximation of integral f(x) dx for f = exp(-E(x))
double simpson_partition(const Energy1D &E, double L, int N) {
    if (N % 2 == 1) ++N;  // make even
    double h = (2.0*L) / N;
    auto f = [&](double x){ return exp(-E(x)); };
    double sum = f(-L) + f(L);
    for (int i = 1; i < N; ++i) {
        double x = -L + i*h;
        sum += (i % 2 ? 4.0 : 2.0) * f(x);
    }
    return sum * h / 3.0;
}

// Build PDF and CDF on a grid and sample by inverse transform (binary search)
struct GridSampler {
    vector<double> xs, pdf, cdf;
    double L; int N; double Z;

    GridSampler(const Energy1D &E, double L_, int N_) : L(L_), N(N_) {
        if (N % 2 == 1) ++N;  // even for Simpson
        xs.resize(N+1); pdf.resize(N+1); cdf.resize(N+1);
        double h = (2.0*L) / N;
        for (int i = 0; i <= N; ++i) xs[i] = -L + i*h;
        // approximate Z by Simpson on this grid
        auto f = [&](double x){ return exp(-E(x)); };
        // Simpson weights
        double sum = f(xs[0]) + f(xs[N]);
        for (int i = 1; i < N; ++i) sum += (i % 2 ? 4.0 : 2.0) * f(xs[i]);
        Z = sum * h / 3.0;
        // fill normalized pdf and cdf (trapezoidal accumulation)
        for (int i = 0; i <= N; ++i) pdf[i] = f(xs[i]) / Z;
        cdf[0] = 0.0;
        for (int i = 1; i <= N; ++i) {
            // trapezoid between xs[i-1] and xs[i]
            cdf[i] = cdf[i-1] + 0.5*(pdf[i-1] + pdf[i])*(xs[i] - xs[i-1]);
        }
        // normalize away any numerical drift
        for (int i = 0; i <= N; ++i) cdf[i] /= cdf[N];
    }

    template<class URNG>
    double sample(URNG &rng) const {
        uniform_real_distribution<double> U(0.0, 1.0);
        double u = U(rng);
        // binary search in cdf
        int lo = 0, hi = N;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        int i = max(1, lo);
        // linear interpolation within the bin
        double x0 = xs[i-1], x1 = xs[i];
        double c0 = cdf[i-1], c1 = cdf[i];
        double t = (u - c0) / max(1e-12, (c1 - c0));
        return x0 + t * (x1 - x0);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    Energy1D E;
    double L = 6.0;  // integration domain [-L, L]
    int N = 4000;    // grid resolution (even)

    double Z = simpson_partition(E, L, N);
    cout << fixed << setprecision(6);
    cout << "Approximate partition function Z = " << Z << " on [-" << L << ", " << L << "] with N=" << N << "\n";

    // Build sampler and draw samples
    GridSampler sampler(E, L, N);
    mt19937_64 rng(123);
    int M = 100000;
    double mean = 0.0, m2 = 0.0; int count = 0;
    for (int i = 0; i < M; ++i) {
        double x = sampler.sample(rng);
        ++count;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }
    double var = m2 / (count - 1);
    cout << "Samples: mean=" << mean << ", var=" << var << " (exact normalization via grid)\n";

    // Print a few pdf values
    cout << "p(0) approx = " << exp(-E(0.0))/Z << "\n";
    cout << "p(1) approx = " << exp(-E(1.0))/Z << "\n";
    cout << "p(2) approx = " << exp(-E(2.0))/Z << "\n";

    return 0;
}
```
This example illustrates the partition function and exact sampling in 1D. We numerically integrate exp(-E(x)) with Simpson's rule over [-L, L] to estimate Z and then construct a grid-based PDF/CDF. We sample by inverse transform (binary search + interpolation), which yields near-exact samples in 1D. This is educational for visualizing normalization and probabilities; it does not scale to high dimensions but reinforces the definition p(x) = exp(-E(x))/Z.