
Optimal Transport Theory

Key Points

  • Optimal Transport (OT) formalizes the cheapest way to move one probability distribution into another given a cost to move mass.
  • The Monge problem looks for a deterministic map T, while the Kantorovich relaxation allows splitting mass via a coupling, making the problem convex and solvable.
  • The p-Wasserstein distance is the OT-based metric that measures how far apart two distributions are, and it is a true metric under mild conditions.
  • In one dimension, the Wasserstein distance can be computed efficiently from cumulative distribution functions, making it very fast.
  • Entropic regularization adds a small entropy term that makes the problem smooth and enables the fast Sinkhorn algorithm.
  • Sinkhorn scales two vectors u and v to match marginals and computes an approximate transport plan Π = diag(u) K diag(v).
  • OT is widely used in ML for domain adaptation, generative modeling (Wasserstein GANs), fairness, and dataset comparison.
  • Practical implementations must ensure valid probability vectors, a well-scaled cost matrix, and numerically stable computations (often log-domain).

Prerequisites

  • Probability distributions and expectations: OT compares probability measures and uses expectations in objectives.
  • Linear algebra (vectors, matrices, inner products): Discrete OT uses matrices for cost and transport plans, with constraints expressed via matrix-vector operations.
  • Convex optimization and linear programming basics: Kantorovich OT is a convex problem; understanding constraints, duality, and optimality conditions helps.
  • Numerical stability (log-sum-exp, normalization): Sinkhorn requires stable exponentials and careful handling of small/zero masses.
  • Metric spaces and norms: Wasserstein distances rely on ground metrics like Euclidean norms and their properties.
  • Algorithmic complexity: Choosing between LP, Sinkhorn, or 1D methods depends on time/space trade-offs.

Detailed Explanation


01 Overview

Optimal Transport (OT) studies how to transform one probability distribution into another in the cheapest possible way, where the cost reflects how difficult it is to move mass from one location to another. Imagine two piles of sand shaped differently; moving sand from the first shape into the second shape costs energy proportional to how far you move each grain. Mathematically, OT takes two distributions μ and ν over a space (like points in 1D, 2D, etc.) and a ground cost c(x, y) that quantifies the price to move a unit of mass from x to y. The original Monge problem asks for a transport map T that pushes μ onto ν while minimizing the total cost. Because Monge’s map can fail to exist or be hard to compute, the Kantorovich relaxation allows splitting mass and seeks a coupling π(x, y)—a joint distribution with marginals μ and ν—that minimizes expected cost. A central outcome is the Wasserstein distance W_p, which turns distributions into points in a geometric space where distances reflect how much work is needed to morph one distribution into another. This geometric viewpoint is powerful: W_p is a metric, behaves well under sampling, and captures differences in support and geometry better than, say, KL divergence when supports do not overlap. In practice, entropic regularization yields smooth, fast approximations via the Sinkhorn algorithm, which has become a standard tool for large-scale applications.

02 Intuition & Analogies

Think of two skyline outlines made of stacked boxes: one is the current skyline (source distribution), the other is the target skyline (destination distribution). Your job is to slide boxes horizontally to reshape source into target while paying per unit distance moved. If you are forced to move each box as a whole to a single target box (no splitting), that’s like the Monge setting: each piece travels to exactly one destination. Real life is messier—sometimes you split a box to fill multiple gaps—which is precisely what the Kantorovich formulation allows: mass can be split across multiple destinations so the total cost is minimized.

Comparing distributions often fails with traditional divergences when they don’t overlap—like two non-overlapping skylines. KL or JS divergence can saturate or become infinite because they compare heights where one skyline is zero. OT, by contrast, asks: how far, physically, do I need to shift mass to make them match? If two skylines are identical but shifted by 1 unit, OT says the distance is about that 1 unit, which matches our geometric intuition.
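
As a concrete check of this intuition, take two unit point masses μ = δ_0 and ν = δ_s. Their supports are disjoint for any s ≠ 0, so KL(μ ‖ ν) is infinite, yet the only transport plan moves all mass a distance |s|, giving \( W_1(\delta_0, \delta_s) = |s| \); for s = 1 this is exactly the shifted-skyline case above.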

Entropic regularization adds a small amount of “fuzziness” so instead of a brittle, sharp plan of where each unit goes, you get a slightly diffused plan. This fuzziness makes the computation much faster because the problem becomes strongly convex and amenable to matrix scaling. Sinkhorn’s algorithm repeatedly normalizes rows and columns of a Gibbs kernel, like adjusting knobs until both source and target totals are right. The end result is a near-optimal plan that respects the geometry while being computationally tractable for large problems.

03 Formal Definition

Let (X, d) be a Polish metric space (complete, separable) and let μ and ν be probability measures on X with finite p-th moments. Given a cost c(x, y) = d(x, y)^p, the Monge problem seeks a measurable map T: X → X that pushes μ to ν (T_{#}μ = ν) and minimizes the transport cost \( \int_X c(x, T(x)) \, d\mu(x) \). This problem may be ill-posed (no feasible T) or non-convex. Kantorovich relaxes the problem to couplings: find a joint probability π on X × X with marginals μ and ν (π ∈ Π(μ, ν)) that minimizes \( \int_{X \times X} c(x, y) \, d\pi(x, y) \). This is a linear program in the discrete case and a convex optimization problem generally. The p-Wasserstein distance is then defined by \( W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int d(x, y)^p \, d\pi(x, y) \right)^{1/p} \). W_p is a metric on the space of probability measures with finite p-th moment. For discrete histograms a ∈ Δ^n and b ∈ Δ^m on supports {x_i} and {y_j}, with cost matrix C_{ij} = d(x_i, y_j)^p, the Kantorovich problem is: minimize ⟨C, Π⟩ over nonnegative matrices Π such that Π 1 = a and Π^⊤ 1 = b. Entropic regularization adds ε ∑_{ij} Π_{ij}(log Π_{ij} − 1), yielding a strictly convex problem whose solution has the form Π* = diag(u) K diag(v) with K_{ij} = exp(−C_{ij}/ε). Sinkhorn iterations u ← a ⊘ (K v), v ← b ⊘ (K^⊤ u) compute u and v.
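
To make the scaling form concrete, here is a minimal, unstabilized Sinkhorn sketch for a tiny problem; the histograms, cost matrix, and ε are illustrative placeholders. It applies exactly the updates u ← a ⊘ (K v), v ← b ⊘ (K^⊤ u) and then reads the plan off as Π = diag(u) K diag(v). For anything beyond toy sizes or small ε, prefer the log-domain implementation in the Code Examples section.

#include <bits/stdc++.h>
using namespace std;

// Minimal (non-stabilized) Sinkhorn on a toy 2x2 problem.
// Works only when exp(-C/eps) stays in a representable range.
int main() {
  vector<double> a = {0.5, 0.5};           // source histogram (sums to 1)
  vector<double> b = {0.25, 0.75};         // target histogram (sums to 1)
  vector<vector<double>> C = {{0.0, 1.0},  // illustrative cost matrix C_ij
                              {1.0, 0.0}};
  double eps = 0.1;
  int n = (int)a.size(), m = (int)b.size();

  // Gibbs kernel K_ij = exp(-C_ij / eps)
  vector<vector<double>> K(n, vector<double>(m));
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j) K[i][j] = exp(-C[i][j] / eps);

  // Alternate u_i = a_i / (K v)_i and v_j = b_j / (K^T u)_j
  vector<double> u(n, 1.0), v(m, 1.0);
  for (int it = 0; it < 1000; ++it) {
    for (int i = 0; i < n; ++i) {
      double s = 0.0;
      for (int j = 0; j < m; ++j) s += K[i][j] * v[j];
      u[i] = a[i] / s;
    }
    for (int j = 0; j < m; ++j) {
      double s = 0.0;
      for (int i = 0; i < n; ++i) s += K[i][j] * u[i];
      v[j] = b[j] / s;
    }
  }

  // Pi_ij = u_i K_ij v_j; row sums approximate a, column sums approximate b
  double cost = 0.0;
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j) cost += u[i] * K[i][j] * v[j] * C[i][j];
  cout << "approximate transport cost <C,Pi>: " << cost << "\n";
  return 0;
}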

04 When to Use

Use Optimal Transport when geometry matters—when points lying near each other should not be penalized as heavily as points far apart. This is especially valuable when supports do not overlap (e.g., two point clouds shifted apart), where divergences like KL may fail or be infinite. Specific scenarios include:

  • Generative modeling: Wasserstein GANs use a dual form of W_1 to stabilize training by providing meaningful gradients even when supports are disjoint.
  • Domain adaptation: Align feature distributions across domains by minimizing a Wasserstein distance to reduce covariate shift.
  • Dataset comparison: Quantify how different datasets are, taking spatial structure into account (e.g., comparing images, word embeddings, or sensor readings).
  • Fairness and recourse: Match score distributions across demographic groups using OT to assess and reduce disparities while preserving structure.
  • Clustering and barycenters: Compute Wasserstein barycenters to find a central prototype distribution of multiple datasets.
  • 1D signals and histograms: When data lies on a line (e.g., grayscale intensities), W_1 is very fast via cumulative sums.
  • Large-scale approximate matching: Entropic regularization plus Sinkhorn enables near-linear algebra style routines that scale to tens or hundreds of thousands of points, especially with structure (e.g., convolutional kernels).

⚠️ Common Mistakes

  • Not normalizing histograms: OT requires equal total mass. Forgetting to normalize a and b (or to handle unmatched mass) leads to infeasible constraints or misleading results.
  • Cost scaling issues: If C has very large values and ε is small, exp(−C/ε) underflows to zero, breaking Sinkhorn. Rescale features or costs (e.g., divide distances by a characteristic scale) and choose ε consistent with cost magnitudes; a minimal sketch of this appears after this list.
  • Using ε too small: Pursuing almost-unregularized solutions with tiny ε can cause numerical instability and slow convergence. Prefer a moderate ε first; if needed, use ε-scaling (start large, then decrease) with log-domain stabilization.
  • Ignoring marginal errors: Stopping Sinkhorn without checking that row/column sums match a, b within tolerance yields infeasible couplings. Monitor marginal residuals and set a clear tolerance.
  • Confusing assignment with OT: The Hungarian algorithm solves one-to-one matching; general OT allows splitting mass and different histogram sizes. Use linear programming or Sinkhorn for general couplings.
  • Misinterpreting WGAN dual: The critic must be 1-Lipschitz. Without proper constraints (e.g., gradient penalty), the learned distance is not a true Wasserstein distance.
  • Overlooking 1D shortcuts: For 1D data, computing W_1 via CDFs is O(n log n), far faster and more accurate than generic LP or Sinkhorn—don’t miss this optimization.
  • Precision pitfalls: Summing many tiny Π_{ij} terms can accumulate error. Use double precision, stable reductions, and optionally compute in the log-domain where feasible.
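
The two scaling pitfalls above can usually be avoided with a few preparatory lines. The sketch below divides the cost matrix by a characteristic scale (the mean cost here, an arbitrary but common choice) and then walks an ε-scaling schedule down to a target value; the numbers and the schedule are illustrative, and the actual solver call is left as a placeholder (for example, the sinkhorn_logdomain routine from the Code Examples section).

#include <bits/stdc++.h>
using namespace std;

int main() {
  // Illustrative cost matrix with large raw values (placeholders, not real data)
  vector<vector<double>> C = {{0.0, 400.0, 1600.0},
                              {400.0, 0.0, 400.0},
                              {1600.0, 400.0, 0.0}};

  // Characteristic scale: mean cost (median or max are also common choices)
  double scale = 0.0;
  int cnt = 0;
  for (auto& row : C) for (double c : row) { scale += c; ++cnt; }
  scale /= cnt;

  // Rescale costs so exp(-C/eps) stays in a representable range
  for (auto& row : C) for (double& c : row) c /= scale;

  // Epsilon-scaling (continuation): start large, shrink toward the target,
  // warm-starting each run from the previous scaling vectors.
  double eps_target = 0.05; // relative to the rescaled costs
  vector<double> schedule = {1.0, 0.5, 0.2, 0.1, eps_target};
  for (double eps : schedule) {
    // Here one would run a (log-domain) Sinkhorn solver at this eps.
    cout << "run Sinkhorn at eps = " << eps << "\n";
  }
  return 0;
}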

Key Formulas

Monge Problem

\( \inf_{T \,:\, T_{\#}\mu = \nu} \int_X c(x, T(x)) \, d\mu(x) \)

Explanation: Find a measurable map T that pushes μ to ν with minimum expected cost. This is often hard or ill-posed because it forbids splitting mass.

Kantorovich Relaxation

\( \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} c(x, y) \, d\pi(x, y) \)

Explanation: Optimize over couplings π with fixed marginals. This convex relaxation always admits solutions and is the standard practical formulation.

p-Wasserstein Distance

\( W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} d(x, y)^p \, d\pi(x, y) \right)^{1/p} \)

Explanation: Defines a metric between probability measures with finite p-th moments. It measures the minimum effort to transform one distribution into the other.

Discrete OT Linear Program

\( \min_{\Pi \ge 0} \ \langle C, \Pi \rangle \quad \text{s.t.} \quad \Pi \mathbf{1}_m = a, \ \Pi^{\top} \mathbf{1}_n = b \)

Explanation: For histograms a and b with cost matrix C, choose nonnegative Π to match row/column sums while minimizing total cost. Angle brackets denote the Frobenius inner product.

Kantorovich–Rubinstein Duality

\( W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \ \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)] \)

Explanation: The 1-Wasserstein distance equals the largest expected difference over all 1-Lipschitz functions. This is the basis of WGAN critics.

1D W_1 via CDFs

\( W_1(\mu, \nu) = \int_{\mathbb{R}} \left| F_{\mu}(t) - F_{\nu}(t) \right| \, dt \)

Explanation: In one dimension, the Earth Mover’s Distance equals the area between CDFs. This enables fast O(n log n) algorithms.

Entropic Regularized OT

\( \min_{\Pi \ge 0,\ \Pi \mathbf{1} = a,\ \Pi^{\top} \mathbf{1} = b} \ \langle C, \Pi \rangle + \varepsilon \sum_{ij} \Pi_{ij} (\log \Pi_{ij} - 1) \)

Explanation: Adds a negative-entropy term weighted by ε to smooth the optimization and improve computational speed and stability.

Sinkhorn Form

\( \Pi^{*} = \mathrm{diag}(u) \, K \, \mathrm{diag}(v), \qquad K_{ij} = \exp(-C_{ij} / \varepsilon) \)

Explanation: The optimal entropic plan factors through left/right scalings of a Gibbs kernel. The vectors u and v are chosen so that row/column sums match a and b.

Sinkhorn Updates

\( u \leftarrow a \oslash (K v), \qquad v \leftarrow b \oslash (K^{\top} u) \)

Explanation: Elementwise divisions update scaling vectors to enforce the marginal constraints. Each iteration costs a pair of matrix-vector products.

Triangle Inequality

\( W_p(\mu, \nu) \le W_p(\mu, \rho) + W_p(\rho, \nu) \)

Explanation: W_p is a true metric; it satisfies non-negativity, symmetry, identity of indiscernibles, and the triangle inequality.

W_p Metrizes Weak Convergence (with moments)

\( W_p(\mu_n, \mu) \to 0 \iff \mu_n \rightharpoonup \mu \ \text{and} \ \int \|x\|^p \, d\mu_n(x) \to \int \|x\|^p \, d\mu(x) \)

Explanation: Convergence in W_p implies weak convergence plus convergence of p-th moments. This makes W_p statistically meaningful for empirical measures.

Sinkhorn Complexity

\( O(nm) \ \text{per iteration}, \qquad O(nmI) \ \text{total for } I \ \text{iterations} \)

Explanation: Each Sinkhorn iteration requires two dense matrix-vector products. With I iterations, runtime grows linearly in I and the size of the cost matrix.

Complexity Analysis

Discrete OT as a linear program over an n×m transport matrix has worst-case polynomial complexity (e.g., interior-point methods with high constants) and memory O(nm). The assignment special case (n = m with one-to-one matching) can be solved in O(n^3) by the Hungarian algorithm. In contrast, the entropically regularized problem is solved via Sinkhorn iterations with per-iteration cost dominated by two matrix-vector products: O(n m) time and O(n m) memory if K is dense. With I iterations to reach tolerance, the overall time is O(n m I). In practice, I scales sublinearly with 1/ε for well-scaled problems; ε-scaling (continuation) can further reduce total iterations. If the cost is structured (e.g., a grid with separable or convolutional costs), K-vector products can be accelerated via FFTs, reducing per-iteration time to near-linear.

Numerical stability is critical. For small ε or large costs, storing K = exp(−C/ε) can underflow; log-domain implementations compute log-sum-exp reductions at O(n m) per iteration with similar constants but greatly improved stability. Space usage is O(n m) to store C (and optionally logK); forming Π explicitly also costs O(n m). If only the cost value is needed, one can avoid materializing Π and compute ∑ Π_{ij} C_{ij} on-the-fly (see the sketch below).

In 1D, W_1 can be computed in O((n + m) log(n + m)) time (for sorting) and O(1) extra space beyond the input, by sweeping over the merged supports and integrating the absolute difference of CDFs. This is orders of magnitude faster and more memory-efficient than generic OT solvers on the line.
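
As a small illustration of the last point about memory, the sketch below accumulates ⟨C, Π⟩ directly from the log-scaling vectors and logK = −C/ε without ever storing the n × m plan; the vectors here are placeholder values standing in for the output of a converged Sinkhorn run.

#include <bits/stdc++.h>
using namespace std;

int main() {
  // Placeholder log-scalings (in practice, taken from a converged Sinkhorn run)
  vector<double> logu = {-0.3, -0.1};
  vector<double> logv = {-0.2, -0.4};
  vector<vector<double>> C = {{0.0, 1.0}, {1.0, 0.0}};
  double eps = 0.1;

  int n = (int)logu.size(), m = (int)logv.size();
  double cost = 0.0;
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      double logP = logu[i] + (-C[i][j] / eps) + logv[j]; // log Pi_ij
      cost += exp(logP) * C[i][j];                        // stream the sum, O(1) extra memory
    }
  }
  cout << "streamed <C,Pi> = " << cost << "\n";
  return 0;
}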

Code Examples

Sinkhorn Algorithm (Log-Stabilized) for Entropic OT between Two Histograms
#include <bits/stdc++.h>
using namespace std;

// Compute log-sum-exp of a vector of terms safely
static double logsumexp(const vector<double>& v) {
  double m = -numeric_limits<double>::infinity();
  for (double x : v) m = max(m, x);
  if (!isfinite(m)) return m; // all -inf
  double s = 0.0;
  for (double x : v) s += exp(x - m);
  return m + log(s);
}

struct SinkhornResult {
  vector<vector<double>> plan; // Transport plan Π (n x m)
  double reg_cost;             // Regularized cost: <C,Π> + ε * sum Π_{ij}(log Π_{ij} - 1)
  double unreg_cost;           // Transport cost <C,Π>
  double row_violation;        // ||Π1 - a||_1
  double col_violation;        // ||Π^T1 - b||_1
  int iters;                   // Iterations performed
};

// Entropic OT via log-domain Sinkhorn.
// a: size n, b: size m (must be nonnegative and sum to 1), C: n x m cost matrix (nonnegative), epsilon > 0
SinkhornResult sinkhorn_logdomain(const vector<double>& a_in,
                                  const vector<double>& b_in,
                                  const vector<vector<double>>& C,
                                  double epsilon = 0.05,
                                  int max_iters = 10000,
                                  double tol = 1e-9) {
  int n = (int)a_in.size();
  int m = (int)b_in.size();
  // Normalize a and b to sum to 1
  auto a = a_in, b = b_in;
  auto sumvec = [](const vector<double>& v){ double s=0; for(double x: v) s+=x; return s; };
  double sa = sumvec(a), sb = sumvec(b);
  if (sa <= 0 || sb <= 0) throw runtime_error("Input histograms must have positive total mass.");
  for (double& x : a) x /= sa;
  for (double& y : b) y /= sb;

  // Precompute log a and log b, handling zeros as -inf
  vector<double> loga(n), logb(m);
  for (int i = 0; i < n; ++i) loga[i] = (a[i] > 0) ? log(a[i]) : -numeric_limits<double>::infinity();
  for (int j = 0; j < m; ++j) logb[j] = (b[j] > 0) ? log(b[j]) : -numeric_limits<double>::infinity();

  // logK = -C/epsilon
  vector<vector<double>> logK(n, vector<double>(m));
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      logK[i][j] = -C[i][j] / epsilon;
    }
  }

  // Initialize log u and log v to zeros
  vector<double> logu(n, 0.0), logv(m, 0.0);

  int it = 0;
  for (; it < max_iters; ++it) {
    // Update logu: logu_i = log a_i - logsumexp_j (logK_ij + logv_j)
    vector<double> new_logu(n);
    for (int i = 0; i < n; ++i) {
      vector<double> tmp(m);
      for (int j = 0; j < m; ++j) tmp[j] = logK[i][j] + logv[j];
      double lse = logsumexp(tmp);
      new_logu[i] = loga[i] - lse;
    }
    logu.swap(new_logu);

    // Update logv: logv_j = log b_j - logsumexp_i (logK_ij + logu_i)
    vector<double> new_logv(m);
    for (int j = 0; j < m; ++j) {
      vector<double> tmp(n);
      for (int i = 0; i < n; ++i) tmp[i] = logK[i][j] + logu[i];
      double lse = logsumexp(tmp);
      new_logv[j] = logb[j] - lse;
    }
    logv.swap(new_logv);

    // Check marginal violations every few iterations (here each iteration for clarity)
    // Compute Π implicitly to get row/col sums:
    double row_err = 0.0, col_err = 0.0;
    vector<double> rowsum(n, 0.0), colsum(m, 0.0);
    for (int i = 0; i < n; ++i) {
      for (int j = 0; j < m; ++j) {
        double logP = logu[i] + logK[i][j] + logv[j];
        double P = exp(logP);
        rowsum[i] += P;
        colsum[j] += P;
      }
    }
    for (int i = 0; i < n; ++i) row_err += fabs(rowsum[i] - a[i]);
    for (int j = 0; j < m; ++j) col_err += fabs(colsum[j] - b[j]);
    if (row_err + col_err < tol) break;
  }

  // Build explicit plan and compute costs and violations
  vector<vector<double>> P(n, vector<double>(m));
  double cost = 0.0;
  double reg_term = 0.0; // sum Π_{ij}(log Π_{ij} - 1)
  vector<double> rowsum(n, 0.0), colsum(m, 0.0);
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      double logP = logu[i] + logK[i][j] + logv[j];
      double pij = exp(logP);
      P[i][j] = pij;
      cost += pij * C[i][j];
      if (pij > 0) reg_term += pij * (log(pij) - 1.0);
      rowsum[i] += pij;
      colsum[j] += pij;
    }
  }
  double row_err = 0.0, col_err = 0.0;
  for (int i = 0; i < n; ++i) row_err += fabs(rowsum[i] - a[i]);
  for (int j = 0; j < m; ++j) col_err += fabs(colsum[j] - b[j]);

  SinkhornResult res;
  res.plan = move(P);
  res.unreg_cost = cost;
  res.reg_cost = cost + epsilon * reg_term;
  res.row_violation = row_err;
  res.col_violation = col_err;
  res.iters = it + 1;
  return res;
}

int main() {
  // Example: two 1D histograms on three bins with Euclidean cost
  vector<double> a = {0.5, 0.4, 0.1};
  vector<double> b = {0.3, 0.3, 0.4};

  // Support points (e.g., positions on the line)
  vector<double> xa = {0.0, 1.0, 2.0};
  vector<double> xb = {0.0, 1.0, 2.0};

  int n = (int)a.size(), m = (int)b.size();
  vector<vector<double>> C(n, vector<double>(m));
  for (int i = 0; i < n; ++i) for (int j = 0; j < m; ++j) {
    double d = fabs(xa[i] - xb[j]);
    C[i][j] = d; // use p=1 cost (you could square it for W2^2)
  }

  double epsilon = 0.1;
  auto res = sinkhorn_logdomain(a, b, C, epsilon, 5000, 1e-10);

  cout << fixed << setprecision(6);
  cout << "Iterations: " << res.iters << "\n";
  cout << "Unregularized transport cost <C,Pi>: " << res.unreg_cost << "\n";
  cout << "Regularized objective value: " << res.reg_cost << "\n";
  cout << "Row L1 violation: " << res.row_violation << ", Col L1 violation: " << res.col_violation << "\n";

  cout << "Transport plan (Pi):\n";
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) cout << res.plan[i][j] << (j+1==m?'\n':' ');
  }

  return 0;
}

This program solves entropic-regularized OT between two discrete distributions a and b using a numerically stable, log-domain Sinkhorn algorithm. It precomputes logK = −C/ε, iteratively updates log-scaling vectors logu and logv using log-sum-exp reductions, and stops when the marginal constraints are met within tolerance. Finally, it materializes the transport plan Π = diag(u) K diag(v) (implicitly via logs), and reports the transport cost ⟨C, Π⟩ and the regularized objective value. The example uses a 1D grid with L1 ground cost, but any cost matrix can be used.

Time: O(n m I) per run, where I is the number of Sinkhorn iterations. Space: O(n m) to store C and logK, plus O(n + m) for scaling vectors.

Fast 1D Wasserstein-1 (EMD) via CDF Difference Integration
#include <bits/stdc++.h>
using namespace std;

// Compute W1(μ, ν) in 1D by integrating |F_μ - F_ν| over R.
// Inputs: points x with masses a, points y with masses b. Masses will be normalized to sum to 1.
double wasserstein1_1d(vector<double> x, vector<double> a, vector<double> y, vector<double> b) {
  int n = (int)x.size(), m = (int)y.size();
  if (n != (int)a.size() || m != (int)b.size()) throw runtime_error("Mismatched sizes");

  auto normalize = [](vector<double>& w){
    double s = 0.0; for (double v : w) s += v; if (s <= 0) throw runtime_error("Nonpositive mass");
    for (double& v : w) v /= s;
  };
  normalize(a); normalize(b);

  // Pack events: position, type (0 for μ jump, 1 for ν jump), weight
  struct Ev { double pos; int type; double w; };
  vector<Ev> evs; evs.reserve(n + m);
  for (int i = 0; i < n; ++i) evs.push_back({x[i], 0, a[i]});
  for (int j = 0; j < m; ++j) evs.push_back({y[j], 1, b[j]});

  sort(evs.begin(), evs.end(), [](const Ev& A, const Ev& B){
    if (A.pos != B.pos) return A.pos < B.pos;
    return A.type < B.type; // process μ before ν at ties (arbitrary but consistent)
  });

  // Sweep over positions, integrating |F_μ - F_ν| piecewise
  double Fmu = 0.0, Fnu = 0.0;
  double prev = evs.empty() ? 0.0 : evs.front().pos;
  double w1 = 0.0;
  size_t k = 0;
  while (k < evs.size()) {
    double xk = evs[k].pos;
    // integrate over [prev, xk)
    w1 += fabs(Fmu - Fnu) * (xk - prev);
    // apply all jumps at xk
    while (k < evs.size() && evs[k].pos == xk) {
      if (evs[k].type == 0) Fmu += evs[k].w; else Fnu += evs[k].w;
      ++k;
    }
    prev = xk;
  }
  // No more events; tails contribute zero because F_μ and F_ν both equal 1 for t beyond the last support

  return w1;
}

int main() {
  // Example: two point-mass distributions on the real line
  vector<double> x = {0.0, 1.0, 3.0};
  vector<double> a = {0.2, 0.5, 0.3};
  vector<double> y = {0.5, 2.0};
  vector<double> b = {0.6, 0.4};

  double w1 = wasserstein1_1d(x, a, y, b);
  cout << fixed << setprecision(6);
  cout << "W1 (1D EMD) between μ and ν: " << w1 << "\n";

  return 0;
}

This program computes the exact 1D Earth Mover’s Distance W_1 by integrating the absolute difference between the CDFs of μ and ν. It normalizes the masses, merges and sorts support points, sweeps from left to right, and accumulates |F_μ − F_ν| times the interval length. This is significantly faster and more accurate in 1D than generic OT solvers.

Time: O((n + m) log(n + m)) due to sorting. Space: O(n + m) additional space for merged events.