
Optimal Transport Theory

Key Points

  • Optimal Transport (OT) formalizes the cheapest way to move one probability distribution into another given a cost to move mass.
  • The Monge problem looks for a deterministic map T, while the Kantorovich relaxation allows splitting mass via a coupling, making the problem convex and solvable.
  • The p-Wasserstein distance is the OT-based metric that measures how far apart two distributions are, and it is a true metric under mild conditions.
  • In one dimension, the Wasserstein distance can be computed efficiently from cumulative distribution functions, making it very fast.
  • Entropic regularization adds a small entropy term that makes the problem smooth and enables the fast Sinkhorn algorithm.
  • Sinkhorn scales two vectors u and v to match marginals and computes an approximate transport plan Π = diag(u) K diag(v).
  • OT is widely used in ML for domain adaptation, generative modeling (Wasserstein GANs), fairness, and dataset comparison.
  • Practical implementations must ensure valid probability vectors, a well-scaled cost matrix, and numerically stable computations (often log-domain).

Prerequisites

  • Probability distributions and expectations: OT compares probability measures and uses expectations in objectives.
  • Linear algebra (vectors, matrices, inner products): Discrete OT uses matrices for cost and transport plans, with constraints expressed via matrix-vector operations.
  • Convex optimization and linear programming basics: Kantorovich OT is a convex problem; understanding constraints, duality, and optimality conditions helps.
  • Numerical stability (log-sum-exp, normalization): Sinkhorn requires stable exponentials and careful handling of small/zero masses.
  • Metric spaces and norms: Wasserstein distances rely on ground metrics like Euclidean norms and their properties.
  • Algorithmic complexity: Choosing between LP, Sinkhorn, or 1D methods depends on time/space trade-offs.

Detailed Explanation


01 Overview

Optimal Transport (OT) studies how to transform one probability distribution into another in the cheapest possible way, where the cost reflects how difficult it is to move mass from one location to another. Imagine two piles of sand shaped differently; moving sand from the first shape into the second shape costs energy proportional to how far you move each grain. Mathematically, OT takes two distributions μ and ν over a space (like points in 1D, 2D, etc.) and a ground cost c(x, y) that quantifies the price to move a unit of mass from x to y. The original Monge problem asks for a transport map T that pushes μ onto ν while minimizing the total cost. Because Monge’s map can fail to exist or be hard to compute, the Kantorovich relaxation allows splitting mass and seeks a coupling π(x, y)—a joint distribution with marginals μ and ν—that minimizes expected cost. A central outcome is the Wasserstein distance W_p, which turns distributions into points in a geometric space where distances reflect how much work is needed to morph one distribution into another. This geometric viewpoint is powerful: W_p is a metric, behaves well under sampling, and captures differences in support and geometry better than, say, KL divergence when supports do not overlap. In practice, entropic regularization yields smooth, fast approximations via the Sinkhorn algorithm, which has become a standard tool for large-scale applications.

02 Intuition & Analogies

Think of two skyline outlines made of stacked boxes: one is the current skyline (source distribution), the other is the target skyline (destination distribution). Your job is to slide boxes horizontally to reshape source into target while paying per unit distance moved. If you are forced to move each box as a whole to a single target box (no splitting), that’s like the Monge setting: each piece travels to exactly one destination. Real life is messier—sometimes you split a box to fill multiple gaps—which is precisely what the Kantorovich formulation allows: mass can be split across multiple destinations so the total cost is minimized.

Comparing distributions often fails with traditional divergences when they don’t overlap—like two non-overlapping skylines. KL or JS divergence can saturate or become infinite because they compare heights where one skyline is zero. OT, by contrast, asks: how far, physically, do I need to shift mass to make them match? If two skylines are identical but shifted by 1 unit, OT says the distance is about that 1 unit, which matches our geometric intuition.
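
As a concrete check of this intuition, take two unit point masses μ = δ_0 and ν = δ_s. Their supports are disjoint for any s ≠ 0, so KL(μ ‖ ν) is infinite, yet the only transport plan moves all mass a distance |s|, giving \( W_1(\delta_0, \delta_s) = |s| \); for s = 1 this is exactly the shifted-skyline case above.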

Entropic regularization adds a small amount of “fuzziness” so instead of a brittle, sharp plan of where each unit goes, you get a slightly diffused plan. This fuzziness makes the computation much faster because the problem becomes strongly convex and amenable to matrix scaling. Sinkhorn’s algorithm repeatedly normalizes rows and columns of a Gibbs kernel, like adjusting knobs until both source and target totals are right. The end result is a near-optimal plan that respects the geometry while being computationally tractable for large problems.

03 Formal Definition

Let (X, d) be a Polish metric space (complete, separable) and let μ and ν be probability measures on X with finite p-th moments. Given a cost c(x, y) = d(x, y)^p, the Monge problem seeks a measurable map T: X → X that pushes μ to ν (T_{#}μ = ν) and minimizes the transport cost \( \int_X c(x, T(x)) \, d\mu(x) \). This problem may be ill-posed (no feasible T) or non-convex. Kantorovich relaxes the problem to couplings: find a joint probability π on X × X with marginals μ and ν (π ∈ Π(μ, ν)) that minimizes \( \int_{X \times X} c(x, y) \, d\pi(x, y) \). This is a linear program in the discrete case and a convex optimization problem generally. The p-Wasserstein distance is then defined by \( W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int d(x, y)^p \, d\pi(x, y) \right)^{1/p} \). W_p is a metric on the space of probability measures with finite p-th moment. For discrete histograms a ∈ Δ^n and b ∈ Δ^m on supports {x_i} and {y_j}, with cost matrix C_{ij} = d(x_i, y_j)^p, the Kantorovich problem is: minimize ⟨C, Π⟩ over nonnegative matrices Π such that Π 1 = a and Π^⊤ 1 = b. Entropic regularization adds ε ∑_{ij} Π_{ij}(log Π_{ij} − 1), yielding a strictly convex problem whose solution has the form Π* = diag(u) K diag(v) with K_{ij} = exp(−C_{ij}/ε). Sinkhorn iterations u ← a ⊘ (K v), v ← b ⊘ (K^⊤ u) compute u and v.
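
To make the scaling form concrete, here is a minimal, unstabilized Sinkhorn sketch for a tiny problem; the histograms, cost matrix, and ε are illustrative placeholders. It applies exactly the updates u ← a ⊘ (K v), v ← b ⊘ (K^⊤ u) and then reads the plan off as Π = diag(u) K diag(v). For anything beyond toy sizes or small ε, prefer the log-domain implementation in the Code Examples section.

#include <bits/stdc++.h>
using namespace std;

// Minimal (non-stabilized) Sinkhorn on a toy 2x2 problem.
// Works only when exp(-C/eps) stays in a representable range.
int main() {
  vector<double> a = {0.5, 0.5};           // source histogram (sums to 1)
  vector<double> b = {0.25, 0.75};         // target histogram (sums to 1)
  vector<vector<double>> C = {{0.0, 1.0},  // illustrative cost matrix C_ij
                              {1.0, 0.0}};
  double eps = 0.1;
  int n = (int)a.size(), m = (int)b.size();

  // Gibbs kernel K_ij = exp(-C_ij / eps)
  vector<vector<double>> K(n, vector<double>(m));
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j) K[i][j] = exp(-C[i][j] / eps);

  // Alternate u_i = a_i / (K v)_i and v_j = b_j / (K^T u)_j
  vector<double> u(n, 1.0), v(m, 1.0);
  for (int it = 0; it < 1000; ++it) {
    for (int i = 0; i < n; ++i) {
      double s = 0.0;
      for (int j = 0; j < m; ++j) s += K[i][j] * v[j];
      u[i] = a[i] / s;
    }
    for (int j = 0; j < m; ++j) {
      double s = 0.0;
      for (int i = 0; i < n; ++i) s += K[i][j] * u[i];
      v[j] = b[j] / s;
    }
  }

  // Pi_ij = u_i K_ij v_j; row sums approximate a, column sums approximate b
  double cost = 0.0;
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j) cost += u[i] * K[i][j] * v[j] * C[i][j];
  cout << "approximate transport cost <C,Pi>: " << cost << "\n";
  return 0;
}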

04 When to Use

Use Optimal Transport when geometry matters—when points lying near each other should not be penalized as heavily as points far apart. This is especially valuable when supports do not overlap (e.g., two point clouds shifted apart), where divergences like KL may fail or be infinite. Specific scenarios include:

  • Generative modeling: Wasserstein GANs use a dual form of W_1 to stabilize training by providing meaningful gradients even when supports are disjoint.
  • Domain adaptation: Align feature distributions across domains by minimizing a Wasserstein distance to reduce covariate shift.
  • Dataset comparison: Quantify how different datasets are, taking spatial structure into account (e.g., comparing images, word embeddings, or sensor readings).
  • Fairness and recourse: Match score distributions across demographic groups using OT to assess and reduce disparities while preserving structure.
  • Clustering and barycenters: Compute Wasserstein barycenters to find a central prototype distribution of multiple datasets.
  • 1D signals and histograms: When data lies on a line (e.g., grayscale intensities), W_1 is very fast via cumulative sums.
  • Large-scale approximate matching: Entropic regularization plus Sinkhorn enables near-linear algebra style routines that scale to tens or hundreds of thousands of points, especially with structure (e.g., convolutional kernels).

⚠️ Common Mistakes

  • Not normalizing histograms: OT requires equal total mass. Forgetting to normalize a and b (or to handle unmatched mass) leads to infeasible constraints or misleading results.
  • Cost scaling issues: If C has very large values and ε is small, exp(−C/ε) underflows to zero, breaking Sinkhorn. Rescale features or costs (e.g., divide distances by a characteristic scale) and choose ε consistent with cost magnitudes; a minimal sketch of this appears after this list.
  • Using ε too small: Pursuing almost-unregularized solutions with tiny ε can cause numerical instability and slow convergence. Prefer a moderate ε first; if needed, use ε-scaling (start large, then decrease) with log-domain stabilization.
  • Ignoring marginal errors: Stopping Sinkhorn without checking that row/column sums match a, b within tolerance yields infeasible couplings. Monitor marginal residuals and set a clear tolerance.
  • Confusing assignment with OT: The Hungarian algorithm solves one-to-one matching; general OT allows splitting mass and different histogram sizes. Use linear programming or Sinkhorn for general couplings.
  • Misinterpreting WGAN dual: The critic must be 1-Lipschitz. Without proper constraints (e.g., gradient penalty), the learned distance is not a true Wasserstein distance.
  • Overlooking 1D shortcuts: For 1D data, computing W_1 via CDFs is O(n log n), far faster and more accurate than generic LP or Sinkhorn—don’t miss this optimization.
  • Precision pitfalls: Summing many tiny Π_{ij} terms can accumulate error. Use double precision, stable reductions, and optionally compute in the log-domain where feasible.
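
The two scaling pitfalls above can usually be avoided with a few preparatory lines. The sketch below divides the cost matrix by a characteristic scale (the mean cost here, an arbitrary but common choice) and then walks an ε-scaling schedule down to a target value; the numbers and the schedule are illustrative, and the actual solver call is left as a placeholder (for example, the sinkhorn_logdomain routine from the Code Examples section).

#include <bits/stdc++.h>
using namespace std;

int main() {
  // Illustrative cost matrix with large raw values (placeholders, not real data)
  vector<vector<double>> C = {{0.0, 400.0, 1600.0},
                              {400.0, 0.0, 400.0},
                              {1600.0, 400.0, 0.0}};

  // Characteristic scale: mean cost (median or max are also common choices)
  double scale = 0.0;
  int cnt = 0;
  for (auto& row : C) for (double c : row) { scale += c; ++cnt; }
  scale /= cnt;

  // Rescale costs so exp(-C/eps) stays in a representable range
  for (auto& row : C) for (double& c : row) c /= scale;

  // Epsilon-scaling (continuation): start large, shrink toward the target,
  // warm-starting each run from the previous scaling vectors.
  double eps_target = 0.05; // relative to the rescaled costs
  vector<double> schedule = {1.0, 0.5, 0.2, 0.1, eps_target};
  for (double eps : schedule) {
    // Here one would run a (log-domain) Sinkhorn solver at this eps.
    cout << "run Sinkhorn at eps = " << eps << "\n";
  }
  return 0;
}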

Key Formulas

Monge Problem

\( \inf_{T \,:\, T_{\#}\mu = \nu} \int_X c(x, T(x)) \, d\mu(x) \)

Explanation: Find a measurable map T that pushes μ to ν with minimum expected cost. This is often hard or ill-posed because it forbids splitting mass.

Kantorovich Relaxation

\( \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} c(x, y) \, d\pi(x, y) \)

Explanation: Optimize over couplings π with fixed marginals. This convex relaxation always admits solutions and is the standard practical formulation.

p-Wasserstein Distance

\( W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} d(x, y)^p \, d\pi(x, y) \right)^{1/p} \)

Explanation: Defines a metric between probability measures with finite p-th moments. It measures the minimum effort to transform one distribution into the other.

Discrete OT Linear Program

\( \min_{\Pi \ge 0} \ \langle C, \Pi \rangle \quad \text{s.t.} \quad \Pi \mathbf{1}_m = a, \ \Pi^{\top} \mathbf{1}_n = b \)

Explanation: For histograms a and b with cost matrix C, choose nonnegative Π to match row/column sums while minimizing total cost. Angle brackets denote the Frobenius inner product.

Kantorovich–Rubinstein Duality

\( W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \ \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)] \)

Explanation: The 1-Wasserstein distance equals the largest expected difference over all 1-Lipschitz functions. This is the basis of WGAN critics.

1D W_1 via CDFs

\( W_1(\mu, \nu) = \int_{\mathbb{R}} \left| F_{\mu}(t) - F_{\nu}(t) \right| \, dt \)

Explanation: In one dimension, the Earth Mover’s Distance equals the area between CDFs. This enables fast O(n log n) algorithms.

Entropic Regularized OT

\( \min_{\Pi \ge 0,\ \Pi \mathbf{1} = a,\ \Pi^{\top} \mathbf{1} = b} \ \langle C, \Pi \rangle + \varepsilon \sum_{ij} \Pi_{ij} (\log \Pi_{ij} - 1) \)

Explanation: Adds a negative-entropy term weighted by ε to smooth the optimization and improve computational speed and stability.

Sinkhorn Form

\( \Pi^{*} = \mathrm{diag}(u) \, K \, \mathrm{diag}(v), \qquad K_{ij} = \exp(-C_{ij} / \varepsilon) \)

Explanation: The optimal entropic plan factors through left/right scalings of a Gibbs kernel. The vectors u and v are chosen so that row/column sums match a and b.

Sinkhorn Updates

\( u \leftarrow a \oslash (K v), \qquad v \leftarrow b \oslash (K^{\top} u) \)

Explanation: Elementwise divisions update scaling vectors to enforce the marginal constraints. Each iteration costs a pair of matrix-vector products.

Triangle Inequality

\( W_p(\mu, \nu) \le W_p(\mu, \rho) + W_p(\rho, \nu) \)

Explanation: W_p is a true metric; it satisfies non-negativity, symmetry, identity of indiscernibles, and the triangle inequality.

W_p Metrizes Weak Convergence (with moments)

\( W_p(\mu_n, \mu) \to 0 \iff \mu_n \rightharpoonup \mu \ \text{and} \ \int \|x\|^p \, d\mu_n(x) \to \int \|x\|^p \, d\mu(x) \)

Explanation: Convergence in W_p implies weak convergence plus convergence of p-th moments. This makes W_p statistically meaningful for empirical measures.

Sinkhorn Complexity

\( O(nm) \ \text{per iteration}, \qquad O(nmI) \ \text{total for } I \ \text{iterations} \)

Explanation: Each Sinkhorn iteration requires two dense matrix-vector products. With I iterations, runtime grows linearly in I and the size of the cost matrix.

Complexity Analysis

Discrete OT as a linear program over an n×m transport matrix has worst-case polynomial complexity (e.g., interior-point methods with high constants) and memory O(nm). The assignment special case (n = m with one-to-one matching) can be solved in O(n^3) by the Hungarian algorithm. In contrast, the entropically regularized problem is solved via Sinkhorn iterations with per-iteration cost dominated by two matrix-vector products: O(n m) time and O(n m) memory if K is dense. With I iterations to reach tolerance, the overall time is O(n m I). In practice, I scales sublinearly with 1/ε for well-scaled problems; ε-scaling (continuation) can further reduce total iterations. If the cost is structured (e.g., a grid with separable or convolutional costs), K-vector products can be accelerated via FFTs, reducing per-iteration time to near-linear.

Numerical stability is critical. For small ε or large costs, storing K = exp(−C/ε) can underflow; log-domain implementations compute log-sum-exp reductions at O(n m) per iteration with similar constants but greatly improved stability. Space usage is O(n m) to store C (and optionally logK); forming Π explicitly also costs O(n m). If only the cost value is needed, one can avoid materializing Π and compute ∑ Π_{ij} C_{ij} on-the-fly (see the sketch below).

In 1D, W_1 can be computed in O((n + m) log(n + m)) time (for sorting) and O(1) extra space beyond the input, by sweeping over the merged supports and integrating the absolute difference of CDFs. This is orders of magnitude faster and more memory-efficient than generic OT solvers on the line.
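
As a small illustration of the last point about memory, the sketch below accumulates ⟨C, Π⟩ directly from the log-scaling vectors and logK = −C/ε without ever storing the n × m plan; the vectors here are placeholder values standing in for the output of a converged Sinkhorn run.

#include <bits/stdc++.h>
using namespace std;

int main() {
  // Placeholder log-scalings (in practice, taken from a converged Sinkhorn run)
  vector<double> logu = {-0.3, -0.1};
  vector<double> logv = {-0.2, -0.4};
  vector<vector<double>> C = {{0.0, 1.0}, {1.0, 0.0}};
  double eps = 0.1;

  int n = (int)logu.size(), m = (int)logv.size();
  double cost = 0.0;
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      double logP = logu[i] + (-C[i][j] / eps) + logv[j]; // log Pi_ij
      cost += exp(logP) * C[i][j];                        // stream the sum, O(1) extra memory
    }
  }
  cout << "streamed <C,Pi> = " << cost << "\n";
  return 0;
}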

Code Examples

Sinkhorn Algorithm (Log-Stabilized) for Entropic OT between Two Histograms
#include <bits/stdc++.h>
using namespace std;

// Compute log-sum-exp of a vector of terms safely
static double logsumexp(const vector<double>& v) {
  double m = -numeric_limits<double>::infinity();
  for (double x : v) m = max(m, x);
  if (!isfinite(m)) return m; // all -inf
  double s = 0.0;
  for (double x : v) s += exp(x - m);
  return m + log(s);
}

struct SinkhornResult {
  vector<vector<double>> plan; // Transport plan Π (n x m)
  double reg_cost;             // Regularized cost: <C,Π> + ε * sum Π_{ij}(log Π_{ij} - 1)
  double unreg_cost;           // Transport cost <C,Π>
  double row_violation;        // ||Π1 - a||_1
  double col_violation;        // ||Π^T1 - b||_1
  int iters;                   // Iterations performed
};

// Entropic OT via log-domain Sinkhorn.
// a: size n, b: size m (must be nonnegative and sum to 1), C: n x m cost matrix (nonnegative), epsilon > 0
SinkhornResult sinkhorn_logdomain(const vector<double>& a_in,
                                  const vector<double>& b_in,
                                  const vector<vector<double>>& C,
                                  double epsilon = 0.05,
                                  int max_iters = 10000,
                                  double tol = 1e-9) {
  int n = (int)a_in.size();
  int m = (int)b_in.size();
  // Normalize a and b to sum to 1
  auto a = a_in, b = b_in;
  auto sumvec = [](const vector<double>& v){ double s=0; for(double x: v) s+=x; return s; };
  double sa = sumvec(a), sb = sumvec(b);
  if (sa <= 0 || sb <= 0) throw runtime_error("Input histograms must have positive total mass.");
  for (double& x : a) x /= sa;
  for (double& y : b) y /= sb;

  // Precompute log a and log b, handling zeros as -inf
  vector<double> loga(n), logb(m);
  for (int i = 0; i < n; ++i) loga[i] = (a[i] > 0) ? log(a[i]) : -numeric_limits<double>::infinity();
  for (int j = 0; j < m; ++j) logb[j] = (b[j] > 0) ? log(b[j]) : -numeric_limits<double>::infinity();

  // logK = -C/epsilon
  vector<vector<double>> logK(n, vector<double>(m));
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      logK[i][j] = -C[i][j] / epsilon;
    }
  }

  // Initialize log u and log v to zeros
  vector<double> logu(n, 0.0), logv(m, 0.0);

  int it = 0;
  for (; it < max_iters; ++it) {
    // Update logu: logu_i = log a_i - logsumexp_j (logK_ij + logv_j)
    vector<double> new_logu(n);
    for (int i = 0; i < n; ++i) {
      vector<double> tmp(m);
      for (int j = 0; j < m; ++j) tmp[j] = logK[i][j] + logv[j];
      double lse = logsumexp(tmp);
      new_logu[i] = loga[i] - lse;
    }
    logu.swap(new_logu);

    // Update logv: logv_j = log b_j - logsumexp_i (logK_ij + logu_i)
    vector<double> new_logv(m);
    for (int j = 0; j < m; ++j) {
      vector<double> tmp(n);
      for (int i = 0; i < n; ++i) tmp[i] = logK[i][j] + logu[i];
      double lse = logsumexp(tmp);
      new_logv[j] = logb[j] - lse;
    }
    logv.swap(new_logv);

    // Check marginal violations every few iterations (here each iteration for clarity)
    // Compute Π implicitly to get row/col sums:
    double row_err = 0.0, col_err = 0.0;
    vector<double> rowsum(n, 0.0), colsum(m, 0.0);
    for (int i = 0; i < n; ++i) {
      for (int j = 0; j < m; ++j) {
        double logP = logu[i] + logK[i][j] + logv[j];
        double P = exp(logP);
        rowsum[i] += P;
        colsum[j] += P;
      }
    }
    for (int i = 0; i < n; ++i) row_err += fabs(rowsum[i] - a[i]);
    for (int j = 0; j < m; ++j) col_err += fabs(colsum[j] - b[j]);
    if (row_err + col_err < tol) break;
  }

  // Build explicit plan and compute costs and violations
  vector<vector<double>> P(n, vector<double>(m));
  double cost = 0.0;
  double reg_term = 0.0; // sum Π_{ij}(log Π_{ij} - 1)
  vector<double> rowsum(n, 0.0), colsum(m, 0.0);
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
      double logP = logu[i] + logK[i][j] + logv[j];
      double pij = exp(logP);
      P[i][j] = pij;
      cost += pij * C[i][j];
      if (pij > 0) reg_term += pij * (log(pij) - 1.0);
      rowsum[i] += pij;
      colsum[j] += pij;
    }
  }
  double row_err = 0.0, col_err = 0.0;
  for (int i = 0; i < n; ++i) row_err += fabs(rowsum[i] - a[i]);
  for (int j = 0; j < m; ++j) col_err += fabs(colsum[j] - b[j]);

  SinkhornResult res;
  res.plan = move(P);
  res.unreg_cost = cost;
  res.reg_cost = cost + epsilon * reg_term;
  res.row_violation = row_err;
  res.col_violation = col_err;
  res.iters = it + 1;
  return res;
}

int main() {
  // Example: two 1D histograms on three bins with Euclidean cost
  vector<double> a = {0.5, 0.4, 0.1};
  vector<double> b = {0.3, 0.3, 0.4};

  // Support points (e.g., positions on the line)
  vector<double> xa = {0.0, 1.0, 2.0};
  vector<double> xb = {0.0, 1.0, 2.0};

  int n = (int)a.size(), m = (int)b.size();
  vector<vector<double>> C(n, vector<double>(m));
  for (int i = 0; i < n; ++i) for (int j = 0; j < m; ++j) {
    double d = fabs(xa[i] - xb[j]);
    C[i][j] = d; // use p=1 cost (you could square it for W2^2)
  }

  double epsilon = 0.1;
  auto res = sinkhorn_logdomain(a, b, C, epsilon, 5000, 1e-10);

  cout << fixed << setprecision(6);
  cout << "Iterations: " << res.iters << "\n";
  cout << "Unregularized transport cost <C,Pi>: " << res.unreg_cost << "\n";
  cout << "Regularized objective value: " << res.reg_cost << "\n";
  cout << "Row L1 violation: " << res.row_violation << ", Col L1 violation: " << res.col_violation << "\n";

  cout << "Transport plan (Pi):\n";
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) cout << res.plan[i][j] << (j+1==m?'\n':' ');
  }

  return 0;
}

This program solves entropic-regularized OT between two discrete distributions a and b using a numerically stable, log-domain Sinkhorn algorithm. It precomputes logK = −C/ε, iteratively updates log-scaling vectors logu and logv using log-sum-exp reductions, and stops when the marginal constraints are met within tolerance. Finally, it materializes the transport plan Π = diag(u) K diag(v) (implicitly via logs), and reports the transport cost ⟨C, Π⟩ and the regularized objective value. The example uses a 1D grid with L1 ground cost, but any cost matrix can be used.

Time: O(n m I) per run, where I is the number of Sinkhorn iterations. Space: O(n m) to store C and logK, plus O(n + m) for scaling vectors.

Fast 1D Wasserstein-1 (EMD) via CDF Difference Integration
#include <bits/stdc++.h>
using namespace std;

// Compute W1(μ, ν) in 1D by integrating |F_μ - F_ν| over R.
// Inputs: points x with masses a, points y with masses b. Masses will be normalized to sum to 1.
double wasserstein1_1d(vector<double> x, vector<double> a, vector<double> y, vector<double> b) {
  int n = (int)x.size(), m = (int)y.size();
  if (n != (int)a.size() || m != (int)b.size()) throw runtime_error("Mismatched sizes");

  auto normalize = [](vector<double>& w){
    double s = 0.0; for (double v : w) s += v; if (s <= 0) throw runtime_error("Nonpositive mass");
    for (double& v : w) v /= s;
  };
  normalize(a); normalize(b);

  // Pack events: position, type (0 for μ jump, 1 for ν jump), weight
  struct Ev { double pos; int type; double w; };
  vector<Ev> evs; evs.reserve(n + m);
  for (int i = 0; i < n; ++i) evs.push_back({x[i], 0, a[i]});
  for (int j = 0; j < m; ++j) evs.push_back({y[j], 1, b[j]});

  sort(evs.begin(), evs.end(), [](const Ev& A, const Ev& B){
    if (A.pos != B.pos) return A.pos < B.pos;
    return A.type < B.type; // process μ before ν at ties (arbitrary but consistent)
  });

  // Sweep over positions, integrating |F_μ - F_ν| piecewise
  double Fmu = 0.0, Fnu = 0.0;
  double prev = evs.empty() ? 0.0 : evs.front().pos;
  double w1 = 0.0;
  size_t k = 0;
  while (k < evs.size()) {
    double xk = evs[k].pos;
    // integrate over [prev, xk)
    w1 += fabs(Fmu - Fnu) * (xk - prev);
    // apply all jumps at xk
    while (k < evs.size() && evs[k].pos == xk) {
      if (evs[k].type == 0) Fmu += evs[k].w; else Fnu += evs[k].w;
      ++k;
    }
    prev = xk;
  }
  // No more events; tails contribute zero because F_μ and F_ν both equal 1 for t beyond the last support

  return w1;
}

int main() {
  // Example: two point-mass distributions on the real line
  vector<double> x = {0.0, 1.0, 3.0};
  vector<double> a = {0.2, 0.5, 0.3};
  vector<double> y = {0.5, 2.0};
  vector<double> b = {0.6, 0.4};

  double w1 = wasserstein1_1d(x, a, y, b);
  cout << fixed << setprecision(6);
  cout << "W1 (1D EMD) between μ and ν: " << w1 << "\n";

  return 0;
}

This program computes the exact 1D Earth Mover’s Distance W_1 by integrating the absolute difference between the CDFs of μ and ν. It normalizes the masses, merges and sorts support points, sweeps from left to right, and accumulates |F_μ − F_ν| times the interval length. This is significantly faster and more accurate in 1D than generic OT solvers.

Time: O((n + m) log(n + m)) due to sorting. Space: O(n + m) additional space for merged events.