How I Study AI - Learn AI Papers & Lectures the Easy Way
Theory · Intermediate

Triplet Loss & Contrastive Loss

Key Points

  • Triplet loss and contrastive loss are metric-learning objectives that teach a model to map similar items close together and dissimilar items far apart in an embedding space.
  • Triplet loss uses an anchor, a positive (same class), and a negative (different class) and enforces a margin between positive and negative distances.
  • Contrastive loss operates on pairs labeled as similar or dissimilar and applies a margin only to dissimilar pairs.
  • Choosing a distance metric (e.g., Euclidean or cosine) and deciding whether to L2-normalize embeddings critically affects performance.
  • Mining or sampling informative pairs/triplets is often more important than larger batches for effective learning.
  • Margins control a safety buffer; too small a margin yields overlapping clusters, too large causes training to stall.
  • Both losses are differentiable almost everywhere and work well with gradient-based optimization.
  • Computational cost is linear in batch size times dimension for the forward pass, but can be quadratic if you form all pairs/triplets for mining.

Prerequisites

  • Vector norms and inner products — Distances like Euclidean and cosine are defined using norms and dot products.
  • Basic calculus and gradients — Both losses require computing gradients for optimization.
  • Linear algebra (matrices, matrix-vector multiplication) — Implementing and understanding embeddings often uses linear transforms.
  • Optimization (SGD/Adam) and learning rate tuning — Training with these losses relies on gradient-based optimization and hyperparameter tuning.
  • Supervised learning basics — You must understand labeled data, batches, and loss minimization.
  • Sampling and data batching — Effective training requires constructing informative pairs/triplets within batches.

Detailed Explanation


01Overview

Triplet loss and contrastive loss are core objectives in metric learning, a paradigm where we learn a function that maps raw inputs (images, audio, text) to vectors (embeddings) such that semantic similarity corresponds to geometric closeness. Instead of predicting labels directly, models trained with these losses learn an embedding space where distances capture meaning. Triplet loss uses triplets of examples: an anchor (a), a positive (p) from the same class or identity, and a negative (n) from a different class. It encourages the anchor to be closer to the positive than to the negative by at least a margin. Contrastive loss works with pairs labeled as similar or dissimilar; it pulls similar pairs together and pushes dissimilar pairs apart past a margin. These objectives power applications like face recognition, product search, speaker verification, and deduplication, where you need to compare unseen items at test time via nearest neighbors rather than fixed class logits.

A crucial practical element is the choice of distance (Euclidean vs. cosine) and embedding normalization. Another is mining: not all pairs/triplets are informative, and training progresses faster if we choose hard or semi-hard examples. Proper batching, margin tuning, and regularization (e.g., L2-normalization) combine to create stable, discriminative embeddings.

02Intuition & Analogies

Imagine organizing a large photo album of your friends and their families without writing names on the back. You want photos of the same person to end up close to each other and photos of different people to be apart on your table. With contrastive loss, you pick two photos at a time and decide: put them closer if they are the same person, or keep them at least a fixed distance apart if they are different. With triplet loss, you always consider three: an anchor photo, a known-matching positive photo, and a known-nonmatching negative. Your rule is simple: the anchor must be closer to the positive than to the negative by some safety cushion (the margin). If that rule is already satisfied, you do nothing; if not, you adjust their positions. The margin acts like a buffer zone preventing borderline confusion—think of leaving a seat between strangers in a theater to avoid awkwardness; too small and you’ll bump elbows, too large and you’ll waste space.

Choosing a distance is like choosing whether you care about absolute position (Euclidean) or direction (cosine). If you only care that two photos line up in the same direction regardless of scale, you normalize them and use cosine distance. Over time, these small nudges carve the table into islands of identity, where items of the same kind cluster together and different kinds are clearly separated. The trickiest part is choosing which examples to adjust next: moving already well-separated items is pointless, while moving the truly confusing ones yields the biggest improvement.

03Formal Definition

Let f_θ : X → R^d be an embedding function parameterized by θ. Let d(·, ·) be a distance, commonly Euclidean d(u, v) = ‖u − v‖₂ or cosine distance d_cos(u, v) = 1 − (u·v) / (‖u‖₂ ‖v‖₂). For a triplet (a, p, n) with a and p from the same class, and n from a different class, the triplet loss with margin m > 0 is

L_triplet(a, p, n) = max{0, d(f_θ(a), f_θ(p)) − d(f_θ(a), f_θ(n)) + m}.

Often, the squared Euclidean distance d²(u, v) = ‖u − v‖₂² is used for smoother gradients. For a labeled pair (x_i, x_j) with y_ij ∈ {0, 1} indicating similarity (1 means similar), the contrastive loss with margin m > 0 is

L_contrast(x_i, x_j) = y_ij d(f_θ(x_i), f_θ(x_j))² + (1 − y_ij) [max{0, m − d(f_θ(x_i), f_θ(x_j))}]².

Batch losses average over samples: L = (1/B) Σ_{b=1}^{B} L_b. These objectives are piecewise smooth and subdifferentiable due to the hinge max(0, ·). Gradients are nonzero only when constraints are violated: for triplet loss, when d(a, p) − d(a, n) + m > 0; for contrastive loss, when a pair is similar (always active) or when a dissimilar pair is within the margin.

04When to Use

Use triplet or contrastive loss when your downstream task is retrieval, verification, clustering, or one-shot/few-shot recognition—any scenario where you will compare embeddings at inference time instead of predicting a closed-set label. Triplet loss shines when identity-level constraints are available, e.g., face recognition: given (anchor, same person, different person), it learns a margin-based separation. Contrastive loss is natural in verification tasks where you can label pairs as same/different, such as signature verification, speaker verification, or product deduplication. When your classes are numerous and evolve over time, metric learning generalizes well: new identities can be handled with nearest neighbors in the learned space without retraining a final classifier layer. Choose cosine distance (often with L2-normalized embeddings) when direction matters more than magnitude or to stabilize training across varying scales; choose Euclidean distance when absolute differences and cluster radii are meaningful. Prefer semi-hard or hard mining when naive random sampling yields mostly easy, uninformative pairs/triplets; but avoid only-the-hardest mining early in training to prevent collapse. If your labels are scarce but pairwise constraints are cheap (e.g., clicks, co-occurrence), contrastive setups can leverage them effectively.

⚠️Common Mistakes

• Mixing up label conventions in contrastive loss: some implementations use y = 1 for similar, others for dissimilar. Using the wrong convention silently inverts gradients. Always verify with a small hand-crafted example.
• Inconsistent distance and normalization: using cosine distance without L2-normalization, or comparing squared Euclidean distances to a margin tuned for unsquared distances, leads to mismatched scales and unstable training.
• Margin mis-tuning: too small a margin yields overlapping clusters; too large stops learning because the hinge is rarely active. Sweep margins and monitor the fraction of active constraints.
• Poor sampling/mining: random pairs/triplets quickly become easy, causing vanishing gradients; exclusively hardest negatives can introduce label noise and collapse early. Use semi-hard mining or distance-weighted sampling.
• Data leakage across identities: ensure that identities in validation/test are disjoint from training to get realistic generalization metrics.
• Not balancing classes: in imbalanced datasets, positives may be rare; without careful sampling, the model may focus mostly on negatives and underfit positives.
• Ignoring numerical stability: dividing by very small norms when computing cosine distance can blow up; add an epsilon.
• Forgetting to average per batch: summing losses without normalization makes learning rate dependent on batch size, complicating tuning.

Key Formulas

Triplet Loss

L_triplet(a, p, n) = max{0, d(f_θ(a), f_θ(p)) − d(f_θ(a), f_θ(n)) + m}

Explanation: For each triplet, the loss is zero if the anchor is already closer to the positive than to the negative by at least margin m; otherwise it increases linearly with the violation.

Contrastive Loss

L_contrast(x_i, x_j) = y_ij d(f_θ(x_i), f_θ(x_j))² + (1 − y_ij) [max{0, m − d(f_θ(x_i), f_θ(x_j))}]²

Explanation: Similar pairs are pulled together via squared distance, while dissimilar pairs are pushed apart until they are at least m apart.

Euclidean and Squared Euclidean Distance

d₂(u, v) = ‖u − v‖₂ = √( Σ_{k=1}^{d} (u_k − v_k)² ),   d₂²(u, v) = Σ_{k=1}^{d} (u_k − v_k)²

Explanation: Euclidean distance measures straight-line distance; the squared form removes the square root, yielding smoother gradients and lower computational cost.

Cosine Distance

d_cos(u, v) = 1 − (u⊤v) / (‖u‖₂ ‖v‖₂ + ε)

Explanation: Cosine distance focuses on the angle between vectors, often used with L2-normalized embeddings; epsilon avoids division by zero.

Hinge Operator

[z]₊ = max{0, z}

Explanation: The hinge keeps only positive violations; if a constraint is satisfied (z <= 0), it contributes nothing to the loss or gradient.

Gradients of Squared Distance

∂/∂a ‖a − p‖₂² = 2(a − p),   ∂/∂p ‖a − p‖₂² = −2(a − p)

Explanation: Useful for implementing backpropagation with squared Euclidean distances; gradients point from one vector to the other.

Active Triplet Gradients (Squared Euclidean)

∇_a L = 2(a − p) − 2(a − n),   ∇_p L = −2(a − p),   ∇_n L = 2(a − n)   if d²(a, p) − d²(a, n) + m > 0

Explanation: When the triplet constraint is violated, these are the gradients for anchor, positive, and negative embeddings; otherwise gradients are zero.

Batch Averaging

L_batch = (1/B) Σ_{i=1}^{B} L_i

Explanation: Averages individual losses over a batch to keep the scale consistent across batch sizes, simplifying learning rate tuning.

Complexity Analysis

Assume embeddings have dimension d and a batch contains B anchors (with their positives and negatives for triplet loss) or B pairs for contrastive loss. Computing the squared Euclidean distance between two d-dimensional vectors costs O(d) time and O(1) extra space. Therefore, a naive forward pass for triplet loss over B triplets is O(B d) time and O(1) additional space beyond storing the embeddings; similarly, contrastive loss over B pairs is O(B d) time. If you construct all pairwise distances in a batch to perform mining (e.g., batch-all or batch-hard), you must compute a B×B distance matrix, which costs O(B² d) time and O(B²) space to store. This quadratic overhead often dominates, so practitioners cap batch sizes or compute distances on the fly without materializing the full matrix.

Backpropagation doubles the constant factors but preserves the same asymptotic complexity: gradients for squared distances are O(d) per pair/triplet, leading to O(B d) without pairwise mining and O(B² d) with all-pairs mining. When training a linear embedding f(x) = Wx with output dimension k, the gradient with respect to W is an outer product of size k × d_in per example, so accumulating over B items is O(B k d_in) time and O(k d_in) space. Memory-wise, storing B embeddings costs O(B d). Enabling L2-normalization adds an O(d) pass per vector. Using cosine distance adds normalization and dot-product overhead but remains O(d). Overall, the practical bottlenecks are batch-all mining (quadratic), large d, and the cost of loading data; careful batching and semi-hard mining mitigate these costs.

Code Examples

Triplet Loss (Squared Euclidean) with Optional L2-Normalization
#include <bits/stdc++.h>
using namespace std;

// Small epsilon for numerical stability
const double EPS = 1e-12;

// L2-normalize a vector in place
void l2_normalize(vector<double>& v) {
    double norm = 0.0;
    for (double x : v) norm += x * x;
    norm = sqrt(max(norm, EPS));
    for (double& x : v) x /= norm;
}

// Squared Euclidean distance between two vectors
double sqeuclidean(const vector<double>& a, const vector<double>& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Triplet loss for a batch of (anchor, positive, negative) embeddings.
// If use_l2_norm is true, each embedding is normalized before computing distances.
// Returns the average loss over the batch.
double triplet_loss_batch(vector<vector<double>> A,
                          vector<vector<double>> P,
                          vector<vector<double>> N,
                          double margin,
                          bool use_l2_norm = false) {
    size_t B = A.size();
    assert(P.size() == B && N.size() == B);
    if (B == 0) return 0.0;

    // Optionally normalize copies (arguments are passed by value to keep originals intact)
    if (use_l2_norm) {
        for (size_t i = 0; i < B; ++i) {
            l2_normalize(A[i]);
            l2_normalize(P[i]);
            l2_normalize(N[i]);
        }
    }

    double loss_sum = 0.0;
    for (size_t i = 0; i < B; ++i) {
        double dap = sqeuclidean(A[i], P[i]);
        double dan = sqeuclidean(A[i], N[i]);
        double violation = dap - dan + margin;       // hinge argument
        if (violation > 0.0) loss_sum += violation;  // max(0, violation)
    }
    return loss_sum / static_cast<double>(B);
}

int main() {
    // Tiny demo with 3-D embeddings
    vector<vector<double>> A = {{1.0, 0.9, 1.1}, {0.0, 0.1, -0.1}};  // anchors
    vector<vector<double>> P = {{1.1, 1.0, 1.0}, {0.05, 0.1, 0.0}};  // positives (same class)
    vector<vector<double>> N = {{3.0, -2.0, 0.0}, {-1.0, 1.0, 1.0}}; // negatives (different class)

    double margin = 0.5;

    double L_no_norm = triplet_loss_batch(A, P, N, margin, false);
    double L_norm = triplet_loss_batch(A, P, N, margin, true);

    cout << fixed << setprecision(6);
    cout << "Triplet loss (no L2-norm): " << L_no_norm << "\n";
    cout << "Triplet loss (with L2-norm): " << L_norm << "\n";
    return 0;
}

Computes the average triplet loss over a batch using squared Euclidean distance. Optional L2-normalization stabilizes scale and approximates cosine-based behavior. The hinge is active only when the positive is not sufficiently closer than the negative by the margin.

Time: O(B d) for B triplets of dimension d
Space: O(B d) to hold inputs; O(1) extra if operating in place (this version copies when normalization is enabled)
Contrastive Loss (Pairs) with Margin
#include <bits/stdc++.h>
using namespace std;

const double EPS = 1e-12;

// Squared Euclidean distance between two vectors
double sqeuclidean(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

double contrastive_loss_batch(const vector<vector<double>>& X,
                              const vector<vector<double>>& Y,
                              const vector<int>& similar,  // 1 for similar, 0 for dissimilar
                              double margin) {
    size_t B = X.size();
    assert(Y.size() == B && similar.size() == B);
    double loss_sum = 0.0;
    for (size_t i = 0; i < B; ++i) {
        double d2 = sqeuclidean(X[i], Y[i]);
        double d = sqrt(max(d2, EPS));
        if (similar[i] == 1) {
            // Pull similar pairs together (squared distance)
            loss_sum += d2;
        } else {
            // Push dissimilar pairs apart until they are at least 'margin' away
            double term = max(0.0, margin - d);
            loss_sum += term * term;
        }
    }
    return loss_sum / static_cast<double>(B);
}

int main() {
    // Two-dimensional toy embeddings
    vector<vector<double>> X = {{0.0, 0.0}, {1.0, 1.0}, {2.0, 2.0}, {5.0, 5.0}};
    vector<vector<double>> Y = {{0.1, -0.1}, {1.2, 1.1}, {3.0, 2.8}, {4.6, 5.2}};
    // Labels: 1 = similar, 0 = dissimilar
    vector<int> sim = {1, 1, 0, 0};

    double margin = 1.0;
    double L = contrastive_loss_batch(X, Y, sim, margin);
    cout << fixed << setprecision(6) << "Contrastive loss: " << L << "\n";
    return 0;
}

Implements standard contrastive loss. Similar pairs contribute their squared distance; dissimilar pairs contribute the squared hinge on (margin - distance), encouraging them to be at least the margin apart.

Time: O(B d) for B pairs of dimension d
Space: O(B d) to hold inputs and O(1) extra workspace
Mini Training Loop: Learn a Linear Embedding with Triplet Loss (with Gradients)
#include <bits/stdc++.h>
using namespace std;

// Utility: squared Euclidean distance
double sqeuclidean(const vector<double>& a, const vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Matrix-vector multiply: y = W * x, where W is (k x n)
vector<double> matvec(const vector<vector<double>>& W, const vector<double>& x) {
    size_t k = W.size();
    size_t n = W[0].size();
    vector<double> y(k, 0.0);
    for (size_t i = 0; i < k; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

// Outer product: G += g * x^T (accumulate into G)
void outer_add(vector<vector<double>>& G, const vector<double>& g, const vector<double>& x) {
    for (size_t i = 0; i < G.size(); ++i)
        for (size_t j = 0; j < G[0].size(); ++j)
            G[i][j] += g[i] * x[j];
}

// Compute triplet loss and gradients w.r.t. embeddings (squared Euclidean).
// Returns the loss; fills ga, gp, gn with gradients when active, else zeros.
double triplet_loss_embed_grad(const vector<double>& a,
                               const vector<double>& p,
                               const vector<double>& n,
                               double margin,
                               vector<double>& ga,
                               vector<double>& gp,
                               vector<double>& gn) {
    double dap = sqeuclidean(a, p);
    double dan = sqeuclidean(a, n);
    double v = dap - dan + margin;
    fill(ga.begin(), ga.end(), 0.0);
    fill(gp.begin(), gp.end(), 0.0);
    fill(gn.begin(), gn.end(), 0.0);
    if (v > 0.0) {
        // Gradients for squared Euclidean distances
        for (size_t i = 0; i < a.size(); ++i) {
            ga[i] = 2.0 * (a[i] - p[i]) - 2.0 * (a[i] - n[i]);
            gp[i] = -2.0 * (a[i] - p[i]);
            gn[i] = 2.0 * (a[i] - n[i]);
        }
        return v;
    }
    return 0.0;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create a tiny 2D toy dataset with two classes:
    // class 0 around (0,0); class 1 around (3,3)
    mt19937 rng(42);
    normal_distribution<double> noise(0.0, 0.2);
    vector<pair<vector<double>, int>> data;  // (x, label)
    for (int i = 0; i < 20; ++i) {
        vector<double> x0 = {noise(rng), noise(rng)};
        vector<double> x1 = {3.0 + noise(rng), 3.0 + noise(rng)};
        data.push_back({x0, 0});
        data.push_back({x1, 1});
    }

    // Linear embedding W: k x n (here k = 2, n = 2)
    vector<vector<double>> W = {{1.0, 0.0}, {0.0, 1.0}};  // start as identity

    double lr = 0.05;     // learning rate
    double margin = 0.5;  // triplet margin
    int epochs = 50;
    int triplets_per_epoch = 200;

    uniform_int_distribution<int> idx_dist(0, (int)data.size() - 1);

    for (int epoch = 1; epoch <= epochs; ++epoch) {
        // Accumulate loss for logging
        double epoch_loss = 0.0;
        // Zero the gradient accumulator for W
        vector<vector<double>> dW(W.size(), vector<double>(W[0].size(), 0.0));

        for (int t = 0; t < triplets_per_epoch; ++t) {
            // Sample an anchor, then a positive and a negative
            int ia = idx_dist(rng);
            int la = data[ia].second;
            // Positive: resample until same label but different index
            int ip = ia;
            while (ip == ia || data[ip].second != la) ip = idx_dist(rng);
            // Negative: resample until different label
            int in = ia;
            while (data[in].second == la) in = idx_dist(rng);

            const vector<double>& xa = data[ia].first;
            const vector<double>& xp = data[ip].first;
            const vector<double>& xn = data[in].first;

            // Forward: embeddings
            vector<double> ea = matvec(W, xa);
            vector<double> ep = matvec(W, xp);
            vector<double> en = matvec(W, xn);

            // Backward: gradients w.r.t. embeddings
            vector<double> ga(ea.size()), gp(ep.size()), gn(en.size());
            double L = triplet_loss_embed_grad(ea, ep, en, margin, ga, gp, gn);
            epoch_loss += L;

            if (L > 0.0) {
                // Chain rule: dW += ga * xa^T + gp * xp^T + gn * xn^T
                outer_add(dW, ga, xa);
                outer_add(dW, gp, xp);
                outer_add(dW, gn, xn);
            }
        }

        // Average gradients and update W (simple SGD)
        double scale = 1.0 / static_cast<double>(triplets_per_epoch);
        for (size_t i = 0; i < W.size(); ++i)
            for (size_t j = 0; j < W[0].size(); ++j)
                W[i][j] -= lr * scale * dW[i][j];

        cout << "Epoch " << epoch << ": avg loss per triplet = "
             << fixed << setprecision(6) << (epoch_loss / triplets_per_epoch) << "\n";
    }

    // Show the learned W
    cout << "Learned W:\n";
    for (auto& row : W) {
        for (double w : row) cout << w << ' ';
        cout << '\n';
    }

    return 0;
}

Trains a small linear embedding with triplet loss on a synthetic 2D dataset. It derives explicit gradients for squared Euclidean distances, accumulates them into dW via outer products, and applies SGD. The loss should decrease over epochs, showing the embedding separating classes with a margin.

Time: O(E · T · (k n + d)) ≈ O(E T k n), where E is epochs, T is triplets per epoch, n is the input dimension, and k is the output dimension; each step computes a few matrix-vector multiplies and vector ops
Space: O(k n) for parameters and gradients, plus O(k) and O(n) temporaries
#triplet loss #contrastive loss #metric learning #siamese network #embedding #euclidean distance #cosine similarity #hard negative mining #l2 normalization #hinge loss #face recognition #verification #retrieval #pairwise loss #margin