Label Smoothing
Key Points
- Label smoothing replaces a hard one-hot target with a slightly softened distribution to reduce model overconfidence.
- The smoothed target mixes the one-hot vector with the uniform distribution: y^{(ls)} = (1−ε)·y + ε·u for K classes, where u_i = 1/K.
- Using label smoothing with cross-entropy is equivalent to adding a penalty that encourages higher-entropy (less peaked) predictions.
- The gradient for softmax + cross-entropy becomes p − y^{(ls)}, so the model is nudged less aggressively toward absolute certainty.
- Typical ε values are 0.05–0.2; too large an ε can harm accuracy by blurring class distinctions.
- It is most beneficial with small datasets, noisy labels, or when calibration (reliable probabilities) matters.
- Do not apply smoothing at inference, and do not double-smooth if your labels are already soft (e.g., from a teacher model).
- Time and memory costs are minimal: O(K) per example, which is negligible compared to the forward/backward passes of the network.
Prerequisites
- Probability distributions — Understanding targets as probability vectors and the role of the uniform distribution.
- Vectors and arrays — Targets and predictions are K-dimensional vectors manipulated element-wise.
- Softmax — Maps logits to probabilities; essential for multi-class cross-entropy.
- Cross-entropy loss — Label smoothing plugs into cross-entropy by changing the target distribution.
- Gradient descent / backpropagation — To see how smoothing changes gradients (p − y^{(ls)}).
- Numerical stability (log-sum-exp) — Stable softmax avoids overflow/underflow in exponentials and logs.
- Overfitting and regularization — Label smoothing is a regularization method aimed at better generalization.
- One-hot encoding — The baseline target representation that smoothing modifies.
Detailed Explanation
01 Overview
Label smoothing is a simple regularization technique for classification. Instead of training a model against hard one-hot targets that demand absolute certainty, we replace each target with a softened probability distribution. For K classes and smoothing parameter ε in [0,1), we assign (1−ε)+ε/K to the correct class and ε/K to each incorrect class. This subtly discourages the model from placing probability 1.0 on the correct class and 0.0 on all others, which can improve generalization and probability calibration. Practically, you keep the same loss (cross-entropy), but you change the target vector used in that loss. The change is small but impactful: it reduces the incentive to memorize noise and punishes overly confident predictions. Label smoothing is widely used in computer vision, natural language processing (e.g., attention-based sequence models), and other multi-class problems, often improving both validation accuracy and calibration with negligible computational overhead.
02 Intuition & Analogies
Imagine grading a multiple-choice exam where you know a student selected the correct answer, but you also recognize that the question was a bit ambiguous. Rather than awarding a rigid 100% to the correct option and 0% to all others, you give a tiny bit of “partial credit” to other plausible choices. This does not change who is right, but it communicates uncertainty in a measured, systematic way. In model training, one-hot labels say “be 100% certain the correct class is right and 0% for the rest.” Real-world data often contains noise: mislabeled images, ambiguous categories, or overlapping classes. If we always demand absolute certainty, the model can become overconfident and overfit, memorizing quirks of the training set. Label smoothing tells the model: “Aim for high confidence, but leave room for doubt.” Another analogy: a weather forecast rarely says 100% chance of rain; even on stormy days it might say 90–95%. The forecast remains decisive but avoids extremes unless warranted. Similarly, label smoothing encourages the network to distribute a small fraction of probability mass to other classes, which acts like a gentle regularizer. The result is typically better calibration (probabilities that match reality), less overfitting to noise, and sometimes improved top-1 accuracy. The tweak is tiny—just adjust the targets—but the downstream effects on learning dynamics are meaningful.
03 Formal Definition
Given K classes, a one-hot target y with y_c = 1 for the true class c, and a smoothing parameter ε in [0,1), the smoothed target is y^{(ls)} = (1−ε)·y + ε·u, where u is the uniform distribution with u_i = 1/K. Component-wise, y_i^{(ls)} = 1−ε+ε/K if i = c, and ε/K otherwise. Training minimizes the usual cross-entropy, with y^{(ls)} in place of y: L = −Σ_i y_i^{(ls)} log p_i, where p = softmax(z).
04 When to Use
- When models overfit or become overconfident on training data, especially in small or noisy datasets.
- In large-scale vision tasks (e.g., ImageNet) and NLP sequence models (e.g., translation) where label smoothing is a standard baseline that often improves validation metrics and calibration.
- When calibrated probabilities matter (risk-sensitive decisions), because smoothing reduces extreme probabilities and better aligns confidence with accuracy.
- When labels may be imperfect, ambiguous, or weakly defined; smoothing reduces sensitivity to outliers and mislabeled examples.
- When you want a drop-in regularizer that requires almost no code change and adds negligible computation.
Avoid or adapt it when: (a) labels are already soft (e.g., knowledge distillation or human-provided probability labels); (b) you have multi-label problems with independent binary targets — use a binary variant (soften 0/1 toward 0.5) rather than multi-class smoothing; (c) you require perfectly sharp posteriors for some reason (rare); or (d) extreme class imbalance suggests class-dependent smoothing instead of a single global ε.
⚠️Common Mistakes
- Applying smoothing at inference time. Only targets used for training should be smoothed; never smooth model predictions at test time.
- Using too large an ε (e.g., ≥ 0.3 for small K), which blurs distinctions and can hurt accuracy. Start with 0.05–0.1 and tune.
- Forgetting to use the correct number of classes K when computing ε/K, especially if some classes are missing in a batch.
- Double-smoothing: if your labels are already soft (e.g., teacher-student distillation or mixup), adding label smoothing may overly flatten targets.
- Applying multi-class smoothing to multi-label problems that use binary cross-entropy; instead, soften each binary target toward 0.5: y^{(ls)} = (1−ε)·y + ε·0.5.
- Numerical instability from taking log p_i when p_i is extremely small; use a stable softmax and clamp probabilities if needed.
- Misinterpreting accuracy drops on training: smoothing often lowers training accuracy while improving validation accuracy and calibration; judge by validation metrics and reliability, not just training accuracy.
Key Formulas
Label Smoothing Definition
y^{(ls)} = (1−ε)·y + ε·u
Explanation: The smoothed target is a convex combination of the one-hot label y and the uniform distribution u. This reduces the target’s peak and spreads a small mass over other classes.
Uniform Distribution
u_i = 1/K for i = 1, …, K
Explanation: Each class gets equal probability in the uniform distribution. This is the baseline mixed into the one-hot label during smoothing.
Cross-Entropy
H(y, p) = −Σ_i y_i log p_i
Explanation: The cross-entropy between target y and prediction p is the negative expected log-probability of the true label. Minimizing this encourages the model to assign high probability to the true class.
Smoothed Cross-Entropy
L_ls = −Σ_i y_i^{(ls)} log p_i
Explanation: Training with label smoothing simply replaces y with y^{(ls)} in the cross-entropy. Computation and gradients stay the same form.
Mixture Loss View
L_ls = (1−ε)·H(y, p) + ε·H(u, p)
Explanation: Label smoothing equals a weighted average of the standard cross-entropy and a penalty that pushes predictions toward uniform. This reveals its regularizing effect.
Softmax
p_i = exp(z_i) / Σ_j exp(z_j)
Explanation: Softmax converts logits z into a probability distribution p. It is used with cross-entropy in multi-class classification.
Gradient w.r.t. Logits
∂L_ls/∂z_i = p_i − y_i^{(ls)}
Explanation: For softmax plus cross-entropy, the derivative keeps the same simple form: prediction minus (smoothed) target. This makes implementation straightforward.
Uniform CE Decomposition
H(u, p) = log K + KL(u ‖ p)
Explanation: The cross-entropy between uniform and p equals the uniform’s entropy plus KL divergence from uniform to p. Minimizing it discourages low-entropy (peaked) predictions.
Smoothing Range
0 ≤ ε < 1
Explanation: The smoothing parameter must be less than 1 to keep targets valid probabilities. Typical practical values are 0.05–0.2.
Optimal Prediction Under LS
p* = y^{(ls)}, so the optimal true-class probability is 1 − ε + ε/K
Explanation: With no modeling constraints, the loss is minimized when the predicted distribution matches the smoothed target. This shows why the method reduces optimal peak probability.
Complexity Analysis
Smoothing a single target takes O(K) time and O(K) memory (one pass over a length-K vector); for a batch of N examples it is O(N·K). As noted in the key points, this is negligible compared to the network's forward and backward passes.
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <stdexcept>
#include <iomanip>

// Create a smoothed label for a K-class problem.
// y_ls = (1 - epsilon) * one_hot + (epsilon / K) * 1
std::vector<double> smoothLabel(int true_class, int K, double epsilon) {
    if (K <= 1) throw std::invalid_argument("K must be > 1");
    if (true_class < 0 || true_class >= K) throw std::out_of_range("true_class out of range");
    if (epsilon < 0.0 || epsilon >= 1.0) throw std::invalid_argument("epsilon must be in [0,1)");

    std::vector<double> y(K, epsilon / K);  // start with uniform mass epsilon/K
    y[true_class] += (1.0 - epsilon);       // add (1 - epsilon) to the true class
    return y;
}

// Batch version: produce smoothed labels for each example in a mini-batch
std::vector<std::vector<double>> smoothBatch(const std::vector<int>& labels, int K, double epsilon) {
    std::vector<std::vector<double>> Y;
    Y.reserve(labels.size());
    for (int t : labels) {
        Y.push_back(smoothLabel(t, K, epsilon));
    }
    return Y;
}

int main() {
    int K = 5;
    int true_class = 2;  // 0-based index
    double epsilon = 0.1;

    auto y = smoothLabel(true_class, K, epsilon);
    std::cout << std::fixed << std::setprecision(4);
    std::cout << "Smoothed label (single): ";
    for (double v : y) std::cout << v << ' ';
    std::cout << "\n";

    std::vector<int> batch_labels = {2, 0, 4};
    auto Y = smoothBatch(batch_labels, K, epsilon);
    std::cout << "Batch smoothed labels:\n";
    for (const auto& row : Y) {
        for (double v : row) std::cout << v << ' ';
        std::cout << '\n';
    }

    return 0;
}
```
This program constructs smoothed target vectors for a K-class classification task using the standard formula y^{(ls)} = (1−ε)·one_hot + ε·u, where u_i = 1/K. The single-example function validates inputs and returns the smoothed distribution. The batch variant applies the same transformation to a vector of class indices, producing a mini-batch of smoothed labels.
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>  // std::min, std::max
#include <stdexcept>
#include <iomanip>

// Compute a numerically stable softmax of logits.
std::vector<double> softmax(const std::vector<double>& z) {
    if (z.empty()) throw std::invalid_argument("logits vector is empty");
    double max_z = z[0];
    for (double v : z) if (v > max_z) max_z = v;  // for stability
    double sum_exp = 0.0;
    std::vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = std::exp(z[i] - max_z);
        sum_exp += p[i];
    }
    for (double& v : p) v /= sum_exp;
    return p;
}

std::vector<double> smoothLabel(int true_class, int K, double epsilon) {
    if (K <= 1) throw std::invalid_argument("K must be > 1");
    if (true_class < 0 || true_class >= K) throw std::out_of_range("true_class out of range");
    if (epsilon < 0.0 || epsilon >= 1.0) throw std::invalid_argument("epsilon must be in [0,1)");
    std::vector<double> y(K, epsilon / K);
    y[true_class] += (1.0 - epsilon);
    return y;
}

// Cross-entropy with label smoothing: L = -sum_i y_ls[i] * log p[i]
// Takes raw logits, applies softmax, then computes the loss.
double crossEntropySmoothedFromLogits(const std::vector<double>& logits, int true_class, double epsilon) {
    auto p = softmax(logits);
    auto y_ls = smoothLabel(true_class, static_cast<int>(p.size()), epsilon);

    const double eps = 1e-15;  // clamp to avoid log(0)
    double loss = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        double pi = std::min(std::max(p[i], eps), 1.0 - eps);
        loss += -y_ls[i] * std::log(pi);
    }
    return loss;
}

// For comparison: standard cross-entropy with hard label
double crossEntropyHardFromLogits(const std::vector<double>& logits, int true_class) {
    auto p = softmax(logits);
    const double eps = 1e-15;
    double pt = std::min(std::max(p[true_class], eps), 1.0 - eps);
    return -std::log(pt);
}

int main() {
    std::vector<double> logits = {1.2, -0.3, 2.1, 0.0};  // example scores
    int true_class = 2;  // correct class index
    double epsilon = 0.1;

    double L_hard = crossEntropyHardFromLogits(logits, true_class);
    double L_smooth = crossEntropySmoothedFromLogits(logits, true_class, epsilon);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Hard-label CE: " << L_hard << "\n";
    std::cout << "Smoothed-label CE: " << L_smooth << "\n";

    return 0;
}
```
This code computes a numerically stable softmax, then evaluates cross-entropy both with and without label smoothing for a single example. It demonstrates how label smoothing only changes the target used in the cross-entropy. Clamping probabilities avoids log(0) issues.
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>

std::vector<double> softmax(const std::vector<double>& z) {
    double max_z = z[0];
    for (double v : z) if (v > max_z) max_z = v;
    double sum_exp = 0.0;
    std::vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = std::exp(z[i] - max_z);
        sum_exp += p[i];
    }
    for (double& v : p) v /= sum_exp;
    return p;
}

std::vector<double> smoothLabel(int true_class, int K, double epsilon) {
    std::vector<double> y(K, epsilon / K);
    y[true_class] += (1.0 - epsilon);
    return y;
}

// Gradient wrt logits z: grad = p - y_ls
std::vector<double> gradLogitsLabelSmoothed(const std::vector<double>& logits, int true_class, double epsilon) {
    auto p = softmax(logits);
    auto y_ls = smoothLabel(true_class, static_cast<int>(p.size()), epsilon);
    std::vector<double> g(p.size());
    for (size_t i = 0; i < p.size(); ++i) g[i] = p[i] - y_ls[i];
    return g;
}

int main() {
    std::vector<double> logits = {2.0, 0.5, -1.0};
    int true_class = 0;
    double epsilon = 0.1;

    auto g_smooth = gradLogitsLabelSmoothed(logits, true_class, epsilon);

    // For comparison, gradient with hard labels: p - y (where y is one-hot)
    auto p = softmax(logits);
    std::vector<double> g_hard = p;
    g_hard[true_class] -= 1.0;

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Gradient (hard): ";
    for (double v : g_hard) std::cout << v << ' ';
    std::cout << "\n";

    std::cout << "Gradient (smooth): ";
    for (double v : g_smooth) std::cout << v << ' ';
    std::cout << "\n";

    return 0;
}
```
This example shows that with label smoothing, the gradient with respect to logits is p − y^{(ls)}. Compared to hard labels, the true class's gradient is less negative (its target is 1 − ε + ε/K rather than 1), while each other class's gradient is reduced by ε/K and can even turn slightly negative, so the model is not pushed to drive incorrect-class probabilities all the way to zero.