Label Smoothing
Key Points
- Label smoothing replaces a hard one-hot target with a slightly softened distribution to reduce model overconfidence.
- The smoothed target mixes the one-hot vector with the uniform distribution: y^{(ls)} = (1−ε)·y + ε·u for K classes, where u_i = 1/K.
- Using label smoothing with cross-entropy is equivalent to adding a penalty that encourages higher-entropy (less peaked) predictions.
- The gradient for softmax + cross-entropy becomes p − y^{(ls)}, so the model is nudged less aggressively toward absolute certainty.
- Typical ε values are 0.05–0.2; too large an ε can harm accuracy by blurring class distinctions.
- It is most beneficial with small datasets, noisy labels, or when calibration (reliable probabilities) matters.
- Do not apply smoothing at inference, and do not double-smooth if your labels are already soft (e.g., from a teacher model).
- Time and memory costs are minimal: O(K) per example, which is negligible compared to the forward/backward passes of the network.
Prerequisites
- Probability distributions — Understanding targets as probability vectors and the role of the uniform distribution.
- Vectors and arrays — Targets and predictions are K-dimensional vectors manipulated element-wise.
- Softmax — Maps logits to probabilities; essential for multi-class cross-entropy.
- Cross-entropy loss — Label smoothing plugs into cross-entropy by changing the target distribution.
- Gradient descent / backpropagation — To see how smoothing changes gradients (p − y^{(ls)}).
- Numerical stability (log-sum-exp) — Stable softmax avoids overflow/underflow in exponentials and logs.
- Overfitting and regularization — Label smoothing is a regularization method aimed at better generalization.
- One-hot encoding — The baseline target representation that smoothing modifies.
Detailed Explanation
01 Overview
Label smoothing is a simple regularization technique for classification. Instead of training a model against hard one-hot targets that demand absolute certainty, we replace each target with a softened probability distribution. For K classes and smoothing parameter ε in [0,1), we assign (1−ε)+ε/K to the correct class and ε/K to each incorrect class. This subtly discourages the model from placing probability 1.0 on the correct class and 0.0 on all others, which can improve generalization and probability calibration. Practically, you keep the same loss (cross-entropy), but you change the target vector used in that loss. The change is small but impactful: it reduces the incentive to memorize noise and punishes overly confident predictions. Label smoothing is widely used in computer vision, natural language processing (e.g., attention-based sequence models), and other multi-class problems, often improving both validation accuracy and calibration with negligible computational overhead.
02 Intuition & Analogies
Imagine grading a multiple-choice exam where you know a student selected the correct answer, but you also recognize that the question was a bit ambiguous. Rather than awarding a rigid 100% to the correct option and 0% to all others, you give a tiny bit of “partial credit” to other plausible choices. This does not change who is right, but it communicates uncertainty in a measured, systematic way. In model training, one-hot labels say “be 100% certain the correct class is right and 0% for the rest.” Real-world data often contains noise: mislabeled images, ambiguous categories, or overlapping classes. If we always demand absolute certainty, the model can become overconfident and overfit, memorizing quirks of the training set. Label smoothing tells the model: “Aim for high confidence, but leave room for doubt.” Another analogy: a weather forecast rarely says 100% chance of rain; even on stormy days it might say 90–95%. The forecast remains decisive but avoids extremes unless warranted. Similarly, label smoothing encourages the network to distribute a small fraction of probability mass to other classes, which acts like a gentle regularizer. The result is typically better calibration (probabilities that match reality), less overfitting to noise, and sometimes improved top-1 accuracy. The tweak is tiny—just adjust the targets—but the downstream effects on learning dynamics are meaningful.
03 Formal Definition
Given K classes, a one-hot target y with y_c = 1 for the true class c, and a smoothing parameter ε in [0,1), the smoothed target is y^{(ls)} = (1−ε)·y + ε·u, where u is the uniform distribution with u_i = 1/K. Component-wise, y_i^{(ls)} = 1−ε+ε/K if i = c, and ε/K otherwise. Training minimizes the usual cross-entropy, with y^{(ls)} in place of y: L = −Σ_i y_i^{(ls)} log p_i, where p = softmax(z).
04 When to Use
- When models overfit or become overconfident on training data, especially in small or noisy datasets.
- In large-scale vision tasks (e.g., ImageNet) and NLP sequence models (e.g., translation) where label smoothing is a standard baseline that often improves validation metrics and calibration.
- When calibrated probabilities matter (risk-sensitive decisions), because smoothing reduces extreme probabilities and better aligns confidence with accuracy.
- When labels may be imperfect, ambiguous, or weakly defined; smoothing reduces sensitivity to outliers and mislabeled examples.
- When you want a drop-in regularizer that requires almost no code change and adds negligible computation.
Avoid or adapt it when: (a) labels are already soft (e.g., knowledge distillation or human-provided probability labels); (b) you have multi-label problems with independent binary targets — use a binary variant (soften 0/1 toward 0.5) rather than multi-class smoothing; (c) you require perfectly sharp posteriors for some reason (rare); or (d) extreme class imbalance suggests class-dependent smoothing instead of a single global ε.
⚠️Common Mistakes
- Applying smoothing at inference time. Only targets used for training should be smoothed; never smooth model predictions at test time.
- Using too large an ε (e.g., ≥ 0.3 for small K), which blurs distinctions and can hurt accuracy. Start with 0.05–0.1 and tune.
- Forgetting to use the correct number of classes K when computing ε/K, especially if some classes are missing in a batch.
- Double-smoothing: if your labels are already soft (e.g., teacher-student distillation or mixup), adding label smoothing may overly flatten targets.
- Applying multi-class smoothing to multi-label problems that use binary cross-entropy; instead, soften each binary target toward 0.5: y^{(ls)} = (1−ε)·y + ε·0.5.
- Numerical instability from taking log p_i when p_i is extremely small; use a stable softmax and clamp probabilities if needed.
- Misinterpreting accuracy drops on training: smoothing often lowers training accuracy while improving validation accuracy and calibration; judge by validation metrics and reliability, not just training accuracy.
Key Formulas
Label Smoothing Definition
y^{(ls)} = (1−ε)·y + ε·u
Explanation: The smoothed target is a convex combination of the one-hot label y and the uniform distribution u. This reduces the target’s peak and spreads a small mass over other classes.
Uniform Distribution
u_i = 1/K for i = 1, …, K
Explanation: Each class gets equal probability in the uniform distribution. This is the baseline mixed into the one-hot label during smoothing.
Cross-Entropy
H(y, p) = −Σ_i y_i log p_i
Explanation: The cross-entropy between target y and prediction p is the negative expected log-probability of the true label. Minimizing this encourages the model to assign high probability to the true class.
Smoothed Cross-Entropy
L_ls = −Σ_i y_i^{(ls)} log p_i
Explanation: Training with label smoothing simply replaces y with y^{(ls)} in the cross-entropy. Computation and gradients stay the same form.
Mixture Loss View
L_ls = (1−ε)·H(y, p) + ε·H(u, p)
Explanation: Label smoothing equals a weighted average of the standard cross-entropy and a penalty that pushes predictions toward uniform. This reveals its regularizing effect.
Softmax
p_i = exp(z_i) / Σ_j exp(z_j)
Explanation: Softmax converts logits z into a probability distribution p. It is used with cross-entropy in multi-class classification.
Gradient w.r.t. Logits
∂L_ls/∂z_i = p_i − y_i^{(ls)}
Explanation: For softmax plus cross-entropy, the derivative keeps the same simple form: prediction minus (smoothed) target. This makes implementation straightforward.
Uniform CE Decomposition
H(u, p) = log K + KL(u ‖ p)
Explanation: The cross-entropy between uniform and p equals the uniform’s entropy plus KL divergence from uniform to p. Minimizing it discourages low-entropy (peaked) predictions.
Smoothing Range
0 ≤ ε < 1
Explanation: The smoothing parameter must be less than 1 to keep targets valid probabilities. Typical practical values are 0.05–0.2.
Optimal Prediction Under LS
p* = y^{(ls)}, so the optimal true-class probability is 1 − ε + ε/K
Explanation: With no modeling constraints, the loss is minimized when the predicted distribution matches the smoothed target. This shows why the method reduces optimal peak probability.
Complexity Analysis
Smoothing a single target takes O(K) time and O(K) memory (one pass over a length-K vector); for a batch of N examples it is O(N·K). As noted in the key points, this is negligible compared to the network's forward and backward passes.
Code Examples
```cpp
#include <iostream>
#include <vector>
#include <stdexcept>
#include <iomanip>

// Create a smoothed label for a K-class problem.
// y_ls = (1 - epsilon) * one_hot + (epsilon / K) * 1
std::vector<double> smoothLabel(int true_class, int K, double epsilon) {
    if (K <= 1) throw std::invalid_argument("K must be > 1");
    if (true_class < 0 || true_class >= K) throw std::out_of_range("true_class out of range");
    if (epsilon < 0.0 || epsilon >= 1.0) throw std::invalid_argument("epsilon must be in [0,1)");

    std::vector<double> y(K, epsilon / K);  // start with uniform mass epsilon/K
    y[true_class] += (1.0 - epsilon);       // add (1 - epsilon) to the true class
    return y;
}

// Batch version: produce smoothed labels for each example in a mini-batch
std::vector<std::vector<double>> smoothBatch(const std::vector<int>& labels, int K, double epsilon) {
    std::vector<std::vector<double>> Y;
    Y.reserve(labels.size());
    for (int t : labels) {
        Y.push_back(smoothLabel(t, K, epsilon));
    }
    return Y;
}

int main() {
    int K = 5;
    int true_class = 2;  // 0-based index
    double epsilon = 0.1;

    auto y = smoothLabel(true_class, K, epsilon);
    std::cout << std::fixed << std::setprecision(4);
    std::cout << "Smoothed label (single): ";
    for (double v : y) std::cout << v << ' ';
    std::cout << "\n";

    std::vector<int> batch_labels = {2, 0, 4};
    auto Y = smoothBatch(batch_labels, K, epsilon);
    std::cout << "Batch smoothed labels:\n";
    for (const auto& row : Y) {
        for (double v : row) std::cout << v << ' ';
        std::cout << '\n';
    }

    return 0;
}
```
This program constructs smoothed target vectors for a K-class classification task using the standard formula y^{(ls)} = (1−ε)·one_hot + ε·u, where u_i = 1/K. The single-example function validates inputs and returns the smoothed distribution. The batch variant applies the same transformation to a vector of class indices, producing a mini-batch of smoothed labels.
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <algorithm>  // std::min, std::max
#include <stdexcept>
#include <iomanip>

// Compute a numerically stable softmax of logits.
std::vector<double> softmax(const std::vector<double>& z) {
    if (z.empty()) throw std::invalid_argument("logits vector is empty");
    double max_z = z[0];
    for (double v : z) if (v > max_z) max_z = v;  // for stability
    double sum_exp = 0.0;
    std::vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = std::exp(z[i] - max_z);
        sum_exp += p[i];
    }
    for (double& v : p) v /= sum_exp;
    return p;
}

std::vector<double> smoothLabel(int true_class, int K, double epsilon) {
    if (K <= 1) throw std::invalid_argument("K must be > 1");
    if (true_class < 0 || true_class >= K) throw std::out_of_range("true_class out of range");
    if (epsilon < 0.0 || epsilon >= 1.0) throw std::invalid_argument("epsilon must be in [0,1)");
    std::vector<double> y(K, epsilon / K);
    y[true_class] += (1.0 - epsilon);
    return y;
}

// Cross-entropy with label smoothing: L = -sum_i y_ls[i] * log p[i]
// Takes raw logits, applies softmax, then computes the loss.
double crossEntropySmoothedFromLogits(const std::vector<double>& logits, int true_class, double epsilon) {
    auto p = softmax(logits);
    auto y_ls = smoothLabel(true_class, static_cast<int>(p.size()), epsilon);

    const double eps = 1e-15;  // clamp to avoid log(0)
    double loss = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        double pi = std::min(std::max(p[i], eps), 1.0 - eps);
        loss += -y_ls[i] * std::log(pi);
    }
    return loss;
}

// For comparison: standard cross-entropy with hard label
double crossEntropyHardFromLogits(const std::vector<double>& logits, int true_class) {
    auto p = softmax(logits);
    const double eps = 1e-15;
    double pt = std::min(std::max(p[true_class], eps), 1.0 - eps);
    return -std::log(pt);
}

int main() {
    std::vector<double> logits = {1.2, -0.3, 2.1, 0.0};  // example scores
    int true_class = 2;  // correct class index
    double epsilon = 0.1;

    double L_hard = crossEntropyHardFromLogits(logits, true_class);
    double L_smooth = crossEntropySmoothedFromLogits(logits, true_class, epsilon);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Hard-label CE: " << L_hard << "\n";
    std::cout << "Smoothed-label CE: " << L_smooth << "\n";

    return 0;
}
```
This code computes a numerically stable softmax, then evaluates cross-entropy both with and without label smoothing for a single example. It demonstrates how label smoothing only changes the target used in the cross-entropy. Clamping probabilities avoids log(0) issues.
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <iomanip>

std::vector<double> softmax(const std::vector<double>& z) {
    double max_z = z[0];
    for (double v : z) if (v > max_z) max_z = v;
    double sum_exp = 0.0;
    std::vector<double> p(z.size());
    for (size_t i = 0; i < z.size(); ++i) {
        p[i] = std::exp(z[i] - max_z);
        sum_exp += p[i];
    }
    for (double& v : p) v /= sum_exp;
    return p;
}

std::vector<double> smoothLabel(int true_class, int K, double epsilon) {
    std::vector<double> y(K, epsilon / K);
    y[true_class] += (1.0 - epsilon);
    return y;
}

// Gradient wrt logits z: grad = p - y_ls
std::vector<double> gradLogitsLabelSmoothed(const std::vector<double>& logits, int true_class, double epsilon) {
    auto p = softmax(logits);
    auto y_ls = smoothLabel(true_class, static_cast<int>(p.size()), epsilon);
    std::vector<double> g(p.size());
    for (size_t i = 0; i < p.size(); ++i) g[i] = p[i] - y_ls[i];
    return g;
}

int main() {
    std::vector<double> logits = {2.0, 0.5, -1.0};
    int true_class = 0;
    double epsilon = 0.1;

    auto g_smooth = gradLogitsLabelSmoothed(logits, true_class, epsilon);

    // For comparison, gradient with hard labels: p - y (where y is one-hot)
    auto p = softmax(logits);
    std::vector<double> g_hard = p;
    g_hard[true_class] -= 1.0;

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "Gradient (hard): ";
    for (double v : g_hard) std::cout << v << ' ';
    std::cout << "\n";

    std::cout << "Gradient (smooth): ";
    for (double v : g_smooth) std::cout << v << ' ';
    std::cout << "\n";

    return 0;
}
```
This example shows that with label smoothing, the gradient with respect to logits is p − y^{(ls)}. Compared to hard labels, the true class's gradient is less negative (its target is 1 − ε + ε/K rather than 1), while each other class's gradient is reduced by ε/K and can even turn slightly negative, so the model is not pushed to drive incorrect-class probabilities all the way to zero.