Softmax & Temperature Scaling
Key Points
- Softmax turns arbitrary real-valued scores (logits) into probabilities that sum to one.
- Temperature scaling divides logits by a positive scalar τ to control how sharp or smooth the softmax distribution is.
- Small τ (< 1) makes the distribution peakier (more confident); large τ (> 1) makes it flatter (more uncertain).
- In the limit, τ → 0+ yields a near one-hot at the argmax, and τ → ∞ approaches a uniform distribution.
- Numerically stable softmax uses the log-sum-exp trick by subtracting the maximum logit before exponentiating.
- For calibration, τ is fit on a validation set by minimizing negative log-likelihood without changing the order of predicted labels.
- The softmax Jacobian scales with 1/τ, which affects gradient magnitudes during learning and calibration.
- In C++, implement stable softmax, temperature calibration via 1D optimization, and sampling using std::discrete_distribution.
Prerequisites
- Exponentials and Logarithms — Softmax uses exponentials, and the log-sum-exp trick relies on logarithm properties.
- Probability Distributions — Softmax outputs must be understood as probabilities that sum to one.
- Vectors and Argmax — Softmax acts on vectors and preserves ordering; argmax is key to understanding limits as τ → 0.
- Cross-Entropy / Negative Log-Likelihood — Temperature calibration optimizes NLL on a validation set.
- Gradient and Chain Rule — Derivatives of softmax with temperature and τ optimization require gradient computations.
- Floating-Point Arithmetic — Numerical stability (overflow/underflow) motivates the log-sum-exp trick and use of double precision.
- Random Number Generation in C++ — Sampling from temperature-scaled distributions uses std::mt19937 and std::discrete_distribution.
- Optimization Basics — Fitting τ involves 1D optimization such as gradient descent or line search.
Detailed Explanation
01 Overview
Softmax is a function that converts a vector of real numbers (called logits) into a probability distribution. Each output is between 0 and 1, and all outputs sum to 1. Temperature scaling is a simple modification that divides the logits by a positive scalar τ (tau) before applying softmax. This single parameter lets you control how confident or uncertain the resulting distribution appears.
In machine learning, logits often come from the final linear layer of a classifier. The plain softmax already transforms them into probabilities, but those probabilities can be overconfident or underconfident relative to true frequencies. Temperature scaling addresses this by stretching or compressing the logits. When τ is small (e.g., 0.5), differences between logits become amplified and the distribution becomes sharply peaked. When τ is large (e.g., 2.0), those differences are damped, producing a smoother, more uniform distribution.
Beyond calibration, temperature is widely used in sampling from language models and reinforcement learning policies to trade off exploitation (choose the current best) and exploration (try other options). Numerically stable computation is critical; we use the log-sum-exp trick to prevent overflow/underflow when exponentiating large or small numbers. In practice, τ can be learned on a validation set by minimizing the negative log-likelihood, improving the alignment between predicted probabilities and observed outcomes without changing which class is predicted.
02 Intuition & Analogies
Imagine you are looking at a leaderboard of test scores. Softmax is like turning those scores into the chance that each student is the top performer, in a way that respects how far apart the scores are. If one student scores far above the rest, their chance becomes very high; if many are close, the chances are more evenly spread.
Now add a “temperature” knob that controls how much you care about small differences. With a low temperature (a strict judge), even tiny score differences lead to big swings in who is favored—softmax becomes spiky and concentrates on the top scorer. With a high temperature (a relaxed judge), you shrug at small differences and give everyone more similar chances—softmax becomes flatter. This is like adjusting your sensitivity: low τ magnifies differences; high τ smooths them.
In language generation, temperature is a creativity dial. Low temperature makes the model conservative, sticking closely to the most likely next word. High temperature encourages more diverse, surprising words. In calibration, temperature acts like a humility dial. If your model tends to be too sure of itself, increasing τ pulls probabilities toward the middle, making them better reflect real-world frequencies, while usually preserving which class ranks first. All the while, softmax keeps the results as valid probabilities that add up to 1.
Because exponentials grow fast, we must compute softmax carefully. Subtracting the largest score before exponentiating keeps numbers in a safe range, just like zeroing a scale before weighing items to avoid overload.
03 Formal Definition
Given a logit vector (z \in \mathbb{R}^K) and a temperature (\tau > 0), the temperature-scaled softmax is (\sigma_\tau(z)_i = \frac{\exp(z_i/\tau)}{\sum_{j=1}^{K} \exp(z_j/\tau)}) for (i = 1, \dots, K). Each output lies in (0, 1) and the outputs sum to 1, so (\sigma_\tau(z)) is a probability distribution over K classes; (\tau = 1) recovers the standard softmax.
04 When to Use
- Probability calibration: After training a classifier, fit a scalar temperature on a held-out validation set to make predicted probabilities better match observed frequencies (improves metrics like ECE or NLL).
- Sampling/trade-off control: In language models or policy gradients, adjust (\tau) to balance exploitation (low (\tau)) vs exploration (high (\tau)).
- Knowledge distillation: Use (\tau>1) to soften teacher distributions, making it easier for the student to learn from dark knowledge (non-argmax classes).
- Differentiable argmax approximation: Use small (\tau) softmax as a smooth proxy for argmax in optimization or attention mechanisms.
- Beam search and top-k sampling: Combine with (\tau) to tune diversity and risk during decoding.
- Score normalization: When combining heterogeneous scores (e.g., ensembling), softmax with a chosen (\tau) can normalize and control contrast.

Avoid temperature scaling if you need calibrated uncertainty per input feature or per class (use a more expressive calibration method), or when labels are noisy in a way that a single temperature cannot fix.
⚠️ Common Mistakes
- Using τ ≤ 0: Temperature must be strictly positive; optimize over s = log τ to enforce positivity.
- Ignoring numerical stability: Computing exp(z/τ) directly can overflow; always subtract max(z) before exponentiating (log-sum-exp trick).
- Confusing logits and probabilities: Do not apply softmax to probabilities again; temperature scaling expects raw logits.
- Misinterpreting direction: τ < 1 sharpens (more confident); τ > 1 flattens (less confident). Some libraries define inverse temperature β = 1/τ—read docs carefully.
- Calibrating on training data: Fit τ on a held-out validation set; fitting on training data leads to optimistic (biased) calibration estimates.
- Expecting accuracy gains: Temperature scaling usually does not change top-1 accuracy because it preserves argmax order; it improves probability calibration and NLL.
- Forgetting label bounds: Ensure labels are valid indices [0, K-1]; off-by-one errors silently corrupt NLL.
- Overfitting τ with too many degrees of freedom: Scalar τ is robust; per-class τ can overfit unless you have lots of data.
- Using low-precision types: For extreme logits or tiny τ, prefer double precision to reduce underflow/overflow.
- Not constraining optimization: When learning τ by gradient descent, cap step sizes and iterate until convergence; monitor NLL to avoid divergence.
Key Formulas
Temperature-Scaled Softmax

(p_i = \frac{\exp(z_i/\tau)}{\sum_{j=1}^{K} \exp(z_j/\tau)})

Explanation: Converts logits into probabilities while controlling sharpness via τ. Small τ makes p more peaked; large τ makes p flatter.
Low-Temperature Limit

(\lim_{\tau \to 0^+} p_i = \mathbb{1}[\, i = \arg\max_j z_j \,]) (assuming the maximum logit is unique)

Explanation: As τ approaches zero from the positive side, the distribution collapses to a one-hot vector at the largest logit (deterministic argmax).
High-Temperature Limit

(\lim_{\tau \to \infty} p_i = \frac{1}{K})

Explanation: With very large τ, all logits become nearly indistinguishable after scaling, yielding an approximately uniform distribution.
Softmax Jacobian with Temperature

(\frac{\partial p_i}{\partial z_j} = \frac{1}{\tau}\, p_i \left(\delta_{ij} - p_j\right))

Explanation: Shows how changing one logit affects each probability. The 1/τ factor scales gradient magnitudes, important for learning and calibration.
Per-Example NLL Under Temperature

(\mathcal{L}(\tau) = -\log p_y = -\frac{z_y}{\tau} + \log \sum_{j=1}^{K} \exp(z_j/\tau))

Explanation: The negative log-likelihood of the true class under a temperature-scaled softmax. Used to fit τ on a validation set.
Gradient w.r.t. Temperature

(\frac{d\mathcal{L}}{d\tau} = \frac{1}{\tau^2}\left(z_y - \sum_j p_j z_j\right))

Explanation: The derivative of NLL with respect to τ for one example. Summing over the dataset gives the gradient used in 1D optimization of τ.
Gradient w.r.t. Log-Temperature

(\frac{d\mathcal{L}}{ds} = \frac{1}{\tau}\left(z_y - \sum_j p_j z_j\right), \quad s = \log \tau)

Explanation: Optimizing over s = log τ ensures τ > 0 automatically and often improves numerical stability of the optimizer.
Log-Sum-Exp Trick

(\log \sum_i \exp(x_i) = m + \log \sum_i \exp(x_i - m), \quad m = \max_i x_i)

Explanation: Rewriting with the maximum prevents overflow when exponentials are large. Essential for stable softmax computations.
Maximum Entropy Characterization

(p = \arg\max_{q \in \Delta^{K-1}} \left[ \sum_i q_i z_i + \tau H(q) \right], \quad H(q) = -\sum_i q_i \log q_i)

Explanation: Softmax with temperature is the solution to maximizing expected score plus τ times entropy. τ trades off reward and randomness.
Partition Function

(Z(\tau) = \sum_{j=1}^{K} \exp(z_j/\tau))

Explanation: The normalizer ensuring probabilities sum to 1. Its logarithm appears in the NLL and is computed stably via log-sum-exp.
Complexity Analysis
Computing softmax (or sampling from it) over K classes takes O(K) time and O(K) extra space. Evaluating the NLL and its gradient over N validation examples costs O(NK), so fitting τ with T iterations of 1D gradient descent is O(NKT).
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute temperature-scaled softmax probabilities in a numerically stable way.
// - logits: vector of real-valued scores (size K)
// - tau: positive temperature scalar (tau > 0)
// Returns: vector of probabilities summing to 1.
vector<double> softmax_temperature(const vector<double>& logits, double tau) {
    if (!(tau > 0.0)) {
        throw invalid_argument("Temperature tau must be > 0");
    }
    const int K = (int)logits.size();
    if (K == 0) return {};

    // Find max logit for numerical stability
    double m = *max_element(logits.begin(), logits.end());

    // Compute exponentials of shifted logits divided by tau
    vector<double> exps(K);
    double sumExp = 0.0;
    for (int i = 0; i < K; ++i) {
        double x = (logits[i] - m) / tau; // ensures max exponent is exp(0) = 1
        double e = exp(x);
        exps[i] = e;
        sumExp += e;
    }

    // Normalize to probabilities
    vector<double> probs(K);
    for (int i = 0; i < K; ++i) {
        probs[i] = exps[i] / sumExp;
    }
    return probs;
}

int main() {
    vector<double> z = {3.1, -0.2, 0.7, 1.5};
    for (double tau : {0.5, 1.0, 2.0}) {
        auto p = softmax_temperature(z, tau);
        cout << fixed << setprecision(6);
        cout << "tau=" << tau << ": ";
        for (double v : p) cout << v << ' ';
        cout << '\n';
    }
    return 0;
}
```
This program implements softmax with temperature using the log-sum-exp trick by subtracting the maximum logit before exponentiation. Dividing by τ controls how peaked or flat the distribution becomes. The output vectors are valid probability distributions that sum to one.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Stable softmax with temperature (as above)
vector<double> softmax_temperature(const vector<double>& logits, double tau) {
    const int K = (int)logits.size();
    double m = *max_element(logits.begin(), logits.end());
    vector<double> exps(K);
    double sumExp = 0.0;
    for (int i = 0; i < K; ++i) {
        double x = (logits[i] - m) / tau;
        exps[i] = exp(x);
        sumExp += exps[i];
    }
    vector<double> p(K);
    for (int i = 0; i < K; ++i) p[i] = exps[i] / sumExp;
    return p;
}

// Compute total Negative Log-Likelihood and gradient w.r.t. s = log(tau)
// over a dataset of logits and integer labels in [0, K-1].
struct LossGrad {
    double loss;   // total NLL
    double grad_s; // dL/ds where s = log(tau)
};

LossGrad loss_and_grad_logtau(const vector<vector<double>>& logits_list,
                              const vector<int>& labels,
                              double tau) {
    if (!(tau > 0.0)) throw invalid_argument("tau must be > 0");
    const int N = (int)logits_list.size();
    if ((int)labels.size() != N) throw invalid_argument("mismatched sizes");

    double total_loss = 0.0;
    double total_grad_s = 0.0;

    for (int n = 0; n < N; ++n) {
        const auto& z = logits_list[n];
        int y = labels[n];
        if (y < 0 || y >= (int)z.size()) throw out_of_range("label out of range");

        // Softmax probabilities under current tau
        vector<double> p = softmax_temperature(z, tau);

        // Per-example NLL = -log p_y
        double py = max(1e-15, p[y]); // clamp to avoid log(0)
        total_loss += -log(py);

        // Gradient w.r.t. s = log(tau): dL/ds = (1/tau) * (z_y - sum_j p_j * z_j)
        double s_pz = 0.0;
        for (size_t j = 0; j < z.size(); ++j) s_pz += p[j] * z[j];
        total_grad_s += (1.0 / tau) * (z[y] - s_pz);
    }

    return {total_loss, total_grad_s};
}

// Fit tau by gradient descent over s = log(tau)
// logits_list: N x K matrix of logits; labels: size N
// Returns the fitted tau > 0
double fit_temperature(const vector<vector<double>>& logits_list,
                       const vector<int>& labels,
                       double init_tau = 1.0,
                       int max_iters = 200,
                       double lr = 0.1) {
    if (!(init_tau > 0.0)) throw invalid_argument("init_tau must be > 0");
    double s = log(init_tau); // optimize over s to keep tau positive

    for (int it = 0; it < max_iters; ++it) {
        double tau = exp(s);
        auto lg = loss_and_grad_logtau(logits_list, labels, tau);

        // Simple step with decaying learning rate
        double step = lr / sqrt(1.0 + it);
        s -= step * lg.grad_s; // gradient descent on s

        // Optional: clamp s to a reasonable range to avoid extremes
        s = min(5.0, max(-5.0, s)); // tau in [~0.0067, ~148]

        if (it % 20 == 0) {
            cerr << "iter=" << it << ", tau=" << exp(s) << ", NLL=" << lg.loss << "\n";
        }
    }
    return exp(s);
}

int main() {
    // Tiny synthetic dataset: three 3-class examples with logits and labels
    vector<vector<double>> logits_list = {
        {3.0, 1.0, 0.0},
        {2.5, 2.0, -1.0},
        {0.5, 0.2, 3.1}
    };
    vector<int> labels = {0, 1, 2};

    double tau = fit_temperature(logits_list, labels, /*init_tau=*/1.0,
                                 /*max_iters=*/150, /*lr=*/0.2);
    cout << fixed << setprecision(6);
    cout << "Fitted tau = " << tau << "\n";

    // Show calibrated probabilities for the first sample
    auto p = softmax_temperature(logits_list[0], tau);
    cout << "Probs after calibration: ";
    for (double v : p) cout << v << ' ';
    cout << "\n";
    return 0;
}
```
This example learns a single scalar temperature τ by minimizing the negative log-likelihood on a validation set using gradient descent over s = log τ. The gradient uses dL/ds = (1/τ)(z_y − Σ p_j z_j). Optimizing s ensures τ remains positive. In practice, you would run this on a larger held-out set and stop when NLL stops improving.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Sample an index according to softmax(logits / tau) using stable weights.
int sample_softmax(const vector<double>& logits, double tau, mt19937& rng) {
    if (!(tau > 0.0)) throw invalid_argument("tau must be > 0");
    const int K = (int)logits.size();
    if (K == 0) throw invalid_argument("empty logits");

    double m = *max_element(logits.begin(), logits.end());
    vector<double> weights(K);
    for (int i = 0; i < K; ++i) {
        // We can pass unnormalized, non-negative weights to discrete_distribution
        weights[i] = exp((logits[i] - m) / tau);
    }
    discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}

int main() {
    vector<double> z = {2.0, 1.0, 0.0};
    mt19937 rng(123);

    // Compare sampling at different temperatures
    for (double tau : {0.5, 1.0, 2.0}) {
        array<int, 3> counts = {0, 0, 0};
        for (int t = 0; t < 10000; ++t) {
            int idx = sample_softmax(z, tau, rng);
            counts[idx]++;
        }
        cout << fixed << setprecision(4);
        cout << "tau=" << tau << ": frequencies ~ ["
             << (counts[0] / 10000.0) << ", "
             << (counts[1] / 10000.0) << ", "
             << (counts[2] / 10000.0) << "]\n";
    }
    return 0;
}
```
This code samples indices according to a temperature-scaled softmax by turning stabilized exponentials into weights for std::discrete_distribution. With low τ the highest-logit class dominates the samples; with high τ the samples spread across classes more uniformly.