
Softmax & Temperature Scaling

Key Points

  • Softmax turns arbitrary real-valued scores (logits) into probabilities that sum to one.
  • Temperature scaling divides logits by a positive scalar τ to control how sharp or smooth the softmax distribution is.
  • Small τ (< 1) makes the distribution peakier (more confident); large τ (> 1) makes it flatter (more uncertain).
  • In the limit, τ → 0+ yields a near one-hot at the argmax, and τ → ∞ approaches a uniform distribution.
  • Numerically stable softmax uses the log-sum-exp trick by subtracting the maximum logit before exponentiating.
  • For calibration, τ is fit on a validation set by minimizing negative log-likelihood without changing the predicted labels' order.
  • The softmax Jacobian scales with 1/τ, which affects gradient magnitudes during learning and calibration.
  • In C++, implement stable softmax, temperature calibration via 1D optimization, and sampling using std::discrete_distribution.

Prerequisites

  • Exponentials and Logarithms — Softmax uses exponentials, and the log-sum-exp trick relies on logarithm properties.
  • Probability Distributions — Softmax outputs must be understood as probabilities that sum to one.
  • Vectors and Argmax — Softmax acts on vectors and preserves ordering; argmax is key to understanding limits as τ → 0.
  • Cross-Entropy / Negative Log-Likelihood — Temperature calibration optimizes NLL on a validation set.
  • Gradient and Chain Rule — Derivatives of softmax with temperature and τ optimization require gradient computations.
  • Floating-Point Arithmetic — Numerical stability (overflow/underflow) motivates the log-sum-exp trick and use of double precision.
  • Random Number Generation in C++ — Sampling from temperature-scaled distributions uses std::mt19937 and std::discrete_distribution.
  • Optimization Basics — Fitting τ involves 1D optimization such as gradient descent or line search.

Detailed Explanation


01 Overview

Softmax is a function that converts a vector of real numbers (called logits) into a probability distribution. Each output is between 0 and 1, and all outputs sum to 1. Temperature scaling is a simple modification that divides the logits by a positive scalar τ (tau) before applying softmax. This single parameter lets you control how confident or uncertain the resulting distribution appears.

In machine learning, logits often come from the final linear layer of a classifier. The plain softmax already transforms them into probabilities, but those probabilities can be overconfident or underconfident relative to true frequencies. Temperature scaling addresses this by stretching or compressing the logits. When τ is small (e.g., 0.5), differences between logits become amplified and the distribution becomes sharply peaked. When τ is large (e.g., 2.0), those differences are damped, producing a smoother, more uniform distribution.

Beyond calibration, temperature is widely used in sampling from language models and reinforcement learning policies to trade off exploitation (choose the current best) and exploration (try other options). Numerically stable computation is critical; we use the log-sum-exp trick to prevent overflow/underflow when exponentiating large or small numbers. In practice, τ can be learned on a validation set by minimizing the negative log-likelihood, improving the alignment between predicted probabilities and observed outcomes without changing which class is predicted.

02 Intuition & Analogies

Imagine you are looking at a leaderboard of test scores. Softmax is like turning those scores into the chance that each student is the top performer, in a way that respects how far apart the scores are. If one student scores far above the rest, their chance becomes very high; if many are close, the chances are more evenly spread.

Now add a “temperature” knob that controls how much you care about small differences. With a low temperature (a strict judge), even tiny score differences lead to big swings in who is favored—softmax becomes spiky and concentrates on the top scorer. With a high temperature (a relaxed judge), you shrug at small differences and give everyone more similar chances—softmax becomes flatter. This is like adjusting your sensitivity: low τ magnifies differences; high τ smooths them.

In language generation, temperature is a creativity dial. Low temperature makes the model conservative, sticking closely to the most likely next word. High temperature encourages more diverse, surprising words. In calibration, temperature acts like a humility dial. If your model tends to be too sure of itself, increasing τ pulls probabilities toward the middle, making them better reflect real-world frequencies, while usually preserving which class ranks first. All the while, softmax keeps the results as valid probabilities that add up to 1.

Because exponentials grow fast, we must compute softmax carefully. Subtracting the largest score before exponentiating keeps numbers in a safe range, just like zeroing a scale before weighing items to avoid overload.

03 Formal Definition

Given logits \( z = (z_1, \dots, z_K) \in \mathbb{R}^K \) and temperature \( \tau > 0 \), the temperature-scaled softmax is
\[ p_i(\tau) = \mathrm{softmax}\!\left(\frac{z}{\tau}\right)_i = \frac{\exp\!\left(\frac{z_i}{\tau}\right)}{\sum_{j=1}^{K} \exp\!\left(\frac{z_j}{\tau}\right)} \quad \text{for } i = 1, \dots, K. \]
The denominator is the partition function: \( Z(\tau) = \sum_{j=1}^{K} \exp(z_j/\tau) \). Ordering is preserved for any \( \tau > 0 \): if \( z_a > z_b \), then \( p_a(\tau) > p_b(\tau) \). Limits: \( \lim_{\tau \to 0^+} p(\tau) \) approaches a one-hot vector at the argmax of z, and \( \lim_{\tau \to \infty} p(\tau) = (1/K, \dots, 1/K) \). The Jacobian of softmax with temperature is
\[ \frac{\partial p_i}{\partial z_k} = \frac{1}{\tau}\, p_i \left(\delta_{ik} - p_k\right), \]
where \( \delta_{ik} \) is the Kronecker delta. This shows gradients scale inversely with \( \tau \). For a labeled example with true class y, the negative log-likelihood is
\[ L(\tau) = -\log p_y(\tau) = -\frac{z_y}{\tau} + \log \sum_{j=1}^{K} \exp\!\left(\frac{z_j}{\tau}\right). \]
Its derivative with respect to \( \tau \) is
\[ \frac{\partial L}{\partial \tau} = \frac{1}{\tau^2}\Big( z_y - \sum_{j=1}^{K} p_j(\tau)\, z_j \Big). \]
Optimizing a single global \( \tau \) over a validation set calibrates probabilities without changing predicted labels.

04 When to Use

  • Probability calibration: After training a classifier, fit a scalar temperature on a held-out validation set to make predicted probabilities better match observed frequencies (improves metrics like ECE or NLL).
  • Sampling/trade-off control: In language models or policy gradients, adjust τ to balance exploitation (low τ) vs. exploration (high τ).
  • Knowledge distillation: Use τ > 1 to soften teacher distributions, making it easier for the student to learn from dark knowledge (non-argmax classes).
  • Differentiable argmax approximation: Use small-τ softmax as a smooth proxy for argmax in optimization or attention mechanisms.
  • Beam search and top-k sampling: Combine with τ to tune diversity and risk during decoding.
  • Score normalization: When combining heterogeneous scores (e.g., ensembling), softmax with a chosen τ can normalize them and control contrast.
  • When not to use: Avoid temperature scaling if you need per-input or per-class calibrated uncertainty (use a more expressive calibration method), or when labels are noisy in a way a single scalar cannot fix.

⚠️Common Mistakes

  • Using τ ≤ 0: Temperature must be strictly positive; optimize over s = log τ to enforce positivity.
  • Ignoring numerical stability: Computing exp(z/τ) directly can overflow; always subtract max(z) before exponentiating (log-sum-exp trick).
  • Confusing logits and probabilities: Do not apply softmax to probabilities again; temperature scaling expects raw logits.
  • Misinterpreting direction: τ < 1 sharpens (more confident); τ > 1 flattens (less confident). Some libraries define inverse temperature β = 1/τ—read docs carefully.
  • Calibrating on training data: Fit τ on a held-out validation set; fitting on training data leads to optimistic (biased) calibration estimates.
  • Expecting accuracy gains: Temperature scaling usually does not change top-1 accuracy because it preserves argmax order; it improves probability calibration and NLL.
  • Forgetting label bounds: Ensure labels are valid indices [0, K-1]; off-by-one errors silently corrupt NLL.
  • Overfitting τ with too many degrees of freedom: Scalar τ is robust; per-class τ can overfit unless you have lots of data.
  • Using low-precision types: For extreme logits or tiny τ, prefer double precision to reduce underflow/overflow.
  • Not constraining optimization: When learning τ by gradient descent, cap step sizes and iterate until convergence; monitor NLL to avoid divergence.

Key Formulas

Temperature-Scaled Softmax

\[ p_i(\tau) = \frac{\exp(z_i/\tau)}{\sum_{j=1}^{K} \exp(z_j/\tau)} \]

Explanation: Converts logits into probabilities while controlling sharpness via τ. Small τ makes p more peaked; large τ makes p flatter.

Low-Temperature Limit

\[ \lim_{\tau \to 0^+} \mathrm{softmax}\!\left(\frac{z}{\tau}\right) = e_{\operatorname{argmax}(z)} \]

Explanation: As τ approaches zero from the positive side, the distribution collapses to a one-hot vector at the largest logit (deterministic argmax). With tied maxima, the limit spreads mass uniformly over the tied entries.

High-Temperature Limit

\[ \lim_{\tau \to \infty} \mathrm{softmax}\!\left(\frac{z}{\tau}\right) = \left(\frac{1}{K}, \dots, \frac{1}{K}\right) \]

Explanation: With very large τ, all logits become nearly indistinguishable after scaling, yielding an approximately uniform distribution.

Softmax Jacobian with Temperature

\[ \frac{\partial p_i}{\partial z_k} = \frac{1}{\tau}\, p_i \left(\delta_{ik} - p_k\right) \]

Explanation: Shows how changing one logit affects each probability. The 1/τ factor scales gradient magnitudes, important for learning and calibration.

Per-Example NLL Under Temperature

\[ L(\tau) = -\log p_y(\tau) = -\frac{z_y}{\tau} + \log \sum_{j=1}^{K} \exp\!\left(\frac{z_j}{\tau}\right) \]

Explanation: The negative log-likelihood of the true class under a temperature-scaled softmax. Used to fit τ on a validation set.

Gradient w.r.t. Temperature

\[ \frac{\partial L}{\partial \tau} = \frac{1}{\tau^2}\left( z_y - \sum_{j=1}^{K} p_j(\tau)\, z_j \right) \]

Explanation: The derivative of NLL with respect to τ for one example. Summing over the dataset gives the gradient used in 1D optimization of τ.

Gradient w.r.t. Log-Temperature

\[ \frac{\partial L}{\partial s} = \frac{\partial L}{\partial \tau} \cdot \frac{\partial \tau}{\partial s} = \frac{1}{\tau}\left( z_y - \sum_{j=1}^{K} p_j(\tau)\, z_j \right), \quad s = \log \tau \]

Explanation: Optimizing over s = log τ ensures τ>0 automatically and often improves numerical stability of the optimizer.

Log-Sum-Exp Trick

\[ \log \sum_{i=1}^{K} \exp(a_i) = m + \log \sum_{i=1}^{K} \exp(a_i - m), \quad m = \max_i a_i \]

Explanation: Rewriting with the maximum prevents overflow when exponentials are large. Essential for stable softmax computations.

Maximum Entropy Characterization

\[ p = \operatorname*{argmax}_{q \in \Delta^{K-1}} \left( \sum_{i=1}^{K} q_i z_i + \tau \cdot H(q) \right), \quad H(q) = -\sum_i q_i \log q_i \]

Explanation: Softmax with temperature is the solution to maximizing expected score plus τ times entropy. τ trades off reward and randomness.

Partition Function

\[ Z(\tau) = \sum_{j=1}^{K} \exp\!\left(\frac{z_j}{\tau}\right) \]

Explanation: The normalizer ensuring probabilities sum to 1. Its logarithm appears in the NLL and is computed stably via log-sum-exp.

Complexity Analysis

For a single softmax with temperature over K classes, we compute the maximum logit (O(K)), subtract it, exponentiate each shifted logit divided by τ (O(K)), and normalize by their sum (O(K)). Time complexity is therefore O(K), and space complexity is O(K) to hold the output probabilities (or O(1) extra if computed in place). The log-sum-exp trick does not change complexity; it only improves numerical stability.

When fitting a scalar temperature τ on a validation set of N examples with K classes each, one pass to compute the negative log-likelihood and its gradient costs O(NK) time (each example requires a softmax). Simple gradient descent over I iterations totals O(INK). Memory is O(K) per example for transient buffers plus O(1) for the running loss and gradient; you need not store probabilities for all examples simultaneously if you stream through the data.

Sampling from a temperature-scaled categorical distribution by computing weights wᵢ = exp((zᵢ − m)/τ) and passing them to std::discrete_distribution takes O(K) to prepare; per-draw cost depends on the implementation (the standard does not mandate alias tables, so a draw is typically O(log K) via a cumulative table, or O(1) with aliasing). If you resample many times with fixed logits and τ, precomputing the normalized probabilities once (O(K)) is efficient.

Practically, double-precision arithmetic is recommended when τ is very small or logits have large magnitude; the overhead is minor compared to the O(K) operations and significantly reduces the risk of overflow/underflow during exponentiation and normalization.

Code Examples

Numerically Stable Softmax with Temperature
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <vector>
using namespace std;

// Compute temperature-scaled softmax probabilities in a numerically stable way.
// - logits: vector of real-valued scores (size K)
// - tau: positive temperature scalar (tau > 0)
// Returns: vector of probabilities summing to 1.
vector<double> softmax_temperature(const vector<double>& logits, double tau) {
    if (!(tau > 0.0)) {
        throw invalid_argument("Temperature tau must be > 0");
    }
    const int K = (int)logits.size();
    if (K == 0) return {};

    // Find max logit for numerical stability
    double m = *max_element(logits.begin(), logits.end());

    // Compute exponentials of shifted logits divided by tau
    vector<double> exps(K);
    double sumExp = 0.0;
    for (int i = 0; i < K; ++i) {
        double x = (logits[i] - m) / tau;  // ensures the largest exponent is exp(0) = 1
        exps[i] = exp(x);
        sumExp += exps[i];
    }

    // Normalize to probabilities
    vector<double> probs(K);
    for (int i = 0; i < K; ++i) {
        probs[i] = exps[i] / sumExp;
    }
    return probs;
}

int main() {
    vector<double> z = {3.1, -0.2, 0.7, 1.5};
    for (double tau : {0.5, 1.0, 2.0}) {
        auto p = softmax_temperature(z, tau);
        cout << fixed << setprecision(6);
        cout << "tau=" << tau << ": ";
        for (double v : p) cout << v << ' ';
        cout << '\n';
    }
    return 0;
}

This program implements softmax with temperature using the log-sum-exp trick by subtracting the maximum logit before exponentiation. Dividing by τ controls how peaked or flat the distribution becomes. The output vectors are valid probability distributions that sum to one.

Time: O(K) per call · Space: O(K)
Fit a Scalar Temperature by Minimizing Validation NLL
#include <algorithm>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <vector>
using namespace std;

// Stable softmax with temperature (as above)
vector<double> softmax_temperature(const vector<double>& logits, double tau) {
    const int K = (int)logits.size();
    double m = *max_element(logits.begin(), logits.end());
    vector<double> exps(K);
    double sumExp = 0.0;
    for (int i = 0; i < K; ++i) {
        double x = (logits[i] - m) / tau;
        exps[i] = exp(x);
        sumExp += exps[i];
    }
    vector<double> p(K);
    for (int i = 0; i < K; ++i) p[i] = exps[i] / sumExp;
    return p;
}

// Compute total Negative Log-Likelihood and gradient w.r.t. s = log(tau)
// over a dataset of logits and integer labels in [0, K-1].
struct LossGrad {
    double loss;    // total NLL
    double grad_s;  // dL/ds where s = log(tau)
};

LossGrad loss_and_grad_logtau(const vector<vector<double>>& logits_list,
                              const vector<int>& labels,
                              double tau) {
    if (!(tau > 0.0)) throw invalid_argument("tau must be > 0");
    const int N = (int)logits_list.size();
    if ((int)labels.size() != N) throw invalid_argument("mismatched sizes");

    double total_loss = 0.0;
    double total_grad_s = 0.0;

    for (int n = 0; n < N; ++n) {
        const auto& z = logits_list[n];
        int y = labels[n];
        if (y < 0 || y >= (int)z.size()) throw out_of_range("label out of range");

        // Softmax probabilities under current tau
        vector<double> p = softmax_temperature(z, tau);

        // Numerically stable per-example NLL = -log p_y (clamped to avoid log(0))
        double py = max(1e-15, p[y]);
        total_loss += -log(py);

        // Gradient w.r.t. s = log(tau): dL/ds = (1/tau) * (z_y - sum_j p_j * z_j)
        double s_pz = 0.0;
        for (size_t j = 0; j < z.size(); ++j) s_pz += p[j] * z[j];
        total_grad_s += (1.0 / tau) * (z[y] - s_pz);
    }

    return {total_loss, total_grad_s};
}

// Fit tau by gradient descent over s = log(tau)
// logits_list: N x K matrix of logits; labels: size N
// Returns the fitted tau > 0
double fit_temperature(const vector<vector<double>>& logits_list,
                       const vector<int>& labels,
                       double init_tau = 1.0,
                       int max_iters = 200,
                       double lr = 0.1) {
    if (!(init_tau > 0.0)) throw invalid_argument("init_tau must be > 0");
    double s = log(init_tau);  // optimize over s to keep tau positive

    for (int it = 0; it < max_iters; ++it) {
        double tau = exp(s);
        auto lg = loss_and_grad_logtau(logits_list, labels, tau);

        // Simple step with decaying learning rate
        double step = lr / sqrt(1.0 + it);
        s -= step * lg.grad_s;  // gradient descent on s

        // Optional: clamp s to a reasonable range to avoid extremes
        s = min(5.0, max(-5.0, s));  // tau in [~0.0067, ~148]

        if (it % 20 == 0) {
            cerr << "iter=" << it << ", tau=" << exp(s) << ", NLL=" << lg.loss << "\n";
        }
    }
    return exp(s);
}

int main() {
    // Tiny synthetic dataset: three 3-class examples with logits and labels
    vector<vector<double>> logits_list = {
        {3.0, 1.0, 0.0},
        {2.5, 2.0, -1.0},
        {0.5, 0.2, 3.1}
    };
    vector<int> labels = {0, 1, 2};

    double tau = fit_temperature(logits_list, labels, /*init_tau=*/1.0, /*max_iters=*/150, /*lr=*/0.2);
    cout << fixed << setprecision(6);
    cout << "Fitted tau = " << tau << "\n";

    // Show calibrated probabilities for the first sample
    auto p = softmax_temperature(logits_list[0], tau);
    cout << "Probs after calibration: ";
    for (double v : p) cout << v << ' ';
    cout << "\n";
    return 0;
}

This example learns a single scalar temperature τ by minimizing the negative log-likelihood on a validation set using gradient descent over s = log τ. The gradient uses dL/ds = (1/τ)(z_y − Σ p_j z_j). Optimizing s ensures τ remains positive. In practice, you would run this on a larger held-out set and stop when NLL stops improving.

Time: O(INK), where I is iterations, N is examples, K is classes · Space: O(K) per example (streaming), O(1) extra for scalars
Sampling from a Temperature-Scaled Softmax (Categorical Draw)
#include <algorithm>
#include <array>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <random>
#include <stdexcept>
#include <vector>
using namespace std;

// Sample an index according to softmax(logits / tau) using stable weights.
int sample_softmax(const vector<double>& logits, double tau, mt19937& rng) {
    if (!(tau > 0.0)) throw invalid_argument("tau must be > 0");
    const int K = (int)logits.size();
    if (K == 0) throw invalid_argument("empty logits");

    double m = *max_element(logits.begin(), logits.end());
    vector<double> weights(K);
    for (int i = 0; i < K; ++i) {
        // discrete_distribution accepts unnormalized, non-negative weights
        weights[i] = exp((logits[i] - m) / tau);
    }
    discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}

int main() {
    vector<double> z = {2.0, 1.0, 0.0};
    mt19937 rng(123);

    // Compare sampling at different temperatures
    for (double tau : {0.5, 1.0, 2.0}) {
        array<int, 3> counts = {0, 0, 0};
        for (int t = 0; t < 10000; ++t) {
            int idx = sample_softmax(z, tau, rng);
            counts[idx]++;
        }
        cout << fixed << setprecision(4);
        cout << "tau=" << tau << ": frequencies ~ ["
             << (counts[0] / 10000.0) << ", "
             << (counts[1] / 10000.0) << ", "
             << (counts[2] / 10000.0) << "]\n";
    }
    return 0;
}

This code samples indices according to a temperature-scaled softmax by turning stabilized exponentials into weights for std::discrete_distribution. With low τ the highest-logit class dominates the samples; with high τ the samples spread across classes more uniformly.

Time: O(K) to set up weights; per-draw cost is implementation-dependent, typically O(log K) or O(1) after initialization · Space: O(K) for the weights
#softmax#temperature scaling#logits#probability calibration#negative log-likelihood#log-sum-exp#gibbs distribution#entropy#sampling#language models#categorical distribution#jacobian#exploration-exploitation#ece calibration#numerical stability