Focal Loss
Key Points
- Focal Loss reshapes cross-entropy so that hard, misclassified examples get more focus while easy, well-classified ones are down-weighted.
- It introduces a focusing parameter γ (gamma) and a class-balancing weight α (alpha) to handle class imbalance and reduce the dominance of easy negatives.
- For binary classification, Focal Loss is FL(p_t) = -(1 - p_t)^γ log(p_t), where p_t is the model’s probability for the true class.
- When γ = 0, Focal Loss reduces to standard (weighted) cross-entropy, making it a drop-in generalization.
- Numerically stable implementations require clipping probabilities and using stable sigmoid/softmax computations to avoid log(0).
- In practice, γ is often set between 1 and 3, and α is used to balance rare positive classes against abundant negatives.
- Computational complexity is similar to cross-entropy: O(NC) per batch for C classes and N samples, with minimal extra overhead for the modulating term.
- Focal Loss is especially effective in object detection and any task with extreme class imbalance or many easy negatives.
Prerequisites
- Logistic Regression — Provides the probabilistic classification framework and introduces logits and sigmoid.
- Cross-Entropy Loss — Focal loss generalizes cross-entropy; understanding CE clarifies the role of focusing and class weights.
- Softmax and Sigmoid Functions — Required to map logits to probabilities used inside focal loss.
- Chain Rule and Basic Gradients — Backpropagation through sigmoid/softmax and the focal term relies on the chain rule.
- Numerical Stability (Log-Sum-Exp, Clipping) — Prevents overflow/underflow when computing logs and exponentials.
- C++ Vectors and Loops — Needed to implement and understand the provided code examples.
- Matrix/Vector Inner Products — Used to compute logits in linear models and in gradient updates.
- Imbalanced Classification Metrics — Precision/recall and confusion matrices contextualize why focal loss helps.
Detailed Explanation
01 Overview
Focal Loss is a modification of cross-entropy designed to tackle class imbalance and the overwhelming presence of easy examples during training. In many real-world datasets (e.g., detecting rare events), standard cross-entropy lets the numerous easy negatives dominate the gradient, causing the model to under-learn from rare but important positives. Focal Loss addresses this by down-weighting easy examples and focusing learning on hard, misclassified ones. It achieves this through two knobs: a focusing parameter gamma (γ) that reduces the contribution of well-classified examples, and a weighting parameter alpha (α) that balances classes by giving minority classes a proportionally higher weight. Intuitively, when the model is already confident about an example, its loss and gradient shrink sharply; when the model is wrong or uncertain, the loss remains significant. This behavior leads to faster progress on underrepresented or difficult examples without heavily changing the training pipeline. Importantly, when γ = 0, Focal Loss recovers standard weighted cross-entropy, so it’s a conservative extension rather than a completely new objective.
02 Intuition & Analogies
Imagine you’re grading homework for a very large class. Most students already get the easy questions right; spending more time on them won’t improve learning much. Instead, you want to devote your attention to the students who consistently miss certain questions. Focal Loss is that strategy in math form. It looks at the model’s confidence p_t (the probability it assigns to the correct class). If p_t is high, the example is easy; Focal Loss multiplies the usual cross-entropy by (1 - p_t)^γ, which is small, so you spend little effort there. If p_t is low, the example is hard; the multiplier is near 1, so the loss stays large and the model focuses on it. The α parameter ensures minority classes are not ignored simply because they are few; it scales their contribution up so they remain visible in the learning process. Think of a tutor who watches which questions a student gets wrong and leans into those: Focal Loss down-weights easy cases and up-weights hard ones and rare classes. In object detection, for example, there are far more background boxes than object boxes; Focal Loss prevents the overwhelming number of easy background negatives from drowning out the signal from true objects.
03 Formal Definition
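Stated compactly (a standard formulation, consistent with the Key Formulas listed below; p is the sigmoid probability of class 1):

```latex
p = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
p_t = \begin{cases} p & \text{if } y = 1,\\ 1 - p & \text{if } y = 0, \end{cases}
\qquad
\mathrm{FL}(p_t) = -\alpha_t \,(1 - p_t)^{\gamma}\, \log(p_t).

\text{Multiclass, with one-hot labels } y \text{ and softmax probabilities } p_c:\quad
\mathrm{FL} = -\sum_{c=1}^{C} \alpha_c\, y_c\, (1 - p_c)^{\gamma} \log(p_c),
\qquad
p_c = \frac{e^{z_c}}{\sum_{k=1}^{C} e^{z_k}}.
```

Setting γ = 0 recovers weighted cross-entropy, and additionally setting α_t = 1 recovers plain cross-entropy.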
04 When to Use
Use Focal Loss when your dataset has severe class imbalance or contains many easy negatives that dominate training. Typical scenarios include object detection (background vs. objects), anomaly and fraud detection (rare positives), medical diagnosis (rare disease), and extreme multi-class problems where a few classes dominate. It is also useful when your metrics prioritize recall on rare classes or when you see the model quickly overfitting to easy examples while underperforming on hard ones. Choose γ in [1, 3] as a starting point; higher γ increases focus on hard examples but can slow learning if set too high. Use α when class frequencies are highly skewed or when misclassification costs differ: set α higher for the rare or more costly class. If your data are relatively balanced and you don’t observe domination by easy examples, standard cross-entropy (γ = 0) may suffice and train faster with simpler tuning.
⚠️ Common Mistakes
- Forgetting to pass probabilities, not logits, into the focal term (the log and (1 - p_t)^γ factors); always apply sigmoid/softmax first, or use log-sum-exp tricks for stability.
- Numerical instability: computing log(0) or raising values near 0 or 1 to the power γ. Always clip probabilities into [ε, 1 - ε] with a small ε such as 1e-8.
- Misusing α: setting the same α for both classes in a highly imbalanced dataset defeats the purpose. Typically, set α close to the minority class prior or tune it on a validation set.
- Over-large γ: very high γ can excessively down-weight easy examples and slow or destabilize training. Start with γ in [1, 3] and adjust modestly.
- Incorrect gradient through softmax/sigmoid: implement derivatives carefully; for multiclass, use the softmax Jacobian p_c(δ_ck - p_k).
- Mixing label encodings: for binary versions, ensure y ∈ {0, 1} (not {-1, 1}) unless you re-derive the formula.
- Averaging/normalization mistakes: be consistent when averaging losses over a batch, especially when using per-class α weights; otherwise gradients can become scale-dependent in unintended ways.
Key Formulas
Focal Loss (Binary form)
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)
Explanation: The focal loss scales cross-entropy by (1 - p_t)^γ to reduce the impact of well-classified examples. α_t balances class contributions when data are imbalanced.
Definition of p_t and Sigmoid
p_t = p if y = 1, otherwise 1 - p, where p = σ(z) = 1 / (1 + e^(-z))
Explanation: p_t denotes the model’s probability assigned to the true class. The sigmoid maps the logit z to the probability p in binary settings.
Multiclass Focal Loss
FL = -Σ_c α_c y_c (1 - p_c)^γ log(p_c),  with p_c = e^(z_c) / Σ_k e^(z_k)
Explanation: For one-hot labels y, only the true class contributes. Softmax converts logits to a valid probability distribution over classes.
Derivative w.r.t. p_t
∂FL/∂p_t = α_t γ (1 - p_t)^(γ-1) log(p_t) - α_t (1 - p_t)^γ / p_t
Explanation: This shows how the loss changes with the probability of the true class. It combines the effect of the modulating factor and the logarithm of the probability.
Binary Gradient via Chain Rule
∂FL/∂z = ∂FL/∂p_t · (∂p_t/∂p) · p(1 - p),  with ∂p_t/∂p = +1 if y = 1 and -1 if y = 0
Explanation: To backpropagate through the logit, multiply the derivative w.r.t. the probability by the sigmoid derivative p(1 - p). This yields gradients for logistic models.
Softmax Jacobian
∂p_c/∂z_k = p_c (δ_ck - p_k)
Explanation: This relation allows computing gradients of losses through softmax. For focal loss, only the true class probability’s derivative is needed, then multiplied by this Jacobian.
Reduction to Cross-Entropy
γ = 0  ⇒  FL(p_t) = -α_t log(p_t)
Explanation: When the focusing parameter γ is zero, the modulating factor becomes 1, and focal loss reduces to weighted cross-entropy.
Sensitivity to Gamma
∂FL/∂γ = -α_t (1 - p_t)^γ log(1 - p_t) log(p_t)
Explanation: This derivative shows how the loss changes with γ. It is useful for analyzing or even learning γ, though typically γ is tuned as a hyperparameter.
Cross-Entropy
CE(p_t) = -log(p_t)
Explanation: Standard classification loss without focusing or class rebalancing. Focal loss generalizes this by adding class weights and the modulating factor.
Computational Complexity
Time: O(NC) per batch of N examples with C classes; Memory: O(NC)
Explanation: Computing probabilities and the focal term per class is linear in C per example, leading to O(NC) per batch. Memory is dominated by storing activations and gradients.
Complexity Analysis
Relative to cross-entropy, focal loss adds only a constant number of extra operations per class (one power and a few multiplies for the modulating factor), so the time cost remains O(NC) per batch of N examples with C classes, and memory usage is unchanged, dominated by activations and gradients.
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct BinaryFocalLossResult {
    double avg_loss;        // average loss over batch
    vector<double> dL_dz;   // gradient w.r.t. logits z for each sample
};

// Compute sigmoid with numerical stability
static inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

BinaryFocalLossResult binary_focal_loss(const vector<double>& logits,
                                        const vector<int>& labels,   // 0 or 1
                                        double alpha_pos = 0.25,     // alpha for y=1
                                        double gamma = 2.0,
                                        double eps = 1e-8) {
    size_t n = logits.size();
    BinaryFocalLossResult out;
    out.dL_dz.assign(n, 0.0);

    double loss_sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        int y = labels[i];
        double z = logits[i];
        double p = sigmoid(z);   // probability for class 1
        // p_t depends on label
        double p_t = (y == 1) ? p : (1.0 - p);
        // alpha_t balances classes
        double alpha_t = (y == 1) ? alpha_pos : (1.0 - alpha_pos);

        // Clip to avoid log(0) and division by zero
        p_t = min(max(p_t, eps), 1.0 - eps);
        p = min(max(p, eps), 1.0 - eps);

        // Focal loss for this example
        double mod = pow(1.0 - p_t, gamma);   // (1 - p_t)^gamma
        double L = -alpha_t * mod * log(p_t);
        loss_sum += L;

        // Derivative dL/dp_t
        double dL_dp_t = alpha_t * gamma * pow(1.0 - p_t, gamma - 1.0) * log(p_t)
                       - alpha_t * pow(1.0 - p_t, gamma) / p_t;

        // Convert to dL/dp using dp_t/dp = +1 if y=1, -1 if y=0
        double s = (y == 1) ? 1.0 : -1.0;
        double dL_dp = dL_dp_t * s;

        // Chain rule to logits: dp/dz = p(1-p)
        double dL_dz = dL_dp * (p * (1.0 - p));
        out.dL_dz[i] = dL_dz;
    }

    out.avg_loss = loss_sum / static_cast<double>(n);
    return out;
}

int main() {
    // Toy batch: logits and labels with imbalance
    vector<double> logits = {-3.0, -1.0, 0.0, 2.0, 4.0};
    vector<int> labels = {1, 0, 0, 1, 0};

    auto res = binary_focal_loss(logits, labels, /*alpha_pos=*/0.75, /*gamma=*/2.0);

    cout << fixed << setprecision(6);
    cout << "Average focal loss: " << res.avg_loss << "\n";
    cout << "Gradients dL/dz: ";
    for (double g : res.dL_dz) cout << g << ' ';
    cout << "\n";
    return 0;
}
Computes the binary focal loss and its gradient with respect to logits for a mini-batch. We use a numerically stable sigmoid, clip probabilities for stability, and apply the chain rule to backpropagate through sigmoid. α is assigned per class (positives vs. negatives), and γ controls the modulating factor. The output gradients can be used in a custom training loop.
#include <bits/stdc++.h>
using namespace std;

struct MultiFocalGrad {
    double loss;
    vector<double> dL_dz;   // gradient w.r.t. logits (size C)
};

static inline double logsumexp(const vector<double>& v) {
    double m = *max_element(v.begin(), v.end());
    double sum = 0.0;
    for (double x : v) sum += exp(x - m);
    return m + log(sum);
}

MultiFocalGrad multiclass_focal_loss_one(const vector<double>& logits,  // size C
                                         int true_class,               // 0..C-1
                                         const vector<double>& alpha,  // size C (class weights)
                                         double gamma = 2.0,
                                         double eps = 1e-8) {
    int C = (int)logits.size();
    MultiFocalGrad out;
    out.dL_dz.assign(C, 0.0);

    // Stable softmax
    double lse = logsumexp(logits);
    vector<double> p(C);
    for (int c = 0; c < C; ++c) p[c] = exp(logits[c] - lse);

    // True class probability and alpha
    double pc = min(max(p[true_class], eps), 1.0 - eps);
    double alpha_c = alpha.empty() ? 1.0 : alpha[true_class];

    // Loss
    double mod = pow(1.0 - pc, gamma);
    out.loss = -alpha_c * mod * log(pc);

    // dL/dp_true
    double dL_dpc = alpha_c * gamma * pow(1.0 - pc, gamma - 1.0) * log(pc)
                  - alpha_c * pow(1.0 - pc, gamma) / pc;

    // Use softmax Jacobian: dp_true/dz_k = p_true (delta_{k,true} - p_k)
    for (int k = 0; k < C; ++k) {
        double dpc_dzk = p[true_class] * ((k == true_class ? 1.0 : 0.0) - p[k]);
        out.dL_dz[k] = dL_dpc * dpc_dzk;
    }

    return out;
}

int main() {
    vector<double> z = {2.0, 0.5, -1.0};     // logits for 3 classes
    int y = 0;                               // true class
    vector<double> alpha = {1.0, 1.0, 1.0};  // uniform weights

    auto res = multiclass_focal_loss_one(z, y, alpha, 2.0);
    cout << fixed << setprecision(6);
    cout << "Loss: " << res.loss << "\nGradients: ";
    for (double g : res.dL_dz) cout << g << ' ';
    cout << "\n";
    return 0;
}
Computes multiclass focal loss for a single example with a numerically stable softmax and returns the gradient w.r.t. each logit. Only the true class probability enters the focal term, and its gradient is propagated to all logits via the softmax Jacobian.
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double z) {
    if (z >= 0) { double ez = exp(-z); return 1.0 / (1.0 + ez); }
    double ez = exp(z);
    return ez / (1.0 + ez);
}

struct Example { vector<double> x; int y; };

// Compute batch focal gradients w.r.t. logits and average loss
void focal_batch_grad(const vector<double>& z, const vector<int>& y,
                      vector<double>& dL_dz, double& avg_loss,
                      double alpha_pos = 0.5, double gamma = 2.0, double eps = 1e-8) {
    int n = (int)z.size();
    dL_dz.assign(n, 0.0);
    double loss_sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double p = sigmoid(z[i]);
        int yi = y[i];
        double pt = (yi == 1) ? p : (1.0 - p);
        double at = (yi == 1) ? alpha_pos : (1.0 - alpha_pos);
        pt = min(max(pt, eps), 1.0 - eps);
        p = min(max(p, eps), 1.0 - eps);
        double mod = pow(1.0 - pt, gamma);
        double Li = -at * mod * log(pt);
        loss_sum += Li;
        double dL_dpt = at * gamma * pow(1.0 - pt, gamma - 1.0) * log(pt)
                      - at * pow(1.0 - pt, gamma) / pt;
        double s = (yi == 1) ? 1.0 : -1.0;
        double dL_dp = dL_dpt * s;
        dL_dz[i] = dL_dp * (p * (1.0 - p));
    }
    avg_loss = loss_sum / n;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create an imbalanced synthetic dataset: positives are rare.
    // Labels are drawn Bernoulli(0.1); positive examples are then shifted
    // along two features to make them slightly separable.
    int N = 2000, d = 5;
    vector<Example> data; data.reserve(N);
    mt19937 rng(123);
    normal_distribution<double> noise(0.0, 1.0);
    bernoulli_distribution rare_pos(0.1);   // 10% positives

    for (int i = 0; i < N; ++i) {
        vector<double> x(d);
        for (int j = 0; j < d; ++j) x[j] = noise(rng);
        int y = rare_pos(rng) ? 1 : 0;
        if (y == 1) { x[0] += 1.5; x[2] += 0.7; }
        data.push_back({x, y});
    }

    // Initialize model
    vector<double> w(d, 0.0);
    double b = 0.0, lr = 0.05;

    // Train with SGD
    int epochs = 10; int batch = 64;
    for (int ep = 0; ep < epochs; ++ep) {
        shuffle(data.begin(), data.end(), rng);
        double epoch_loss = 0.0; int batches = 0;
        for (int i = 0; i < N; i += batch) {
            int R = min(batch, N - i);
            vector<double> z(R), dL_dz(R);
            vector<int> y(R);
            // forward logits
            for (int r = 0; r < R; ++r) {
                y[r] = data[i + r].y;
                z[r] = inner_product(data[i + r].x.begin(), data[i + r].x.end(), w.begin(), 0.0) + b;
            }
            // loss + gradient w.r.t. logits
            double avg_loss;
            focal_batch_grad(z, y, dL_dz, avg_loss, /*alpha_pos=*/0.75, /*gamma=*/2.0);
            epoch_loss += avg_loss; ++batches;
            // backprop to weights: dL/dw = sum_i dL/dz_i * x_i, dL/db = sum_i dL/dz_i
            vector<double> grad_w(d, 0.0); double grad_b = 0.0;
            for (int r = 0; r < R; ++r) {
                grad_b += dL_dz[r];
                for (int j = 0; j < d; ++j) grad_w[j] += dL_dz[r] * data[i + r].x[j];
            }
            // SGD update
            for (int j = 0; j < d; ++j) w[j] -= lr * (grad_w[j] / R);
            b -= lr * (grad_b / R);
        }
        cout << "Epoch " << ep + 1 << ": avg loss = " << (epoch_loss / batches) << "\n";
    }

    // Evaluate with a confusion matrix
    int TP = 0, FP = 0, TN = 0, FN = 0;
    for (auto& e : data) {
        double z = inner_product(e.x.begin(), e.x.end(), w.begin(), 0.0) + b;
        double p = sigmoid(z);
        int pred = (p >= 0.5);
        if (e.y == 1 && pred == 1) ++TP;
        else if (e.y == 1) ++FN;
        else if (pred == 1) ++FP;
        else ++TN;
    }
    cout << "Confusion matrix: TP=" << TP << " FP=" << FP << " TN=" << TN << " FN=" << FN << "\n";
    double precision = TP / (double)(TP + FP + 1e-12), recall = TP / (double)(TP + FN + 1e-12);
    cout << fixed << setprecision(4);
    cout << "Precision=" << precision << " Recall=" << recall << "\n";
    return 0;
}
A minimal end-to-end example training a logistic regression classifier with focal loss on an imbalanced synthetic dataset. The code computes batch logits, obtains focal loss gradients, and uses SGD to update weights. It reports epoch losses and a simple confusion-matrix-based evaluation, illustrating how focal loss can emphasize rare positives via α and γ.