
Focal Loss

Key Points

  • Focal Loss reshapes cross-entropy so that hard, misclassified examples get more focus while easy, well-classified ones are down-weighted.
  • It introduces a focusing parameter γ (gamma) and a class-balancing weight α (alpha) to handle class imbalance and reduce the dominance of easy negatives.
  • For binary classification, Focal Loss is FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the model's probability for the true class.
  • When γ = 0, Focal Loss reduces to standard (weighted) cross-entropy, making it a drop-in generalization.
  • Numerically stable implementations require clipping probabilities and using stable sigmoid/softmax computations to avoid log(0).
  • In practice, γ is often set between 1 and 3, and α is used to balance rare positive classes against abundant negatives.
  • Computational complexity is similar to cross-entropy: O(NC) per batch for C classes and N samples, with minimal extra overhead for the modulating term.
  • Focal Loss is especially effective in object detection and any task with extreme class imbalance or many easy negatives.

Prerequisites

  • Logistic Regression — Provides the probabilistic classification framework and introduces logits and sigmoid.
  • Cross-Entropy Loss — Focal loss generalizes cross-entropy; understanding CE clarifies the role of focusing and class weights.
  • Softmax and Sigmoid Functions — Required to map logits to probabilities used inside focal loss.
  • Chain Rule and Basic Gradients — Backpropagation through sigmoid/softmax and the focal term relies on the chain rule.
  • Numerical Stability (Log-Sum-Exp, Clipping) — Prevents overflow/underflow when computing logs and exponentials.
  • C++ Vectors and Loops — Needed to implement and understand the provided code examples.
  • Matrix/Vector Inner Products — Used to compute logits in linear models and in gradient updates.
  • Imbalanced Classification Metrics — Precision/recall and confusion matrices contextualize why focal loss helps.

Detailed Explanation


01 Overview

Focal Loss is a modification of cross-entropy designed to tackle class imbalance and the overwhelming presence of easy examples during training. In many real-world datasets (e.g., detecting rare events), standard cross-entropy lets the numerous easy negatives dominate the gradient, causing the model to under-learn from rare but important positives. Focal Loss addresses this by down-weighting easy examples and focusing learning on hard, misclassified ones. It achieves this through two knobs: a focusing parameter gamma (γ) that reduces the contribution of well-classified examples, and a weighting parameter alpha (α) that balances classes by giving minority classes a proportionally higher weight. Intuitively, when the model is already confident about an example, its loss and gradient shrink sharply; when the model is wrong or uncertain, the loss remains significant. This behavior leads to faster progress on underrepresented or difficult examples without heavily changing the training pipeline. Importantly, when γ = 0, Focal Loss recovers standard weighted cross-entropy, so it’s a conservative extension rather than a completely new objective.

02 Intuition & Analogies

Imagine you’re grading homework for a very large class. Most students already get the easy questions right; spending more time on them won’t improve learning much. Instead, you want to devote your attention to the students who consistently miss certain questions. Focal Loss is that strategy in math form. It looks at the model’s confidence p_t (the probability it assigns to the correct class). If p_t is high, the example is easy; Focal Loss multiplies the usual cross-entropy by (1 - p_t)^γ, which is small, so you spend little effort there. If p_t is low, the example is hard; the multiplier is near 1, so the loss stays large and the model focuses on it. The α parameter plays the role of ensuring minority students (rare classes) are not ignored simply because they are few; it scales their contribution up so they remain visible in the learning process. Hook: think of a tutor who watches which questions a student gets wrong and leans into those. Concept: Focal Loss down-weights easy cases and up-weights hard ones and rare classes. Example: in object detection, there are far more background boxes than object boxes; Focal Loss prevents the overwhelming number of background (easy negatives) from drowning out the signal from true objects.

03 Formal Definition

Let y be the true label and p the model's predicted probability for class 1 in binary classification. Define p_t = p if y = 1, and p_t = 1 − p if y = 0. The Focal Loss for a single example is FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where α_t is a class-dependent weight (e.g., α for positives and 1 − α for negatives), and γ ≥ 0 is the focusing parameter. For multiclass classification with softmax probabilities p_c and one-hot labels y_c, the loss is FL = −∑_{c=1}^{C} α_c y_c (1 − p_c)^γ log(p_c). When γ = 0, (1 − p_t)^γ = 1 and we recover weighted cross-entropy. As γ increases, the factor (1 − p_t)^γ suppresses losses where p_t is close to 1 (easy cases), without significantly affecting hard cases where p_t is small. The function remains differentiable for p_t in (0, 1), enabling gradient-based optimization using the chain rule through sigmoid (binary) or softmax (multiclass).

04 When to Use

Use Focal Loss when your dataset has severe class imbalance or contains many easy negatives that dominate training. Typical scenarios include object detection (background vs. objects), anomaly and fraud detection (rare positives), medical diagnosis (rare disease), and extreme multi-class problems where a few classes dominate. It is also useful when your metrics prioritize recall on rare classes or when you see the model quickly overfitting to easy examples while underperforming on hard ones. Choose γ in [1, 3] as a starting point; higher γ increases focus on hard examples but can slow learning if set too high. Use α when class frequencies are highly skewed or when misclassification costs differ: set α higher for the rare or more costly class. If your data are relatively balanced and you don't observe domination by easy examples, standard cross-entropy (γ = 0) may suffice and train faster with simpler tuning.

⚠️Common Mistakes

  • Forgetting to pass probabilities, not logits, into the focal term (the log and (1 − p_t)^γ); always apply sigmoid/softmax first, or use log-sum-exp tricks for stability.
  • Numerical instability: computing log(0) or raising values near 0 or 1 to the power γ. Always clip probabilities into [ε, 1 − ε] with a small ε such as 1e-8.
  • Misusing α: setting the same α for both classes in a highly imbalanced dataset defeats the purpose. Typically, set α close to the minority-class prior or tune it on a validation set.
  • Over-large γ: very high γ can excessively down-weight easy examples and slow or destabilize training. Start with γ in [1, 3] and adjust modestly.
  • Incorrect gradients through softmax/sigmoid: implement derivatives carefully; for multiclass, use the softmax Jacobian p_c(δ_ck − p_k).
  • Mixing label encodings: for binary versions, ensure y ∈ {0, 1} (not {−1, 1}) unless you re-derive the formula.
  • Averaging/normalization mistakes: be consistent when averaging losses over a batch, especially when using per-class α weights; otherwise gradients can become scale-dependent in unintended ways.

Key Formulas

Focal Loss (Binary form)

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)

Explanation: The focal loss scales cross-entropy by (1 − p_t)^γ to reduce the impact of well-classified examples. α_t balances class contributions when data are imbalanced.

Definition of p_t and Sigmoid

p_t = p if y = 1, p_t = 1 − p if y = 0;  p = σ(z) = 1 / (1 + e^(−z))

Explanation: p_t denotes the model's probability assigned to the true class. The sigmoid maps the logit z to the probability p in binary settings.

Multiclass Focal Loss

FL_multi = −∑_{c=1}^{C} α_c y_c (1 − p_c)^γ log(p_c),  p_c = e^(z_c) / ∑_{j=1}^{C} e^(z_j)

Explanation: For one-hot labels y_c, only the true class contributes. Softmax converts logits to a valid probability distribution over classes.

Derivative w.r.t. p_t

∂FL/∂p_t = α_t γ (1 − p_t)^(γ−1) log(p_t) − α_t (1 − p_t)^γ / p_t

Explanation: This shows how the loss changes with the probability of the true class. It combines the effect of the modulating factor and the logarithm of the probability.

Binary Gradient via Chain Rule

∂FL/∂z = (∂FL/∂p)(∂p/∂z),  ∂p/∂z = p(1 − p)

Explanation: To backpropagate through logits, multiply the derivative w.r.t. probability by the sigmoid derivative p(1-p). This yields gradients for logistic models.

Softmax Jacobian

∂p_i/∂z_k = p_i (δ_ik − p_k)

Explanation: This relation allows computing gradients of losses through softmax. For focal loss, only the true class probability’s derivative is needed, then multiplied by this Jacobian.

Reduction to Cross-Entropy

lim_{γ→0} FL(p_t) = −α_t log(p_t)

Explanation: When the focusing parameter is zero, the modulating factor becomes 1, and focal loss reduces to weighted cross-entropy.

Sensitivity to Gamma

∂FL/∂γ = −α_t (1 − p_t)^γ ln(1 − p_t) log(p_t)

Explanation: This derivative shows how the loss changes with γ. It is useful for analyzing or even learning γ, though typically γ is tuned as a hyperparameter.

Cross-Entropy

CE = −∑_{c=1}^{C} y_c log(p_c)

Explanation: Standard classification loss without focusing or class rebalancing. Focal loss generalizes this by adding class weights and the modulating factor.

Computational Complexity

T(N, C) = O(NC),  S(N, C) = O(C) per sample

Explanation: Computing probabilities and the focal term per class is linear in C per example, leading to O(NC) per batch. Memory is dominated by storing activations and gradients.

Complexity Analysis

Computing focal loss adds minimal overhead to the standard cross-entropy pipeline. For a batch of N samples and C classes, we must compute probabilities (via sigmoid for binary or softmax for multiclass), apply the modulating factor (1 − p_t)^γ, and take logarithms. Each of these operations is O(1) per class, so the total forward cost is O(NC). The backward pass requires derivatives through sigmoid/softmax and the focal term; this is also O(NC), dominated by vector operations similar to cross-entropy. Space complexity per sample is O(C) to store logits, probabilities, and (optionally) intermediate terms like p_t and the modulating factor. In practice, deep learning frameworks reuse buffers and fuse operations, making the additional memory for the focal term negligible relative to activations in preceding layers. When training large models, overall runtime is governed by matrix multiplications in earlier layers; the loss computation remains a tiny fraction of total time. Numerical stability measures (e.g., probability clipping, log-sum-exp for softmax) add constant-time safeguards without changing asymptotic behavior. Thus, adopting focal loss typically maintains the same big-O profile as cross-entropy while providing better gradient signals for imbalanced or hard-example-heavy datasets.

Code Examples

Binary Focal Loss (forward + gradient for a batch)
#include <bits/stdc++.h>
using namespace std;

struct BinaryFocalLossResult {
    double avg_loss;       // average loss over batch
    vector<double> dL_dz;  // gradient w.r.t. logits z for each sample
};

// Compute sigmoid with numerical stability
static inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

BinaryFocalLossResult binary_focal_loss(const vector<double>& logits,
                                        const vector<int>& labels,  // 0 or 1
                                        double alpha_pos = 0.25,    // alpha for y=1
                                        double gamma = 2.0,
                                        double eps = 1e-8) {
    size_t n = logits.size();
    BinaryFocalLossResult out;
    out.dL_dz.assign(n, 0.0);

    double loss_sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        int y = labels[i];
        double z = logits[i];
        double p = sigmoid(z);  // probability for class 1
        // p_t depends on label
        double p_t = (y == 1) ? p : (1.0 - p);
        // alpha_t balances classes
        double alpha_t = (y == 1) ? alpha_pos : (1.0 - alpha_pos);

        // Clip to avoid log(0) and division by zero
        p_t = min(max(p_t, eps), 1.0 - eps);
        p = min(max(p, eps), 1.0 - eps);

        // Focal Loss for this example
        double mod = pow(1.0 - p_t, gamma);  // (1 - p_t)^gamma
        double L = -alpha_t * mod * log(p_t);
        loss_sum += L;

        // Derivative dL/dp_t
        double dL_dp_t = alpha_t * gamma * pow(1.0 - p_t, gamma - 1.0) * log(p_t)
                       - alpha_t * pow(1.0 - p_t, gamma) / p_t;

        // Convert to dL/dp using dp_t/dp = +1 if y=1, -1 if y=0
        double s = (y == 1) ? 1.0 : -1.0;
        double dL_dp = dL_dp_t * s;

        // Chain rule to logits: dp/dz = p(1-p)
        out.dL_dz[i] = dL_dp * (p * (1.0 - p));
    }

    out.avg_loss = loss_sum / static_cast<double>(n);
    return out;
}

int main() {
    // Toy batch: logits and labels with imbalance
    vector<double> logits = {-3.0, -1.0, 0.0, 2.0, 4.0};
    vector<int> labels = {1, 0, 0, 1, 0};

    auto res = binary_focal_loss(logits, labels, /*alpha_pos=*/0.75, /*gamma=*/2.0);

    cout << fixed << setprecision(6);
    cout << "Average focal loss: " << res.avg_loss << "\n";
    cout << "Gradients dL/dz: ";
    for (double g : res.dL_dz) cout << g << ' ';
    cout << "\n";
    return 0;
}

Computes the binary focal loss and its gradient with respect to logits for a mini-batch. We use a numerically stable sigmoid, clip probabilities for stability, and apply the chain rule to backpropagate through sigmoid. α is assigned per class (positives vs. negatives), and γ controls the modulating factor. The output gradients can be used in a custom training loop.

Time: O(n) for n samples. Space: O(n) to store gradients.
Multiclass Focal Loss with Softmax (forward + gradient for one sample)
#include <bits/stdc++.h>
using namespace std;

struct MultiFocalGrad {
    double loss;
    vector<double> dL_dz;  // gradient w.r.t. logits (size C)
};

static inline double logsumexp(const vector<double>& v) {
    double m = *max_element(v.begin(), v.end());
    double sum = 0.0;
    for (double x : v) sum += exp(x - m);
    return m + log(sum);
}

MultiFocalGrad multiclass_focal_loss_one(const vector<double>& logits,  // size C
                                         int true_class,               // 0..C-1
                                         const vector<double>& alpha,  // size C (class weights)
                                         double gamma = 2.0,
                                         double eps = 1e-8) {
    int C = (int)logits.size();
    MultiFocalGrad out;
    out.dL_dz.assign(C, 0.0);

    // Stable softmax
    double lse = logsumexp(logits);
    vector<double> p(C);
    for (int c = 0; c < C; ++c) p[c] = exp(logits[c] - lse);

    // True class probability and alpha
    double pc = min(max(p[true_class], eps), 1.0 - eps);
    double alpha_c = alpha.empty() ? 1.0 : alpha[true_class];

    // Loss
    double mod = pow(1.0 - pc, gamma);
    out.loss = -alpha_c * mod * log(pc);

    // dL/dp_true
    double dL_dpc = alpha_c * gamma * pow(1.0 - pc, gamma - 1.0) * log(pc)
                  - alpha_c * pow(1.0 - pc, gamma) / pc;

    // Softmax Jacobian: dp_true/dz_k = p_true (delta_{k,true} - p_k)
    for (int k = 0; k < C; ++k) {
        double dpc_dzk = p[true_class] * ((k == true_class ? 1.0 : 0.0) - p[k]);
        out.dL_dz[k] = dL_dpc * dpc_dzk;
    }

    return out;
}

int main() {
    vector<double> z = {2.0, 0.5, -1.0};     // logits for 3 classes
    int y = 0;                               // true class
    vector<double> alpha = {1.0, 1.0, 1.0};  // uniform weights

    auto res = multiclass_focal_loss_one(z, y, alpha, 2.0);
    cout << fixed << setprecision(6);
    cout << "Loss: " << res.loss << "\nGradients: ";
    for (double g : res.dL_dz) cout << g << ' ';
    cout << "\n";
    return 0;
}

Computes multiclass focal loss for a single example with a numerically stable softmax and returns the gradient w.r.t. each logit. Only the true class probability enters the focal term, and its gradient is propagated to all logits via the softmax Jacobian.

Time: O(C) per example. Space: O(C) for probabilities and gradients.
Training a tiny binary logistic regression with Focal Loss (SGD)
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double z) {
    if (z >= 0) { double ez = exp(-z); return 1.0 / (1.0 + ez); }
    double ez = exp(z);
    return ez / (1.0 + ez);
}

struct Example { vector<double> x; int y; };

// Compute batch focal gradients w.r.t. logits and average loss
void focal_batch_grad(const vector<double>& z, const vector<int>& y,
                      vector<double>& dL_dz, double& avg_loss,
                      double alpha_pos = 0.5, double gamma = 2.0, double eps = 1e-8) {
    int n = (int)z.size();
    dL_dz.assign(n, 0.0);
    double loss_sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double p = sigmoid(z[i]);
        int yi = y[i];
        double pt = (yi == 1) ? p : (1.0 - p);
        double at = (yi == 1) ? alpha_pos : (1.0 - alpha_pos);
        pt = min(max(pt, eps), 1.0 - eps);
        p = min(max(p, eps), 1.0 - eps);
        double mod = pow(1.0 - pt, gamma);
        loss_sum += -at * mod * log(pt);
        double dL_dpt = at * gamma * pow(1.0 - pt, gamma - 1.0) * log(pt)
                      - at * pow(1.0 - pt, gamma) / pt;
        double s = (yi == 1) ? 1.0 : -1.0;            // dp_t/dp
        dL_dz[i] = dL_dpt * s * (p * (1.0 - p));      // chain rule through sigmoid
    }
    avg_loss = loss_sum / n;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create an imbalanced synthetic dataset: positives are rare
    int N = 2000, d = 5;
    vector<Example> data; data.reserve(N);
    mt19937 rng(123);
    normal_distribution<double> noise(0.0, 1.0);
    bernoulli_distribution rare_pos(0.1);  // 10% positives

    for (int i = 0; i < N; ++i) {
        vector<double> x(d);
        for (int j = 0; j < d; ++j) x[j] = noise(rng);
        int y = rare_pos(rng) ? 1 : 0;
        // make positives slightly separable by shifting two features
        if (y == 1) { x[0] += 1.5; x[2] += 0.7; }
        data.push_back({x, y});
    }

    // Initialize model
    vector<double> w(d, 0.0);
    double b = 0.0, lr = 0.05;

    // Train with SGD
    int epochs = 10, batch = 64;
    for (int ep = 0; ep < epochs; ++ep) {
        shuffle(data.begin(), data.end(), rng);
        double epoch_loss = 0.0; int batches = 0;
        for (int i = 0; i < N; i += batch) {
            int R = min(batch, N - i);
            vector<double> z(R), dL_dz(R);
            vector<int> y(R);
            // forward logits
            for (int r = 0; r < R; ++r) {
                y[r] = data[i + r].y;
                z[r] = inner_product(data[i + r].x.begin(), data[i + r].x.end(),
                                     w.begin(), 0.0) + b;
            }
            // loss + gradient w.r.t. logits
            double avg_loss;
            focal_batch_grad(z, y, dL_dz, avg_loss, /*alpha_pos=*/0.75, /*gamma=*/2.0);
            epoch_loss += avg_loss; ++batches;
            // backprop to weights: dL/dw = sum_i dL/dz_i * x_i, dL/db = sum_i dL/dz_i
            vector<double> grad_w(d, 0.0); double grad_b = 0.0;
            for (int r = 0; r < R; ++r) {
                grad_b += dL_dz[r];
                for (int j = 0; j < d; ++j) grad_w[j] += dL_dz[r] * data[i + r].x[j];
            }
            // SGD update (gradients averaged over the batch)
            for (int j = 0; j < d; ++j) w[j] -= lr * (grad_w[j] / R);
            b -= lr * (grad_b / R);
        }
        cout << "Epoch " << ep + 1 << ": avg loss = " << (epoch_loss / batches) << "\n";
    }

    // Evaluate on the training data
    int TP = 0, FP = 0, TN = 0, FN = 0;
    for (auto& e : data) {
        double z = inner_product(e.x.begin(), e.x.end(), w.begin(), 0.0) + b;
        int pred = (sigmoid(z) >= 0.5);
        if (e.y == 1 && pred == 1) ++TP;
        else if (e.y == 1) ++FN;
        else if (pred == 1) ++FP;
        else ++TN;
    }
    cout << "Confusion matrix: TP=" << TP << " FP=" << FP
         << " TN=" << TN << " FN=" << FN << "\n";
    double precision = TP / (double)(TP + FP + 1e-12);
    double recall = TP / (double)(TP + FN + 1e-12);
    cout << fixed << setprecision(4);
    cout << "Precision=" << precision << " Recall=" << recall << "\n";
    return 0;
}

A minimal end-to-end example training a logistic regression classifier with focal loss on an imbalanced synthetic dataset. The code computes batch logits, obtains focal loss gradients, and uses SGD to update weights. It reports epoch losses and a simple confusion-matrix-based evaluation, illustrating how focal loss can emphasize rare positives via α and γ.

Time: O(epochs × N × d) for training; loss/grad per batch is O(batch × d). Space: O(d) for model parameters plus O(batch) for intermediate buffers.
Tags: focal loss, class imbalance, cross-entropy, alpha balancing, gamma focusing parameter, softmax, sigmoid, hard example mining, object detection, gradient, probability clipping, numerical stability, logits, weighted loss