
Focal Loss

Key Points

  • Focal Loss reshapes cross-entropy so that hard, misclassified examples get more focus while easy, well-classified ones are down-weighted.
  • It introduces a focusing parameter γ (gamma) and a class-balancing weight α (alpha) to handle class imbalance and reduce the dominance of easy negatives.
  • For binary classification, Focal Loss is FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the model's probability for the true class.
  • When γ = 0, Focal Loss reduces to standard (weighted) cross-entropy, making it a drop-in generalization.
  • Numerically stable implementations require clipping probabilities and using stable sigmoid/softmax computations to avoid log(0).
  • In practice, γ is often set between 1 and 3, and α is used to balance rare positive classes against abundant negatives.
  • Computational complexity is similar to cross-entropy: O(NC) per batch for C classes and N samples, with minimal extra overhead for the modulating term.
  • Focal Loss is especially effective in object detection and any task with extreme class imbalance or many easy negatives.

Prerequisites

  • Logistic Regression — Provides the probabilistic classification framework and introduces logits and sigmoid.
  • Cross-Entropy Loss — Focal loss generalizes cross-entropy; understanding CE clarifies the role of focusing and class weights.
  • Softmax and Sigmoid Functions — Required to map logits to probabilities used inside focal loss.
  • Chain Rule and Basic Gradients — Backpropagation through sigmoid/softmax and the focal term relies on the chain rule.
  • Numerical Stability (Log-Sum-Exp, Clipping) — Prevents overflow/underflow when computing logs and exponentials.
  • C++ Vectors and Loops — Needed to implement and understand the provided code examples.
  • Matrix/Vector Inner Products — Used to compute logits in linear models and in gradient updates.
  • Imbalanced Classification Metrics — Precision/recall and confusion matrices contextualize why focal loss helps.

Detailed Explanation


01 Overview

Focal Loss is a modification of cross-entropy designed to tackle class imbalance and the overwhelming presence of easy examples during training. In many real-world datasets (e.g., detecting rare events), standard cross-entropy lets the numerous easy negatives dominate the gradient, causing the model to under-learn from rare but important positives. Focal Loss addresses this by down-weighting easy examples and focusing learning on hard, misclassified ones. It achieves this through two knobs: a focusing parameter gamma (γ) that reduces the contribution of well-classified examples, and a weighting parameter alpha (α) that balances classes by giving minority classes a proportionally higher weight. Intuitively, when the model is already confident about an example, its loss and gradient shrink sharply; when the model is wrong or uncertain, the loss remains significant. This behavior leads to faster progress on underrepresented or difficult examples without heavily changing the training pipeline. Importantly, when γ = 0, Focal Loss recovers standard weighted cross-entropy, so it’s a conservative extension rather than a completely new objective.

02 Intuition & Analogies

Imagine you’re grading homework for a very large class. Most students already get the easy questions right; spending more time on them won’t improve learning much. Instead, you want to devote your attention to the students who consistently miss certain questions. Focal Loss is that strategy in math form. It looks at the model’s confidence p_t (the probability it assigns to the correct class). If p_t is high, the example is easy; Focal Loss multiplies the usual cross-entropy by (1 - p_t)^γ, which is small, so you spend little effort there. If p_t is low, the example is hard; the multiplier is near 1, so the loss stays large and the model focuses on it. The α parameter plays the role of ensuring minority students (rare classes) are not ignored simply because they are few; it scales their contribution up so they remain visible in the learning process. Hook: think of a tutor who watches which questions a student gets wrong and leans into those. Concept: Focal Loss down-weights easy cases and up-weights hard ones and rare classes. Example: in object detection, there are far more background boxes than object boxes; Focal Loss prevents the overwhelming number of background (easy negatives) from drowning out the signal from true objects.

03 Formal Definition

Let y be the true label and p the model's predicted probability for class 1 in binary classification. Define p_t = p if y = 1, and p_t = 1 − p if y = 0. The Focal Loss for a single example is FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where α_t is a class-dependent weight (e.g., α for positives and 1 − α for negatives), and γ ≥ 0 is the focusing parameter. For multiclass classification with softmax probabilities p_c and one-hot labels y_c, the loss is FL = −∑_{c=1}^{C} α_c y_c (1 − p_c)^γ log(p_c). When γ = 0, (1 − p_t)^γ = 1 and we recover weighted cross-entropy. As γ increases, the factor (1 − p_t)^γ suppresses losses where p_t is close to 1 (easy cases), without significantly affecting hard cases where p_t is small. The function remains differentiable for p_t in (0, 1), enabling gradient-based optimization using the chain rule through sigmoid (binary) or softmax (multiclass).

04 When to Use

Use Focal Loss when your dataset has severe class imbalance or contains many easy negatives that dominate training. Typical scenarios include object detection (background vs. objects), anomaly and fraud detection (rare positives), medical diagnosis (rare disease), and extreme multi-class problems where a few classes dominate. It is also useful when your metrics prioritize recall on rare classes or when you see the model quickly overfitting to easy examples while underperforming on hard ones. Choose γ in [1, 3] as a starting point; higher γ increases focus on hard examples but can slow learning if set too high. Use α when class frequencies are highly skewed or when misclassification costs differ: set α higher for the rare or more costly class. If your data are relatively balanced and you don't observe domination by easy examples, standard cross-entropy (γ = 0) may suffice and train faster with simpler tuning.

⚠️Common Mistakes

  • Forgetting to pass probabilities, not logits, into the focal term (the log and (1 − p_t)^γ); always apply sigmoid/softmax first, or use log-sum-exp tricks for stability.
  • Numerical instability: computing log(0) or raising values near 0 or 1 to the power γ. Always clip probabilities into [ε, 1 − ε] with a small ε such as 1e-8.
  • Misusing α: setting the same α for both classes in a highly imbalanced dataset defeats the purpose. Typically, set α close to the minority-class prior or tune it on a validation set.
  • Over-large γ: very high γ can excessively down-weight easy examples and slow or destabilize training. Start with γ in [1, 3] and adjust modestly.
  • Incorrect gradients through softmax/sigmoid: implement derivatives carefully; for multiclass, use the softmax Jacobian p_c(δ_ck − p_k).
  • Mixing label encodings: for binary versions, ensure y ∈ {0, 1} (not {−1, 1}) unless you re-derive the formula.
  • Averaging/normalization mistakes: be consistent when averaging losses over a batch, especially when using per-class α weights; otherwise gradients can become scale-dependent in unintended ways.

Key Formulas

Focal Loss (Binary form)

FL(p_t) = −α_t (1 − p_t)^γ log(p_t)

Explanation: The focal loss scales cross-entropy by (1 − p_t)^γ to reduce the impact of well-classified examples. α_t balances class contributions when data are imbalanced.

Definition of p_t and Sigmoid

p_t = p if y = 1, p_t = 1 − p if y = 0;  p = σ(z) = 1 / (1 + e^(−z))

Explanation: p_t denotes the model's probability assigned to the true class. The sigmoid maps the logit z to the probability p in binary settings.

Multiclass Focal Loss

FL_multi = −∑_{c=1}^{C} α_c y_c (1 − p_c)^γ log(p_c),  p_c = e^(z_c) / ∑_{j=1}^{C} e^(z_j)

Explanation: For one-hot labels y_c, only the true class contributes. Softmax converts logits to a valid probability distribution over classes.

Derivative w.r.t. p_t

∂FL/∂p_t = α_t γ (1 − p_t)^(γ−1) log(p_t) − α_t (1 − p_t)^γ / p_t

Explanation: This shows how the loss changes with the probability of the true class. It combines the effect of the modulating factor and the logarithm of the probability.

Binary Gradient via Chain Rule

∂FL/∂z = (∂FL/∂p)(∂p/∂z),  ∂p/∂z = p(1 − p)

Explanation: To backpropagate through logits, multiply the derivative w.r.t. probability by the sigmoid derivative p(1-p). This yields gradients for logistic models.

Softmax Jacobian

∂p_i/∂z_k = p_i (δ_ik − p_k)

Explanation: This relation allows computing gradients of losses through softmax. For focal loss, only the true class probability’s derivative is needed, then multiplied by this Jacobian.

Reduction to Cross-Entropy

lim_{γ→0} FL(p_t) = −α_t log(p_t)

Explanation: When the focusing parameter is zero, the modulating factor becomes 1, and focal loss reduces to weighted cross-entropy.

Sensitivity to Gamma

∂FL/∂γ = −α_t (1 − p_t)^γ ln(1 − p_t) log(p_t)

Explanation: This derivative shows how the loss changes with γ. It is useful for analyzing or even learning γ, though typically γ is tuned as a hyperparameter.

Cross-Entropy

CE = −∑_{c=1}^{C} y_c log(p_c)

Explanation: Standard classification loss without focusing or class rebalancing. Focal loss generalizes this by adding class weights and the modulating factor.

Computational Complexity

T(N, C) = O(NC),  S(N, C) = O(C) per sample

Explanation: Computing probabilities and the focal term per class is linear in C per example, leading to O(NC) per batch. Memory is dominated by storing activations and gradients.

Complexity Analysis

Computing focal loss adds minimal overhead to the standard cross-entropy pipeline. For a batch of N samples and C classes, we must compute probabilities (via sigmoid for binary or softmax for multiclass), apply the modulating factor (1 − p_t)^γ, and take logarithms. Each of these operations is O(1) per class, so the total forward cost is O(NC). The backward pass requires derivatives through sigmoid/softmax and the focal term; this is also O(NC), dominated by vector operations similar to cross-entropy. Space complexity per sample is O(C) to store logits, probabilities, and (optionally) intermediate terms like p_t and the modulating factor. In practice, deep learning frameworks reuse buffers and fuse operations, making the additional memory for the focal term negligible relative to activations in preceding layers. When training large models, overall runtime is governed by matrix multiplications in earlier layers; the loss computation remains a tiny fraction of total time. Numerical stability measures (e.g., probability clipping, log-sum-exp for softmax) add constant-time safeguards without changing asymptotic behavior. Thus, adopting focal loss typically maintains the same big-O profile as cross-entropy while providing better gradient signals for imbalanced or hard-example-heavy datasets.

Code Examples

Binary Focal Loss (forward + gradient for a batch)
#include <bits/stdc++.h>
using namespace std;

struct BinaryFocalLossResult {
    double avg_loss;       // average loss over batch
    vector<double> dL_dz;  // gradient w.r.t. logits z for each sample
};

// Compute sigmoid with numerical stability
static inline double sigmoid(double z) {
    if (z >= 0) {
        double ez = exp(-z);
        return 1.0 / (1.0 + ez);
    } else {
        double ez = exp(z);
        return ez / (1.0 + ez);
    }
}

BinaryFocalLossResult binary_focal_loss(const vector<double>& logits,
                                        const vector<int>& labels,  // 0 or 1
                                        double alpha_pos = 0.25,    // alpha for y=1
                                        double gamma = 2.0,
                                        double eps = 1e-8) {
    size_t n = logits.size();
    BinaryFocalLossResult out;
    out.dL_dz.assign(n, 0.0);

    double loss_sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        int y = labels[i];
        double z = logits[i];
        double p = sigmoid(z);  // probability for class 1
        // p_t depends on label
        double p_t = (y == 1) ? p : (1.0 - p);
        // alpha_t balances classes
        double alpha_t = (y == 1) ? alpha_pos : (1.0 - alpha_pos);

        // Clip to avoid log(0) and division by zero
        p_t = min(max(p_t, eps), 1.0 - eps);
        p = min(max(p, eps), 1.0 - eps);

        // Focal Loss for this example
        double mod = pow(1.0 - p_t, gamma);  // (1 - p_t)^gamma
        double L = -alpha_t * mod * log(p_t);
        loss_sum += L;

        // Derivative dL/dp_t
        double dL_dp_t = alpha_t * gamma * pow(1.0 - p_t, gamma - 1.0) * log(p_t)
                       - alpha_t * pow(1.0 - p_t, gamma) / p_t;

        // Convert to dL/dp using dp_t/dp = +1 if y=1, -1 if y=0
        double s = (y == 1) ? 1.0 : -1.0;
        double dL_dp = dL_dp_t * s;

        // Chain rule to logits: dp/dz = p(1-p)
        out.dL_dz[i] = dL_dp * (p * (1.0 - p));
    }

    out.avg_loss = loss_sum / static_cast<double>(n);
    return out;
}

int main() {
    // Toy batch: logits and labels with imbalance
    vector<double> logits = {-3.0, -1.0, 0.0, 2.0, 4.0};
    vector<int> labels = {1, 0, 0, 1, 0};

    auto res = binary_focal_loss(logits, labels, /*alpha_pos=*/0.75, /*gamma=*/2.0);

    cout << fixed << setprecision(6);
    cout << "Average focal loss: " << res.avg_loss << "\n";
    cout << "Gradients dL/dz: ";
    for (double g : res.dL_dz) cout << g << ' ';
    cout << "\n";
    return 0;
}

Computes the binary focal loss and its gradient with respect to logits for a mini-batch. We use a numerically stable sigmoid, clip probabilities for stability, and apply the chain rule to backpropagate through sigmoid. α is assigned per class (positives vs. negatives), and γ controls the modulating factor. The output gradients can be used in a custom training loop.

Time: O(n) for n samples. Space: O(n) to store gradients.
Multiclass Focal Loss with Softmax (forward + gradient for one sample)
#include <bits/stdc++.h>
using namespace std;

struct MultiFocalGrad {
    double loss;
    vector<double> dL_dz;  // gradient w.r.t. logits (size C)
};

static inline double logsumexp(const vector<double>& v) {
    double m = *max_element(v.begin(), v.end());
    double sum = 0.0;
    for (double x : v) sum += exp(x - m);
    return m + log(sum);
}

MultiFocalGrad multiclass_focal_loss_one(const vector<double>& logits,  // size C
                                         int true_class,               // 0..C-1
                                         const vector<double>& alpha,  // size C (class weights)
                                         double gamma = 2.0,
                                         double eps = 1e-8) {
    int C = (int)logits.size();
    MultiFocalGrad out;
    out.dL_dz.assign(C, 0.0);

    // Stable softmax
    double lse = logsumexp(logits);
    vector<double> p(C);
    for (int c = 0; c < C; ++c) p[c] = exp(logits[c] - lse);

    // True class probability and alpha
    double pc = min(max(p[true_class], eps), 1.0 - eps);
    double alpha_c = alpha.empty() ? 1.0 : alpha[true_class];

    // Loss
    double mod = pow(1.0 - pc, gamma);
    out.loss = -alpha_c * mod * log(pc);

    // dL/dp_true
    double dL_dpc = alpha_c * gamma * pow(1.0 - pc, gamma - 1.0) * log(pc)
                  - alpha_c * pow(1.0 - pc, gamma) / pc;

    // Softmax Jacobian: dp_true/dz_k = p_true (delta_{k,true} - p_k)
    for (int k = 0; k < C; ++k) {
        double dpc_dzk = p[true_class] * ((k == true_class ? 1.0 : 0.0) - p[k]);
        out.dL_dz[k] = dL_dpc * dpc_dzk;
    }

    return out;
}

int main() {
    vector<double> z = {2.0, 0.5, -1.0};     // logits for 3 classes
    int y = 0;                               // true class
    vector<double> alpha = {1.0, 1.0, 1.0};  // uniform weights

    auto res = multiclass_focal_loss_one(z, y, alpha, 2.0);
    cout << fixed << setprecision(6);
    cout << "Loss: " << res.loss << "\nGradients: ";
    for (double g : res.dL_dz) cout << g << ' ';
    cout << "\n";
    return 0;
}

Computes multiclass focal loss for a single example with a numerically stable softmax and returns the gradient w.r.t. each logit. Only the true class probability enters the focal term, and its gradient is propagated to all logits via the softmax Jacobian.

Time: O(C) per example. Space: O(C) for probabilities and gradients.
Training a tiny binary logistic regression with Focal Loss (SGD)
#include <bits/stdc++.h>
using namespace std;

static inline double sigmoid(double z) {
    if (z >= 0) { double ez = exp(-z); return 1.0 / (1.0 + ez); }
    double ez = exp(z);
    return ez / (1.0 + ez);
}

struct Example { vector<double> x; int y; };

// Compute batch focal gradients w.r.t. logits and average loss
void focal_batch_grad(const vector<double>& z, const vector<int>& y,
                      vector<double>& dL_dz, double& avg_loss,
                      double alpha_pos = 0.5, double gamma = 2.0, double eps = 1e-8) {
    int n = (int)z.size();
    dL_dz.assign(n, 0.0);
    double loss_sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double p = sigmoid(z[i]);
        int yi = y[i];
        double pt = (yi == 1) ? p : (1.0 - p);
        double at = (yi == 1) ? alpha_pos : (1.0 - alpha_pos);
        pt = min(max(pt, eps), 1.0 - eps);
        p = min(max(p, eps), 1.0 - eps);
        double mod = pow(1.0 - pt, gamma);
        loss_sum += -at * mod * log(pt);
        double dL_dpt = at * gamma * pow(1.0 - pt, gamma - 1.0) * log(pt)
                      - at * pow(1.0 - pt, gamma) / pt;
        double s = (yi == 1) ? 1.0 : -1.0;            // dp_t/dp
        dL_dz[i] = dL_dpt * s * (p * (1.0 - p));      // chain rule through sigmoid
    }
    avg_loss = loss_sum / n;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create an imbalanced synthetic dataset: positives are rare
    int N = 2000, d = 5;
    vector<Example> data; data.reserve(N);
    mt19937 rng(123);
    normal_distribution<double> noise(0.0, 1.0);
    bernoulli_distribution rare_pos(0.1);  // 10% positives

    for (int i = 0; i < N; ++i) {
        vector<double> x(d);
        for (int j = 0; j < d; ++j) x[j] = noise(rng);
        int y = rare_pos(rng) ? 1 : 0;
        // make positives slightly separable by shifting two features
        if (y == 1) { x[0] += 1.5; x[2] += 0.7; }
        data.push_back({x, y});
    }

    // Initialize model
    vector<double> w(d, 0.0);
    double b = 0.0, lr = 0.05;

    // Train with SGD
    int epochs = 10, batch = 64;
    for (int ep = 0; ep < epochs; ++ep) {
        shuffle(data.begin(), data.end(), rng);
        double epoch_loss = 0.0; int batches = 0;
        for (int i = 0; i < N; i += batch) {
            int R = min(batch, N - i);
            vector<double> z(R), dL_dz(R);
            vector<int> y(R);
            // forward logits
            for (int r = 0; r < R; ++r) {
                y[r] = data[i + r].y;
                z[r] = inner_product(data[i + r].x.begin(), data[i + r].x.end(),
                                     w.begin(), 0.0) + b;
            }
            // loss + gradient w.r.t. logits
            double avg_loss;
            focal_batch_grad(z, y, dL_dz, avg_loss, /*alpha_pos=*/0.75, /*gamma=*/2.0);
            epoch_loss += avg_loss; ++batches;
            // backprop to weights: dL/dw = sum_i dL/dz_i * x_i, dL/db = sum_i dL/dz_i
            vector<double> grad_w(d, 0.0); double grad_b = 0.0;
            for (int r = 0; r < R; ++r) {
                grad_b += dL_dz[r];
                for (int j = 0; j < d; ++j) grad_w[j] += dL_dz[r] * data[i + r].x[j];
            }
            // SGD update (gradients averaged over the batch)
            for (int j = 0; j < d; ++j) w[j] -= lr * (grad_w[j] / R);
            b -= lr * (grad_b / R);
        }
        cout << "Epoch " << ep + 1 << ": avg loss = " << (epoch_loss / batches) << "\n";
    }

    // Evaluate on the training data
    int TP = 0, FP = 0, TN = 0, FN = 0;
    for (auto& e : data) {
        double z = inner_product(e.x.begin(), e.x.end(), w.begin(), 0.0) + b;
        int pred = (sigmoid(z) >= 0.5);
        if (e.y == 1 && pred == 1) ++TP;
        else if (e.y == 1) ++FN;
        else if (pred == 1) ++FP;
        else ++TN;
    }
    cout << "Confusion matrix: TP=" << TP << " FP=" << FP
         << " TN=" << TN << " FN=" << FN << "\n";
    double precision = TP / (double)(TP + FP + 1e-12);
    double recall = TP / (double)(TP + FN + 1e-12);
    cout << fixed << setprecision(4);
    cout << "Precision=" << precision << " Recall=" << recall << "\n";
    return 0;
}

A minimal end-to-end example training a logistic regression classifier with focal loss on an imbalanced synthetic dataset. The code computes batch logits, obtains focal loss gradients, and uses SGD to update weights. It reports epoch losses and a simple confusion-matrix-based evaluation, illustrating how focal loss can emphasize rare positives via α and γ.

Time: O(epochs × N × d) for training; loss/grad per batch is O(batch × d). Space: O(d) for model parameters plus O(batch) for intermediate buffers.
Tags: focal loss, class imbalance, cross-entropy, alpha balancing, gamma focusing parameter, softmax, sigmoid, hard example mining, object detection, gradient, probability clipping, numerical stability, logits, weighted loss