📚 Theory · Intermediate

Multi-Task Loss Balancing

Key Points

  • Multi-task loss balancing aims to automatically set each task’s weight so that no single loss dominates training.
  • Uncertainty weighting uses a learnable noise parameter σ_i per task and defines the weight as w_i = 1/(2σ_i²), with an extra log σ_i term to prevent trivial solutions.
  • Reparameterizing with s_i = log σ_i² makes optimization stable and ensures σ_i stays positive.
  • At the optimum, each task’s contribution scales inversely with its noise, so noisier tasks get down-weighted while cleaner tasks drive learning.
  • The combined loss can be derived from maximum likelihood with Gaussian (and related) likelihoods, giving a principled probabilistic foundation.
  • In practice you update the model parameters with weighted gradients and update s_i with a simple closed-form gradient.
  • Numerically, clamp or regularize s_i to avoid extreme weights and use the log-variance parameterization to keep σ_i positive.
  • Computational overhead is minimal: combining T task losses and updating T uncertainties is O(T) per step in time and space.

Prerequisites

  • Gradient descent and backpropagation — Uncertainty weighting is optimized jointly with model parameters using gradient-based methods.
  • Loss functions (MSE, cross-entropy) — You must compute per-task losses consistently to form the weighted sum.
  • Probability and Gaussian likelihood — The uncertainty-weighted objective is derived from maximum likelihood with Gaussian noise.
  • Exponential and logarithm functions — Reparameterization uses s_i = log σ_i² and the weights use e^{−s_i}.
  • Numerical stability practices — Clamping, separate learning rates, and regularization prevent pathological weights.

Detailed Explanation


01 Overview

Multi-task loss balancing addresses a common problem in multi-task learning: different tasks naturally produce losses on different scales. If we simply sum the raw losses, the largest-magnitude loss can dominate, causing the model to prioritize one task at the expense of others. Uncertainty weighting provides a principled way to balance tasks by introducing a learnable noise parameter for each task that sets its weight automatically during training. Concretely, for task i with loss L_i, we add a learnable parameter σ_i (interpreted as homoscedastic, i.e. input-independent, noise). The combined loss is a weighted sum Σ_i L_i/(2σ_i²) plus a stabilizing term Σ_i log σ_i. This form arises naturally from maximum likelihood estimation under Gaussian (for regression) and related likelihoods (for classification) and yields a closed-form gradient for the uncertainty parameters. The effect is that tasks with higher inherent noise get smaller weights, letting the model focus on informative, lower-noise tasks without manual tuning of per-task weight hyperparameters. The method is lightweight, differentiable, and straightforward to integrate into any gradient-based training loop.

02 Intuition & Analogies

Imagine you’re trying to listen to several people speaking at once. Some are speaking clearly (low noise), others are mumbling (high noise). If you try to pay equal attention to all, the mumbling overwhelms you with confusion and you learn less from the clear speakers. A sensible strategy is to pay more attention to clear voices and less to the noisy ones. Uncertainty weighting does exactly this for machine learning tasks. Each task comes with its own level of background noise or difficulty. Instead of guessing how much to weight each task, we let the model learn how noisy each task is. The learned noise becomes a volume knob: if a task is noisy, turn it down; if it’s crisp and reliable, turn it up. The clever trick is to represent the volume knob via σ_i, the task’s noise level, and define the task’s weight as 1/(2σ_i²). This makes intuitive sense: if σ_i is large (very noisy), its weight shrinks; if σ_i is small (clean), its weight grows. But we must prevent the model from cheating by turning every knob down at once: driving each σ_i toward infinity shrinks every weight toward zero and makes the loss trivially small without learning anything. That’s why we add a penalty term, log σ_i, which grows as σ_i grows, keeping the game fair. By optimizing both the model parameters and these per-task noise knobs together, the learning automatically balances attention across tasks, focusing on where it can learn the most.

03 Formal Definition

Consider T tasks, each with a per-batch loss L_i(θ), i = 1, …, T, where θ collects the shared and task-specific model parameters. Assign each task a homoscedastic uncertainty parameter σ_i > 0. The uncertainty-weighted total loss is L(θ, {σ_i}) = Σ_{i=1}^{T} ( L_i(θ) / (2σ_i²) + log σ_i ). This objective can be derived from maximum likelihood under Gaussian likelihoods for regression (and analogous forms for classification), where log σ_i acts as a regularizer ensuring identifiability and preventing the trivial solution σ_i → ∞. For numerical stability, reparameterize with s_i = log σ_i² ∈ ℝ, giving L(θ, {s_i}) = Σ_{i=1}^{T} ( ½ e^{−s_i} L_i(θ) + ½ s_i ). The gradient with respect to s_i is ∂L/∂s_i = ½ (1 − e^{−s_i} L_i(θ)), which yields a simple update rule in gradient descent. The gradient with respect to the model parameters is a weighted sum: ∇_θ L = Σ_{i=1}^{T} w_i ∇_θ L_i(θ), where w_i = ½ e^{−s_i} = 1/(2σ_i²). At stationarity (holding θ fixed), e^{−s_i} L_i ≈ 1, implying w_i ∝ 1/L_i, thus down-weighting larger-scale (noisier) losses.

04 When to Use

Use uncertainty weighting in multi-task learning when tasks differ in loss scale, noise level, or difficulty and you do not want to hand-tune per-task weights. It is well-suited for joint regression tasks (e.g., depth and surface normal estimation), combinations of regression and classification (with appropriate likelihood forms), or any scenario where the noise level per task is roughly constant across inputs (homoscedastic). It is especially helpful early in training when task scales are very different and fixed heuristic weights can misguide learning. Avoid it when you have only a single task, when per-example (heteroscedastic) uncertainty is essential (then model σ as a function of inputs), or when task losses are already carefully normalized and comparable. Also consider that if tasks fundamentally conflict (negative transfer), weighting alone may be insufficient—techniques like gradient surgery, task routing, or separate optimizers might be needed in addition to or instead of uncertainty weighting.

⚠️Common Mistakes

  • Omitting the log σ_i term: Without it, the objective is trivially minimized by driving every σ_i → ∞, which shrinks all weights to zero and stops learning. Always include the regularizer.
  • Optimizing σ_i directly: Directly updating σ_i can make it negative or unstable. Optimize s_i = log σ_i² instead, which guarantees σ_i > 0 via σ_i = e^{s_i/2}.
  • Extreme weights from unbounded s_i: Large negative s_i produce huge weights and gradient explosions. Clamp s_i within a reasonable range (e.g., [−5, 5]) or add mild L2 regularization.
  • Mixing batch scales inconsistently: Compute L_i consistently (e.g., as a mean over the batch). If L_i has different implicit scaling per batch, weights may fluctuate wildly.
  • Wrong learning rates: The uncertainty parameters s_i often need a smaller, independent learning rate than the model parameters to avoid oscillation.
  • Comparing raw weights across tasks without context: Weights adapt to current losses; transient spikes are normal early in training.
  • Forgetting that homoscedasticity is an assumption: If uncertainty varies per input, consider heteroscedastic models with σ(x) instead of a single σ_i per task.

Key Formulas

Uncertainty-weighted multi-task loss

L(θ, {σ_i}) = Σ_{i=1}^{T} ( L_i(θ) / (2σ_i²) + log σ_i )

Explanation: Each task i contributes its loss scaled by 1/(2σ_i²) plus a log σ_i penalty. The penalty balances the tasks and stops the model from trivially inflating σ_i to ignore them.

Log-variance reparameterization and weight

s_i = log σ_i²,  w_i = ½ e^{−s_i}

Explanation: Optimizing s_i is unconstrained and numerically stable. The task weight is half the exponential of the negative log-variance.

Loss in log-variance form

L(θ, {s_i}) = Σ_{i=1}^{T} ( ½ e^{−s_i} L_i(θ) + ½ s_i )

Explanation: This is algebraically equivalent to the σ-parameterization but is preferred in practice due to its stability and the guaranteed positivity of σ.

Gradient w.r.t. log-variance

∂L/∂s_i = ½ (1 − e^{−s_i} L_i(θ))

Explanation: A simple closed-form gradient drives s_i so that e^{−s_i} L_i ≈ 1 at equilibrium. It is easy to implement and stable.

Weighted gradient for model parameters

∇_θ L = Σ_{i=1}^{T} w_i ∇_θ L_i(θ)

Explanation: Model parameters are updated with a weighted sum of per-task gradients, scaled by the learned weights w_i.

Gaussian NLL

−log p(y | f, σ²) = ‖y − f‖² / (2σ²) + ½ log(2πσ²)

Explanation: The negative log-likelihood for Gaussian noise decomposes into a quadratic error term and a log-variance term. Dropping constants yields the uncertainty-weighted loss form.

Weight definition

w_i = 1/(2σ_i²) = ½ e^{−s_i}

Explanation: Weights are inversely proportional to variance; higher-noise tasks are down-weighted automatically during training.

Stationarity condition

e^{−s_i} L_i(θ) ≈ 1  ⇒  w_i ∝ 1/L_i(θ)

Explanation: At a stationary point (for fixed θ), each task’s scaled loss tends to 1, showing that the method roughly inverts loss scales.

Complexity Analysis

Let T be the number of tasks and B the batch size used to compute each L_i. The additional computation for uncertainty weighting per step consists of: (1) computing the scalar weights w_i = ½ e^{−s_i} for all tasks, (2) forming the weighted sum of losses, and (3) updating the T log-variance parameters s_i by their gradients. Each of these steps is O(T) time and O(T) space; the overall training step remains dominated by the forward and backward passes of the model, which are typically O(B · C), where C captures model-dependent costs (e.g., number of parameters and operations per example). Memory overhead is small: you store T scalars for s_i and optionally the T weights. Backpropagation through the weighted sum simply scales each task’s gradient by w_i, so the backward complexity is unchanged up to constant factors. Numerically stable implementations may clamp s_i within a range (e.g., [−5, 5]) and optionally apply gradient clipping; both operations are O(T). In distributed or batched settings, the cost of reducing per-task losses remains O(T) and is negligible relative to model compute. Uncertainty weighting therefore adds minimal overhead and scales linearly with the number of tasks while preserving the asymptotic complexity of standard multi-task training.

Code Examples

Utility: Uncertainty weighter with log-variance parameters
#include <bits/stdc++.h>
using namespace std;

struct UncertaintyWeighter {
    // s_i = log(sigma_i^2) for each task
    vector<double> s;
    // Optional clamp range for numerical stability
    double s_min = -5.0, s_max = 5.0;

    explicit UncertaintyWeighter(size_t T, double init_logvar = 0.0)
        : s(T, init_logvar) {}

    // Compute weights w_i = 0.5 * exp(-s_i)
    vector<double> weights() const {
        vector<double> w(s.size());
        for (size_t i = 0; i < s.size(); ++i) {
            w[i] = 0.5 * exp(-s[i]);
        }
        return w;
    }

    // Combine task losses into total loss L = sum(0.5 * exp(-s_i) * L_i + 0.5 * s_i)
    double combineLosses(const vector<double>& L) const {
        if (L.size() != s.size()) throw runtime_error("Mismatched sizes");
        double total = 0.0;
        for (size_t i = 0; i < s.size(); ++i) {
            total += 0.5 * exp(-s[i]) * L[i] + 0.5 * s[i];
        }
        return total;
    }

    // One SGD step on s_i with learning rate eta_s,
    // using the gradient dL/ds_i = 0.5 * (1 - exp(-s_i) * L_i)
    void stepOnSigmas(const vector<double>& L, double eta_s) {
        if (L.size() != s.size()) throw runtime_error("Mismatched sizes");
        for (size_t i = 0; i < s.size(); ++i) {
            double grad = 0.5 * (1.0 - exp(-s[i]) * L[i]);
            s[i] -= eta_s * grad;  // gradient descent step
            // Clamp for stability
            s[i] = min(max(s[i], s_min), s_max);
        }
    }
};

int main() {
    // Example: two tasks with very different loss scales
    UncertaintyWeighter uw(2 /*T*/);
    // Initialize log-variances to 0 => sigma^2 = 1

    // Toy per-step losses (e.g., from two tasks). L1 is large, L2 is small.
    vector<double> L = {100.0, 1.0};

    // Update s for a few iterations to see the weights adapt
    double eta_s = 0.1;  // learning rate for s
    for (int t = 0; t < 20; ++t) {
        double total = uw.combineLosses(L);
        auto w = uw.weights();
        cout << "Step " << t
             << ": L_total=" << total
             << ", w=[" << w[0] << ", " << w[1] << "]"
             << ", s=[" << uw.s[0] << ", " << uw.s[1] << "]\n";
        uw.stepOnSigmas(L, eta_s);
        // Imagine L changing over training; here it's fixed just to illustrate adaptation
    }
    return 0;
}

This standalone utility encapsulates the uncertainty-weighting mechanism. It stores log-variances s_i, computes weights w_i = 0.5 e^{-s_i}, forms the combined loss including the 0.5 s_i regularizer, and performs an SGD step on s_i using the closed-form gradient. The demo shows how, when one task’s loss is much larger, its weight is reduced automatically over iterations.

Time: O(T) per call to compute weights, combine losses, or update s. Space: O(T) to store log-variances and weights.
End-to-end toy training: Two-task linear regression with shared slope and learned uncertainty
#include <bits/stdc++.h>
using namespace std;

struct UncertaintyWeighter {
    vector<double> s;  // log-variances s_i = log(sigma_i^2)
    double s_min = -5.0, s_max = 5.0;

    explicit UncertaintyWeighter(size_t T, double init_logvar = 0.0) : s(T, init_logvar) {}

    vector<double> weights() const {
        vector<double> w(s.size());
        for (size_t i = 0; i < s.size(); ++i) w[i] = 0.5 * exp(-s[i]);
        return w;
    }

    double combineLosses(const vector<double>& L) const {
        double total = 0.0;
        for (size_t i = 0; i < s.size(); ++i) total += 0.5 * exp(-s[i]) * L[i] + 0.5 * s[i];
        return total;
    }

    void stepOnSigmas(const vector<double>& L, double eta_s) {
        for (size_t i = 0; i < s.size(); ++i) {
            double grad = 0.5 * (1.0 - exp(-s[i]) * L[i]);
            s[i] -= eta_s * grad;
            s[i] = min(max(s[i], s_min), s_max);
        }
    }
};

struct Dataset {
    vector<double> x, y1, y2;
};

Dataset make_dataset(int n, unsigned seed = 42) {
    mt19937 rng(seed);
    uniform_real_distribution<double> U(-1.0, 1.0);
    normal_distribution<double> noise1(0.0, 0.5);  // lower-noise task
    normal_distribution<double> noise2(0.0, 2.0);  // higher-noise task
    Dataset d;
    d.x.resize(n); d.y1.resize(n); d.y2.resize(n);
    for (int i = 0; i < n; ++i) {
        double xi = U(rng);
        // True shared slope = 2.0, different biases
        d.x[i] = xi;
        d.y1[i] = 2.0 * xi + 1.0 + noise1(rng);
        d.y2[i] = 2.0 * xi - 1.0 + noise2(rng);
    }
    return d;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Model: shared slope 'a', task-specific biases 'b1', 'b2'
    double a = 0.0, b1 = 0.0, b2 = 0.0;

    // Uncertainty weighter for 2 tasks
    UncertaintyWeighter uw(2, 0.0);  // start with sigma^2 = 1 => s = 0

    // Hyperparameters
    double lr_theta = 0.05;  // learning rate for model params
    double lr_s = 0.01;      // learning rate for log-variances
    int epochs = 400;
    int n = 512;             // dataset size

    Dataset data = make_dataset(n);

    for (int epoch = 1; epoch <= epochs; ++epoch) {
        // Forward pass: predictions and per-task MSE
        vector<double> e1(n), e2(n);
        double L1 = 0.0, L2 = 0.0;
        for (int i = 0; i < n; ++i) {
            double y1_hat = a * data.x[i] + b1;
            double y2_hat = a * data.x[i] + b2;
            e1[i] = y1_hat - data.y1[i];
            e2[i] = y2_hat - data.y2[i];
            L1 += e1[i] * e1[i];
            L2 += e2[i] * e2[i];
        }
        L1 /= n; L2 /= n;  // mean squared error

        // Weights from uncertainty
        auto w = uw.weights();

        // Gradients of MSE wrt params (before weighting):
        // dL_i/da = (2/n) * sum(e_i * x), dL_i/db_i = (2/n) * sum(e_i)
        double dL1_da = 0.0, dL2_da = 0.0, dL1_db1 = 0.0, dL2_db2 = 0.0;
        for (int i = 0; i < n; ++i) {
            dL1_da += e1[i] * data.x[i];
            dL2_da += e2[i] * data.x[i];
            dL1_db1 += e1[i];
            dL2_db2 += e2[i];
        }
        dL1_da *= (2.0 / n); dL2_da *= (2.0 / n);
        dL1_db1 *= (2.0 / n); dL2_db2 *= (2.0 / n);

        // Weighted gradients for shared and task-specific params
        double dL_da = w[0] * dL1_da + w[1] * dL2_da;
        double dL_db1 = w[0] * dL1_db1;
        double dL_db2 = w[1] * dL2_db2;

        // Parameter updates (SGD)
        a -= lr_theta * dL_da;
        b1 -= lr_theta * dL_db1;
        b2 -= lr_theta * dL_db2;

        // Update uncertainty parameters s_i using the closed-form gradient
        uw.stepOnSigmas({L1, L2}, lr_s);

        if (epoch % 50 == 0) {
            double totalLoss = uw.combineLosses({L1, L2});
            cout << fixed << setprecision(4)
                 << "Epoch " << setw(3) << epoch
                 << " | L1=" << L1 << ", L2=" << L2
                 << " | w1=" << w[0] << ", w2=" << w[1]
                 << " | a=" << a << ", b1=" << b1 << ", b2=" << b2
                 << " | s1=" << uw.s[0] << ", s2=" << uw.s[1]
                 << " | L_total=" << totalLoss
                 << "\n";
        }
    }

    return 0;
}

This is a complete toy multi-task training loop without external libraries. Two regression tasks share the slope parameter and have different noise levels. We compute per-task MSE, form weighted gradients with w_i = 0.5 e^{-s_i}, update the model parameters, and separately update the log-variances s_i using the closed-form gradient. The learned weights down-weight the noisier task, allowing the model to focus on the more reliable signal while still learning from both.

Time: per epoch, O(n·T) for loss and gradient accumulation (here T = 2), plus O(T) for uncertainty updates. Space: O(n) to store residuals for a minibatch and O(T) for uncertainty parameters.
#multi-task learning #uncertainty weighting #homoscedastic uncertainty #aleatoric uncertainty #log-variance #gaussian likelihood #weighted loss #task balancing #gradient descent #maximum likelihood #loss scaling #regularization #variance #mse #classification