📚 Theory · Intermediate

Multi-Task Loss Balancing

Key Points

  • Multi-task loss balancing aims to automatically set each task’s weight so that no single loss dominates training.
  • Uncertainty weighting uses a learnable noise parameter σ_i per task and defines the weight as w_i = 1/(2σ_i²), with an extra log σ_i term to prevent trivial solutions.
  • Reparameterizing with s_i = log σ_i² makes optimization stable and ensures σ_i stays positive.
  • At the optimum, each task’s contribution scales inversely with its noise, so noisier tasks get down-weighted while cleaner tasks drive learning.
  • The combined loss can be derived from maximum likelihood with Gaussian (and related) likelihoods, giving a principled probabilistic foundation.
  • In practice you update the model parameters with weighted gradients and update s_i with a simple closed-form gradient.
  • Numerically, clamp or regularize s_i to avoid extreme weights and use the log-variance parameterization to keep σ_i positive.
  • Computational overhead is minimal: combining T task losses and updating T uncertainties is O(T) per step in time and space.

Prerequisites

  • Gradient descent and backpropagation — Uncertainty weighting is optimized jointly with model parameters using gradient-based methods.
  • Loss functions (MSE, cross-entropy) — You must compute per-task losses consistently to form the weighted sum.
  • Probability and Gaussian likelihood — The uncertainty-weighted objective is derived from maximum likelihood with Gaussian noise.
  • Exponential and logarithm functions — Reparameterization uses s_i = log σ_i² and the weights use e^{−s_i}.
  • Numerical stability practices — Clamping, separate learning rates, and regularization prevent pathological weights.

Detailed Explanation


01 Overview

Multi-task loss balancing addresses a common problem in multi-task learning: different tasks naturally produce losses on different scales. If we simply sum the raw losses, the largest-magnitude loss can dominate, causing the model to prioritize one task at the expense of others. Uncertainty weighting provides a principled way to balance tasks by introducing a learnable noise parameter for each task that sets its weight automatically during training. Concretely, for task i with loss L_i, we add a learnable parameter σ_i (interpreted as homoscedastic, i.e. input-independent, noise). The combined loss is a weighted sum Σ_i L_i/(2σ_i²) plus a stabilizing term Σ_i log σ_i. This form arises naturally from maximum likelihood estimation under Gaussian (for regression) and related likelihoods (for classification) and yields a closed-form gradient for the uncertainty parameters. The effect is that tasks with higher inherent noise get smaller weights, letting the model focus on informative, lower-noise tasks without manual tuning of per-task weight hyperparameters. The method is lightweight, differentiable, and straightforward to integrate into any gradient-based training loop.

02 Intuition & Analogies

Imagine you’re trying to listen to several people speaking at once. Some are speaking clearly (low noise), others are mumbling (high noise). If you try to pay equal attention to all, the mumbling overwhelms you with confusion and you learn less from the clear speakers. A sensible strategy is to pay more attention to clear voices and less to the noisy ones. Uncertainty weighting does exactly this for machine learning tasks. Each task comes with its own level of background noise or difficulty. Instead of guessing how much to weight each task, we let the model learn how noisy each task is. The learned noise becomes a volume knob: if a task is noisy, turn it down; if it’s crisp and reliable, turn it up. The clever trick is to represent the volume knob via σ_i, the task’s noise level, and define the task’s weight as 1/(2σ_i²). This makes intuitive sense: if σ_i is large (very noisy), its weight shrinks; if σ_i is small (clean), its weight grows. But we must prevent the model from cheating by turning every knob down at once: driving each σ_i toward infinity shrinks every weight toward zero and makes the loss trivially small without learning anything. That’s why we add a penalty term, log σ_i, which grows as σ_i grows, keeping the game fair. By optimizing both the model parameters and these per-task noise knobs together, the learning automatically balances attention across tasks, focusing on where it can learn the most.

03 Formal Definition

Consider T tasks, each with a per-batch loss L_i(θ), i = 1, …, T, where θ collects the shared and task-specific model parameters. Assign each task a homoscedastic uncertainty parameter σ_i > 0. The uncertainty-weighted total loss is L(θ, {σ_i}) = Σ_{i=1}^{T} ( L_i(θ) / (2σ_i²) + log σ_i ). This objective can be derived from maximum likelihood under Gaussian likelihoods for regression (and analogous forms for classification), where log σ_i acts as a regularizer ensuring identifiability and preventing the trivial solution σ_i → ∞. For numerical stability, reparameterize with s_i = log σ_i² ∈ ℝ, giving L(θ, {s_i}) = Σ_{i=1}^{T} ( ½ e^{−s_i} L_i(θ) + ½ s_i ). The gradient with respect to s_i is ∂L/∂s_i = ½ (1 − e^{−s_i} L_i(θ)), which yields a simple update rule in gradient descent. The gradient with respect to the model parameters is a weighted sum: ∇_θ L = Σ_{i=1}^{T} w_i ∇_θ L_i(θ), where w_i = ½ e^{−s_i} = 1/(2σ_i²). At stationarity (holding θ fixed), e^{−s_i} L_i ≈ 1, implying w_i ∝ 1/L_i, thus down-weighting larger-scale (noisier) losses.

04 When to Use

Use uncertainty weighting in multi-task learning when tasks differ in loss scale, noise level, or difficulty and you do not want to hand-tune per-task weights. It is well-suited for joint regression tasks (e.g., depth and surface normal estimation), combinations of regression and classification (with appropriate likelihood forms), or any scenario where the noise level per task is roughly constant across inputs (homoscedastic). It is especially helpful early in training when task scales are very different and fixed heuristic weights can misguide learning. Avoid it when you have only a single task, when per-example (heteroscedastic) uncertainty is essential (then model σ as a function of inputs), or when task losses are already carefully normalized and comparable. Also consider that if tasks fundamentally conflict (negative transfer), weighting alone may be insufficient—techniques like gradient surgery, task routing, or separate optimizers might be needed in addition to or instead of uncertainty weighting.

⚠️Common Mistakes

  • Omitting the log σ_i term: Without it, the objective is trivially minimized by driving every σ_i → ∞, which shrinks all weights to zero and stops learning. Always include the regularizer.
  • Optimizing σ_i directly: Directly updating σ_i can make it negative or unstable. Optimize s_i = log σ_i² instead, which guarantees σ_i > 0 via σ_i = e^{s_i/2}.
  • Extreme weights from unbounded s_i: Large negative s_i produce huge weights and gradient explosions. Clamp s_i within a reasonable range (e.g., [−5, 5]) or add mild L2 regularization.
  • Mixing batch scales inconsistently: Compute L_i consistently (e.g., as a mean over the batch). If L_i has different implicit scaling per batch, weights may fluctuate wildly.
  • Wrong learning rates: The uncertainty parameters s_i often need a smaller, independent learning rate than the model parameters to avoid oscillation.
  • Comparing raw weights across tasks without context: Weights adapt to current losses; transient spikes are normal early in training.
  • Forgetting that homoscedasticity is an assumption: If uncertainty varies per input, consider heteroscedastic models with σ(x) instead of a single σ_i per task.

Key Formulas

Uncertainty-weighted multi-task loss

L(θ, {σ_i}) = Σ_{i=1}^{T} ( L_i(θ) / (2σ_i²) + log σ_i )

Explanation: Each task i contributes its loss scaled by 1/(2σ_i²) plus a log σ_i penalty. The penalty balances the tasks and stops the model from trivially inflating σ_i to ignore them.

Log-variance reparameterization and weight

s_i = log σ_i²,  w_i = ½ e^{−s_i}

Explanation: Optimizing s_i is unconstrained and numerically stable. The task weight is half the exponential of the negative log-variance.

Loss in log-variance form

L(θ, {s_i}) = Σ_{i=1}^{T} ( ½ e^{−s_i} L_i(θ) + ½ s_i )

Explanation: This is algebraically equivalent to the σ-parameterization but is preferred in practice due to its stability and the guaranteed positivity of σ.

Gradient w.r.t. log-variance

∂L/∂s_i = ½ (1 − e^{−s_i} L_i(θ))

Explanation: A simple closed-form gradient drives s_i so that e^{−s_i} L_i ≈ 1 at equilibrium. It is easy to implement and stable.

Weighted gradient for model parameters

∇_θ L = Σ_{i=1}^{T} w_i ∇_θ L_i(θ)

Explanation: Model parameters are updated with a weighted sum of per-task gradients, scaled by the learned weights w_i.

Gaussian NLL

−log p(y | f, σ²) = ‖y − f‖² / (2σ²) + ½ log(2πσ²)

Explanation: The negative log-likelihood for Gaussian noise decomposes into a quadratic error term and a log-variance term. Dropping constants yields the uncertainty-weighted loss form.

Weight definition

w_i = 1/(2σ_i²) = ½ e^{−s_i}

Explanation: Weights are inversely proportional to variance; higher-noise tasks are down-weighted automatically during training.

Stationarity condition

e^{−s_i} L_i(θ) ≈ 1  ⇒  w_i ∝ 1/L_i(θ)

Explanation: At a stationary point (for fixed θ), each task’s scaled loss tends to 1, showing that the method roughly inverts loss scales.

Complexity Analysis

Let T be the number of tasks and B the batch size used to compute each L_i. The additional computation for uncertainty weighting per step consists of: (1) computing the scalar weights w_i = ½ e^{−s_i} for all tasks, (2) forming the weighted sum of losses, and (3) updating the T log-variance parameters s_i by their gradients. Each of these steps is O(T) time and O(T) space; the overall training step remains dominated by the forward and backward passes of the model, which are typically O(B · C), where C captures model-dependent costs (e.g., number of parameters and operations per example). Memory overhead is small: you store T scalars for s_i and optionally the T weights. Backpropagation through the weighted sum simply scales each task’s gradient by w_i, so the backward complexity is unchanged up to constant factors. Numerically stable implementations may clamp s_i within a range (e.g., [−5, 5]) and optionally apply gradient clipping; both operations are O(T). In distributed or batched settings, the cost of reducing per-task losses remains O(T) and is negligible relative to model compute. Uncertainty weighting therefore adds minimal overhead and scales linearly with the number of tasks while preserving the asymptotic complexity of standard multi-task training.

Code Examples

Utility: Uncertainty weighter with log-variance parameters
#include <bits/stdc++.h>
using namespace std;

struct UncertaintyWeighter {
    // s_i = log(sigma_i^2) for each task
    vector<double> s;
    // Optional clamp range for numerical stability
    double s_min = -5.0, s_max = 5.0;

    explicit UncertaintyWeighter(size_t T, double init_logvar = 0.0)
        : s(T, init_logvar) {}

    // Compute weights w_i = 0.5 * exp(-s_i)
    vector<double> weights() const {
        vector<double> w(s.size());
        for (size_t i = 0; i < s.size(); ++i) {
            w[i] = 0.5 * exp(-s[i]);
        }
        return w;
    }

    // Combine task losses into total loss L = sum(0.5 * exp(-s_i) * L_i + 0.5 * s_i)
    double combineLosses(const vector<double>& L) const {
        if (L.size() != s.size()) throw runtime_error("Mismatched sizes");
        double total = 0.0;
        for (size_t i = 0; i < s.size(); ++i) {
            total += 0.5 * exp(-s[i]) * L[i] + 0.5 * s[i];
        }
        return total;
    }

    // One SGD step on s_i with learning rate eta_s,
    // using the gradient dL/ds_i = 0.5 * (1 - exp(-s_i) * L_i)
    void stepOnSigmas(const vector<double>& L, double eta_s) {
        if (L.size() != s.size()) throw runtime_error("Mismatched sizes");
        for (size_t i = 0; i < s.size(); ++i) {
            double grad = 0.5 * (1.0 - exp(-s[i]) * L[i]);
            s[i] -= eta_s * grad;  // gradient descent step
            // Clamp for stability
            s[i] = min(max(s[i], s_min), s_max);
        }
    }
};

int main() {
    // Example: two tasks with very different loss scales
    UncertaintyWeighter uw(2 /*T*/);
    // Initialize log-variances to 0 => sigma^2 = 1

    // Toy per-step losses (e.g., from two tasks). L1 is large, L2 is small.
    vector<double> L = {100.0, 1.0};

    // Update s for a few iterations to see the weights adapt
    double eta_s = 0.1;  // learning rate for s
    for (int t = 0; t < 20; ++t) {
        double total = uw.combineLosses(L);
        auto w = uw.weights();
        cout << "Step " << t
             << ": L_total=" << total
             << ", w=[" << w[0] << ", " << w[1] << "]"
             << ", s=[" << uw.s[0] << ", " << uw.s[1] << "]\n";
        uw.stepOnSigmas(L, eta_s);
        // Imagine L changing over training; here it's fixed just to illustrate adaptation
    }
    return 0;
}

This standalone utility encapsulates the uncertainty-weighting mechanism. It stores log-variances s_i, computes weights w_i = 0.5 e^{-s_i}, forms the combined loss including the 0.5 s_i regularizer, and performs an SGD step on s_i using the closed-form gradient. The demo shows how, when one task’s loss is much larger, its weight is reduced automatically over iterations.

Time: O(T) per call to compute weights, combine losses, or update s. Space: O(T) to store log-variances and weights.
End-to-end toy training: Two-task linear regression with shared slope and learned uncertainty
#include <bits/stdc++.h>
using namespace std;

struct UncertaintyWeighter {
    vector<double> s;  // log-variances s_i = log(sigma_i^2)
    double s_min = -5.0, s_max = 5.0;

    explicit UncertaintyWeighter(size_t T, double init_logvar = 0.0) : s(T, init_logvar) {}

    vector<double> weights() const {
        vector<double> w(s.size());
        for (size_t i = 0; i < s.size(); ++i) w[i] = 0.5 * exp(-s[i]);
        return w;
    }

    double combineLosses(const vector<double>& L) const {
        double total = 0.0;
        for (size_t i = 0; i < s.size(); ++i) total += 0.5 * exp(-s[i]) * L[i] + 0.5 * s[i];
        return total;
    }

    void stepOnSigmas(const vector<double>& L, double eta_s) {
        for (size_t i = 0; i < s.size(); ++i) {
            double grad = 0.5 * (1.0 - exp(-s[i]) * L[i]);
            s[i] -= eta_s * grad;
            s[i] = min(max(s[i], s_min), s_max);
        }
    }
};

struct Dataset {
    vector<double> x, y1, y2;
};

Dataset make_dataset(int n, unsigned seed = 42) {
    mt19937 rng(seed);
    uniform_real_distribution<double> U(-1.0, 1.0);
    normal_distribution<double> noise1(0.0, 0.5);  // lower-noise task
    normal_distribution<double> noise2(0.0, 2.0);  // higher-noise task
    Dataset d;
    d.x.resize(n); d.y1.resize(n); d.y2.resize(n);
    for (int i = 0; i < n; ++i) {
        double xi = U(rng);
        // True shared slope = 2.0, different biases
        d.x[i] = xi;
        d.y1[i] = 2.0 * xi + 1.0 + noise1(rng);
        d.y2[i] = 2.0 * xi - 1.0 + noise2(rng);
    }
    return d;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Model: shared slope 'a', task-specific biases 'b1', 'b2'
    double a = 0.0, b1 = 0.0, b2 = 0.0;

    // Uncertainty weighter for 2 tasks
    UncertaintyWeighter uw(2, 0.0);  // start with sigma^2 = 1 => s = 0

    // Hyperparameters
    double lr_theta = 0.05;  // learning rate for model params
    double lr_s = 0.01;      // learning rate for log-variances
    int epochs = 400;
    int n = 512;             // dataset size

    Dataset data = make_dataset(n);

    for (int epoch = 1; epoch <= epochs; ++epoch) {
        // Forward pass: predictions and per-task MSE
        vector<double> e1(n), e2(n);
        double L1 = 0.0, L2 = 0.0;
        for (int i = 0; i < n; ++i) {
            double y1_hat = a * data.x[i] + b1;
            double y2_hat = a * data.x[i] + b2;
            e1[i] = y1_hat - data.y1[i];
            e2[i] = y2_hat - data.y2[i];
            L1 += e1[i] * e1[i];
            L2 += e2[i] * e2[i];
        }
        L1 /= n; L2 /= n;  // mean squared error

        // Weights from uncertainty
        auto w = uw.weights();

        // Gradients of MSE wrt params (before weighting):
        // dL_i/da = (2/n) * sum(e_i * x), dL_i/db_i = (2/n) * sum(e_i)
        double dL1_da = 0.0, dL2_da = 0.0, dL1_db1 = 0.0, dL2_db2 = 0.0;
        for (int i = 0; i < n; ++i) {
            dL1_da += e1[i] * data.x[i];
            dL2_da += e2[i] * data.x[i];
            dL1_db1 += e1[i];
            dL2_db2 += e2[i];
        }
        dL1_da *= (2.0 / n); dL2_da *= (2.0 / n);
        dL1_db1 *= (2.0 / n); dL2_db2 *= (2.0 / n);

        // Weighted gradients for shared and task-specific params
        double dL_da = w[0] * dL1_da + w[1] * dL2_da;
        double dL_db1 = w[0] * dL1_db1;
        double dL_db2 = w[1] * dL2_db2;

        // Parameter updates (SGD)
        a -= lr_theta * dL_da;
        b1 -= lr_theta * dL_db1;
        b2 -= lr_theta * dL_db2;

        // Update uncertainty parameters s_i using the closed-form gradient
        uw.stepOnSigmas({L1, L2}, lr_s);

        if (epoch % 50 == 0) {
            double totalLoss = uw.combineLosses({L1, L2});
            cout << fixed << setprecision(4)
                 << "Epoch " << setw(3) << epoch
                 << " | L1=" << L1 << ", L2=" << L2
                 << " | w1=" << w[0] << ", w2=" << w[1]
                 << " | a=" << a << ", b1=" << b1 << ", b2=" << b2
                 << " | s1=" << uw.s[0] << ", s2=" << uw.s[1]
                 << " | L_total=" << totalLoss
                 << "\n";
        }
    }

    return 0;
}

This is a complete toy multi-task training loop without external libraries. Two regression tasks share the slope parameter and have different noise levels. We compute per-task MSE, form weighted gradients with w_i = 0.5 e^{-s_i}, update the model parameters, and separately update the log-variances s_i using the closed-form gradient. The learned weights down-weight the noisier task, allowing the model to focus on the more reliable signal while still learning from both.

Time: per epoch, O(n·T) for loss and gradient accumulation (here T = 2), plus O(T) for uncertainty updates. Space: O(n) to store residuals for a minibatch and O(T) for uncertainty parameters.
#multi-task learning #uncertainty weighting #homoscedastic uncertainty #aleatoric uncertainty #log-variance #gaussian likelihood #weighted loss #task balancing #gradient descent #maximum likelihood #loss scaling #regularization #variance #mse #classification