Multi-Task Loss Balancing
Key Points
- •Multi-task loss balancing aims to automatically set each task’s weight so that no single loss dominates training.
- •Uncertainty weighting uses a learnable noise parameter σ_i per task and defines the weight as 1/(2σ_i^2), with an extra \log σ_i term to prevent trivial solutions.
- •Reparameterizing with s_i = \log σ_i^2 makes optimization stable and ensures σ_i^2 stays positive.
- •At optimum, each task’s contribution scales inversely to its noise, so noisier tasks get down-weighted while cleaner tasks drive learning.
- •The combined loss can be derived from maximum likelihood with Gaussian (and related) likelihoods, giving a principled probabilistic foundation.
- •In practice you update model parameters with weighted gradients and update s_i with a simple closed-form gradient.
- •Numerically, clamp or regularize s_i to avoid extreme weights and use the log-variance parameterization to prevent negative σ_i^2.
- •Computational overhead is minimal: combining T task losses and updating T uncertainties is O(T) per step in time and space.
Prerequisites
- →Gradient descent and backpropagation — Uncertainty weighting is optimized jointly with model parameters using gradient-based methods.
- →Loss functions (MSE, cross-entropy) — You must compute per-task losses consistently to form the weighted sum.
- →Probability and Gaussian likelihood — The uncertainty-weighted objective is derived from maximum likelihood with Gaussian noise.
- →Exponential and logarithm functions — Reparameterization uses s_i = log σ_i^2 and weights use exp(-s_i).
- →Numerical stability practices — Clamping, separate learning rates, and regularization prevent pathological weights.
Detailed Explanation
01 Overview
Multi-task loss balancing addresses a common problem in multi-task learning: different tasks naturally produce losses on different scales. If we simply sum the raw losses, the largest-magnitude loss can dominate, causing the model to prioritize one task at the expense of others. Uncertainty weighting provides a principled way to balance tasks by introducing a learnable noise parameter for each task that sets its weight automatically during training. Concretely, for task i with loss L_i, we add a learnable parameter σ_i (interpreted as homoscedastic—input-independent—noise). The combined loss is a weighted sum \sum_i (1/(2σ_i^2)) L_i plus a stabilizing term \sum_i \log σ_i. This form arises naturally from maximum likelihood estimation under Gaussian (for regression) and related likelihoods (for classification) and yields a closed-form gradient for the uncertainty parameters. The effect is that tasks with higher inherent noise get smaller weights, letting the model focus on informative, lower-noise tasks without manual tuning of hyperparameters. The method is lightweight, differentiable, and straightforward to integrate into any gradient-based training loop.
02 Intuition & Analogies
Imagine you’re trying to listen to several people speaking at once. Some are speaking clearly (low noise), others are mumbling (high noise). If you try to pay equal attention to all, the mumbling overwhelms you with confusion and you learn less from the clear speakers. A sensible strategy is to pay more attention to clear voices and less to the noisy ones. Uncertainty weighting does exactly this for machine learning tasks. Each task comes with its own level of background noise or difficulty. Instead of guessing how much to weigh each task, we let the model learn how noisy each task is. The learned noise becomes a volume knob: if a task is noisy, turn it down; if it’s crisp and reliable, turn it up. The clever trick is to represent the volume knob via σ_i, the task’s noise level, and define the task’s weight as 1/(2σ_i^2). This makes intuitive sense: if σ_i is large (very noisy), its weight shrinks; if σ_i is small (clean), its weight grows. But we must prevent the model from cheating by driving σ_i to zero (infinite weight). That’s why we add a small penalty term, \log σ_i, which grows when σ_i becomes tiny, keeping the game fair. By optimizing both the model parameters and these per-task noise knobs together, the learning automatically balances attention across tasks, focusing on where it can learn the most.
03 Formal Definition
Given T tasks with per-task losses L_i(θ) and learnable noise parameters σ_i > 0, the uncertainty-weighted objective is L_total(θ, σ) = \sum_i (1/(2σ_i^2)) L_i(θ) + \sum_i \log σ_i. With the reparameterization s_i = \log σ_i^2, this becomes L_total(θ, s) = \sum_i [ (1/2) e^{-s_i} L_i(θ) + (1/2) s_i ], with task weight w_i = (1/2) e^{-s_i}. Both θ and s are optimized jointly by gradient descent.
04 When to Use
Use uncertainty weighting in multi-task learning when tasks differ in loss scale, noise level, or difficulty and you do not want to hand-tune per-task weights. It is well-suited for joint regression tasks (e.g., depth and surface normal estimation), combinations of regression and classification (with appropriate likelihood forms), or any scenario where the noise level per task is roughly constant across inputs (homoscedastic). It is especially helpful early in training when task scales are very different and fixed heuristic weights can misguide learning. Avoid it when you have only a single task, when per-example (heteroscedastic) uncertainty is essential (then model σ as a function of inputs), or when task losses are already carefully normalized and comparable. Also consider that if tasks fundamentally conflict (negative transfer), weighting alone may be insufficient—techniques like gradient surgery, task routing, or separate optimizers might be needed in addition to or instead of uncertainty weighting.
⚠️Common Mistakes
- Omitting the \log σ_i term: Without it, the model can drive σ_i \to 0, giving infinite weight to a task and collapsing training. Always include the regularizer.
- Optimizing σ_i directly: Directly updating σ_i can make it negative or unstable. Optimize s_i = \log σ_i^2 instead, which guarantees σ_i > 0 via σ_i = e^{s_i/2}.
- Extreme weights from unbounded s_i: Large negative s_i produce huge weights and gradient explosions. Clamp s_i within a reasonable range (e.g., [-5, 5]) or add mild L2 regularization.
- Mixing batch scales inconsistently: Compute L_i consistently (e.g., mean over batch). If L_i has different implicit scaling per batch, weights may fluctuate wildly.
- Wrong learning rates: The uncertainty parameters s_i often need a smaller/independent learning rate than the model parameters to avoid oscillation.
- Comparing raw weights across tasks without context: Weights adapt to current losses; transient spikes are normal early in training.
- Forgetting that homoscedasticity is an assumption: If uncertainty varies per input, consider heteroscedastic models with σ(x) instead of a single σ_i per task.
Key Formulas
Uncertainty-weighted multi-task loss
L_total = \sum_i [ (1/(2σ_i^2)) L_i + \log σ_i ]
Explanation: Each task i contributes its loss scaled by 1/(2σ_i^2), plus a \log σ_i penalty. This balances tasks and prevents σ_i from collapsing to zero.
Log-variance reparameterization and weight
s_i = \log σ_i^2, w_i = (1/2) e^{-s_i}
Explanation: Optimizing s_i is unconstrained and numerically stable. The task weight is 0.5 times the exponential of the negative log-variance.
Loss in log-variance form
L_total = \sum_i [ (1/2) e^{-s_i} L_i + (1/2) s_i ]
Explanation: This is algebraically equivalent to the σ form but is preferred in practice due to stability and the guaranteed positivity of σ_i^2 = e^{s_i}.
Gradient w.r.t. log-variance
∂L_total/∂s_i = (1/2)(1 − e^{-s_i} L_i)
Explanation: A simple, closed-form gradient drives s_i so that e^{-s_i} L_i ≈ 1 at equilibrium. It is easy to implement and stable.
Weighted gradient for model parameters
∇_θ L_total = \sum_i w_i ∇_θ L_i
Explanation: Model parameters are updated with a weighted sum of per-task gradients, scaled by the learned weights w_i = (1/2) e^{-s_i}.
Gaussian NLL
−\log p(y | f(x), σ) = (1/(2σ^2)) ‖y − f(x)‖^2 + \log σ + const
Explanation: The negative log-likelihood for Gaussian noise decomposes into a quadratic error term and a log-variance term. Dropping constants yields the uncertainty-weighted loss form.
Weight definition
w_i = 1/(2σ_i^2)
Explanation: Weights are inversely proportional to the task's noise variance; higher-noise tasks are down-weighted automatically during training.
Stationarity condition
e^{-s_i} L_i = 1, i.e. σ_i^2 = L_i
Explanation: At a stationary point (for fixed L_i), each task's scaled loss tends to 1, showing that the method roughly inverts loss scales.
Complexity Analysis
Combining T per-task losses and updating the T log-variances adds O(T) time and O(T) memory per training step, which is negligible next to the model's forward and backward passes. The closed-form gradient for each s_i reuses the already-computed loss value L_i, so no extra backpropagation is required.
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct UncertaintyWeighter {
    // s_i = log(sigma_i^2) for each task
    vector<double> s;
    // Optional clamp range for numerical stability
    double s_min = -5.0, s_max = 5.0;

    explicit UncertaintyWeighter(size_t T, double init_logvar = 0.0)
        : s(T, init_logvar) {}

    // Compute weights w_i = 0.5 * exp(-s_i)
    vector<double> weights() const {
        vector<double> w(s.size());
        for (size_t i = 0; i < s.size(); ++i) {
            w[i] = 0.5 * exp(-s[i]);
        }
        return w;
    }

    // Combine task losses into total loss L = sum(0.5 * exp(-s_i) * L_i + 0.5 * s_i)
    double combineLosses(const vector<double>& L) const {
        if (L.size() != s.size()) throw runtime_error("Mismatched sizes");
        double total = 0.0;
        for (size_t i = 0; i < s.size(); ++i) {
            total += 0.5 * exp(-s[i]) * L[i] + 0.5 * s[i];
        }
        return total;
    }

    // One SGD step on s_i with learning rate eta_s using gradient dL/ds_i = 0.5*(1 - exp(-s_i)*L_i)
    void stepOnSigmas(const vector<double>& L, double eta_s) {
        if (L.size() != s.size()) throw runtime_error("Mismatched sizes");
        for (size_t i = 0; i < s.size(); ++i) {
            double grad = 0.5 * (1.0 - exp(-s[i]) * L[i]);
            s[i] -= eta_s * grad; // gradient descent step
            // Clamp for stability
            s[i] = min(max(s[i], s_min), s_max);
        }
    }
};

int main() {
    // Example: two tasks with very different loss scales
    UncertaintyWeighter uw(2 /*T*/);
    // Initialize log-variances to 0 => sigma^2 = 1

    // Toy per-step losses (e.g., from two tasks). L1 is large, L2 is small.
    vector<double> L = {100.0, 1.0};

    // Update s for a few iterations to see weights adapt
    double eta_s = 0.1; // learning rate for s
    for (int t = 0; t < 20; ++t) {
        double total = uw.combineLosses(L);
        auto w = uw.weights();
        cout << "Step " << t
             << ": L_total=" << total
             << ", w=[" << w[0] << ", " << w[1] << "]"
             << ", s=[" << uw.s[0] << ", " << uw.s[1] << "]\n";
        uw.stepOnSigmas(L, eta_s);
        // Imagine L changes over training; here it's fixed just to illustrate adaptation
    }
    return 0;
}
This standalone utility encapsulates the uncertainty-weighting mechanism. It stores log-variances s_i, computes weights w_i = 0.5 e^{-s_i}, forms the combined loss including the 0.5 s_i regularizer, and performs an SGD step on s_i using the closed-form gradient. The demo shows how, when one task’s loss is much larger, its weight is reduced automatically over iterations.
#include <bits/stdc++.h>
using namespace std;

struct UncertaintyWeighter {
    vector<double> s; // log-variances s_i = log(sigma_i^2)
    double s_min = -5.0, s_max = 5.0;

    explicit UncertaintyWeighter(size_t T, double init_logvar = 0.0) : s(T, init_logvar) {}

    vector<double> weights() const {
        vector<double> w(s.size());
        for (size_t i = 0; i < s.size(); ++i) w[i] = 0.5 * exp(-s[i]);
        return w;
    }

    double combineLosses(const vector<double>& L) const {
        double total = 0.0;
        for (size_t i = 0; i < s.size(); ++i) total += 0.5 * exp(-s[i]) * L[i] + 0.5 * s[i];
        return total;
    }

    void stepOnSigmas(const vector<double>& L, double eta_s) {
        for (size_t i = 0; i < s.size(); ++i) {
            double grad = 0.5 * (1.0 - exp(-s[i]) * L[i]);
            s[i] -= eta_s * grad;
            s[i] = min(max(s[i], s_min), s_max);
        }
    }
};

struct Dataset {
    vector<double> x, y1, y2;
};

Dataset make_dataset(int n, unsigned seed = 42) {
    mt19937 rng(seed);
    uniform_real_distribution<double> U(-1.0, 1.0);
    normal_distribution<double> noise1(0.0, 0.5); // lower noise task
    normal_distribution<double> noise2(0.0, 2.0); // higher noise task
    Dataset d;
    d.x.resize(n); d.y1.resize(n); d.y2.resize(n);
    for (int i = 0; i < n; ++i) {
        double xi = U(rng);
        // True shared slope = 2.0, different biases
        d.x[i] = xi;
        d.y1[i] = 2.0 * xi + 1.0 + noise1(rng);
        d.y2[i] = 2.0 * xi - 1.0 + noise2(rng);
    }
    return d;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Model: shared slope 'a', task-specific biases 'b1', 'b2'
    double a = 0.0, b1 = 0.0, b2 = 0.0;

    // Uncertainty weighter for 2 tasks
    UncertaintyWeighter uw(2, 0.0); // start with sigma^2 = 1 => s = 0

    // Hyperparameters
    double lr_theta = 0.05; // learning rate for model params
    double lr_s = 0.01;     // learning rate for log-variances
    int epochs = 400;
    int n = 512; // dataset size

    Dataset data = make_dataset(n);

    for (int epoch = 1; epoch <= epochs; ++epoch) {
        // Forward pass: predictions and per-task MSE
        vector<double> e1(n), e2(n);
        double L1 = 0.0, L2 = 0.0;
        for (int i = 0; i < n; ++i) {
            double y1_hat = a * data.x[i] + b1;
            double y2_hat = a * data.x[i] + b2;
            e1[i] = y1_hat - data.y1[i];
            e2[i] = y2_hat - data.y2[i];
            L1 += e1[i] * e1[i];
            L2 += e2[i] * e2[i];
        }
        L1 /= n; L2 /= n; // mean squared error

        // Weights from uncertainty
        auto w = uw.weights();

        // Gradients of MSE wrt params (before weighting)
        // dL_i/da = (2/n) * sum(e_i * x), dL_i/db_i = (2/n) * sum(e_i)
        double dL1_da = 0.0, dL2_da = 0.0, dL1_db1 = 0.0, dL2_db2 = 0.0;
        for (int i = 0; i < n; ++i) {
            dL1_da += e1[i] * data.x[i];
            dL2_da += e2[i] * data.x[i];
            dL1_db1 += e1[i];
            dL2_db2 += e2[i];
        }
        dL1_da *= (2.0 / n); dL2_da *= (2.0 / n);
        dL1_db1 *= (2.0 / n); dL2_db2 *= (2.0 / n);

        // Weighted gradients for shared and task-specific params
        double dL_da = w[0] * dL1_da + w[1] * dL2_da;
        double dL_db1 = w[0] * dL1_db1;
        double dL_db2 = w[1] * dL2_db2;

        // Parameter updates (SGD)
        a -= lr_theta * dL_da;
        b1 -= lr_theta * dL_db1;
        b2 -= lr_theta * dL_db2;

        // Update uncertainty parameters s_i using closed-form gradient
        uw.stepOnSigmas({L1, L2}, lr_s);

        if (epoch % 50 == 0) {
            double totalLoss = uw.combineLosses({L1, L2});
            cout << fixed << setprecision(4)
                 << "Epoch " << setw(3) << epoch
                 << " | L1=" << L1 << ", L2=" << L2
                 << " | w1=" << w[0] << ", w2=" << w[1]
                 << " | a=" << a << ", b1=" << b1 << ", b2=" << b2
                 << " | s1=" << uw.s[0] << ", s2=" << uw.s[1]
                 << " | L_total=" << totalLoss
                 << "\n";
        }
    }

    return 0;
}
This is a complete toy multi-task training loop without external libraries. Two regression tasks share the slope parameter and have different noise levels. We compute per-task MSE, form weighted gradients with w_i = 0.5 e^{-s_i}, update the model parameters, and separately update the log-variances s_i using the closed-form gradient. The learned weights down-weight the noisier task, allowing the model to focus on the more reliable signal while still learning from both.