Weight Initialization Strategies
Key Points
- Weight initialization sets the starting values of neural network parameters so signals and gradients neither explode nor vanish as they pass through layers.
- Xavier/Glorot initialization preserves variance for symmetric activations like tanh/sigmoid by using variance 2/(fan_in + fan_out).
- He/Kaiming initialization preserves variance for ReLU-family activations by using variance 2/fan_in to counteract ReLU's 50% zeroing.
- fan_in is the number of inputs to a neuron and fan_out is the number of outputs it feeds; both determine the correct scaling of initial weights.
- Uniform and normal versions exist; their ranges/standard deviations are chosen so the weight variance matches the target formula.
- Using an initialization mismatched to the activation (e.g., Xavier with ReLU) can lead to vanishing or exploding activations/gradients.
- For convolutional layers, fan_in and fan_out must include kernel size and channel counts (e.g., in_channels × k_h × k_w).
- Initialization cost is linear in the number of parameters, and memory use is dominated by storing the weights.
Prerequisites
- Basic probability and variance — Understanding how sums of random variables combine variances is central to deriving initialization scales.
- Linear algebra (vectors, matrices) — Layers compute z = W x + b; reasoning about dimensions and matrix operations requires matrix fundamentals.
- Neural network activation functions — Initialization depends on how the activation changes variance (e.g., ReLU vs tanh).
- Gradient-based optimization — Balanced initialization stabilizes gradient magnitudes, improving training with backpropagation.
- Random number generation — Practical implementation uses uniform/normal RNGs with correct parameters and seeding.
- Convolutional layers — Correctly computing fan_in/fan_out for conv kernels prevents scale errors in CNNs.
Detailed Explanation
01 Overview
Weight initialization strategies define how to choose the starting values of neural network parameters before training. Good initialization is crucial because neural networks are optimized with gradient-based methods that rely on stable signal propagation in both forward and backward passes. If initial weights are too large, activations and gradients can explode; if too small, they can vanish. Both issues slow learning or cause it to fail entirely. Two widely used strategies are Xavier (also called Glorot) initialization and He (also called Kaiming) initialization. Xavier aims to keep the variance of activations and gradients roughly constant across layers for activations with outputs centered around zero and without hard truncation (e.g., tanh, logistic sigmoid at moderate scales). He initialization modifies the scaling to account for ReLU-like activations that zero out approximately half of their inputs, thereby halving variance unless compensated. These strategies provide closed-form formulas for the variance of the weight distribution, expressed in terms of fan_in (number of inputs) and fan_out (number of outputs) of a layer. They can be implemented using either uniform or normal distributions by matching the distribution's variance to the target formula. The result is faster convergence, more stable training, and reduced sensitivity to architecture depth.
02 Intuition & Analogies
Imagine a network as a series of pipes carrying water (the signal). Each layer transforms and passes the flow onward. If the pipes suddenly narrow too much, flow dwindles to a trickle (vanishing activations/gradients). If they widen too much, water gushes uncontrollably (exploding activations/gradients). Proper weight initialization sets the initial pipe diameters so that flow is steady from the first layer to the last. Now think statistically: inputs to a neuron are many small random contributions (weights times activations). Because variances of independent terms add, the sum's variance scales with how many terms you add and with the variance of each term. A layer with fan_in inputs adds up fan_in contributions. If each contribution has variance Var(W) × Var(input), then the sum's variance is roughly fan_in × Var(W) × Var(input). To keep the signal's magnitude stable across layers, we want the output variance to match the input variance. That determines how big Var(W) should be. ReLU adds a twist: it clips negative values to zero. If pre-activations are roughly symmetric around zero, ReLU keeps only positive halves. On average, that halves the variance going forward. He initialization compensates by doubling the pre-activation variance via choosing Var(W) = 2/fan_in, so after the ReLU's halving, the variance ends up close to what it started as. For tanh/sigmoid, which are roughly symmetric and don't zero out half the signal near the origin, the balanced choice is Var(W) = 2/(fan_in + fan_out), which keeps both forward activations and backward gradients in check. In short: choose the weight scale so the layer neither amplifies nor attenuates the flow of information.
03 Formal Definition
Consider a fully connected layer z = W x + b with fan_in inputs and fan_out outputs, where the entries of W are drawn i.i.d. with zero mean. Xavier/Glorot initialization sets \operatorname{Var}(W) = 2/(\text{fan_in} + \text{fan_out}): the normal variant draws W \sim \mathcal{N}(0, 2/(\text{fan_in} + \text{fan_out})), and the uniform variant draws W \sim U[-a, a] with a = \sqrt{6/(\text{fan_in} + \text{fan_out})}. He/Kaiming initialization sets \operatorname{Var}(W) = 2/\text{fan_in}: the normal variant uses \sigma = \sqrt{2/\text{fan_in}}, and the uniform variant uses a = \sqrt{6/\text{fan_in}}. Biases are typically initialized to zero.
04 When to Use
Use Xavier/Glorot initialization when activations are approximately symmetric around zero and do not truncate half the distribution: tanh, linear, or modestly scaled sigmoid (often with batch normalization). It balances forward and backward variance through the term (fan_in + fan_out), which is helpful for deep fully connected networks with smooth nonlinearities. Use He/Kaiming initialization when activations are ReLU-like: ReLU, leaky ReLU, ELU, GELU (often treated similarly). These activations set many outputs to zero, so the variance needs to be larger to compensate; He initialization does exactly that with 2/fan_in. For leaky ReLU, use He with the gain factor g = \sqrt{2/(1+\alpha^2)}; for SELU, prefer LeCun normal (\operatorname{Var}(W) = 1/\text{fan_in}) to match its self-normalizing property. In convolutional networks, compute fan_in/fan_out carefully using kernel dimensions and channels. If batch normalization is present, initialization becomes slightly less critical but still matters for early training stability and speed. When experimenting with new activations, derive a gain that keeps \operatorname{Var}(a_l) roughly constant by analyzing E[\phi(z)] and E[\phi(z)^2] under z \sim \mathcal{N}(0, v).
⚠️ Common Mistakes
- Mismatched activation and initializer: Using Xavier with ReLU can cause shrinking activations; using He with tanh can over-amplify. Always align the initializer with the nonlinearity's variance effect.
- Ignoring fan counts for convolutions: Forgetting to multiply by kernel area (k_h × k_w) leads to dramatically wrong scales. For 3D/1D convs, use the product of all spatial kernel sizes.
- Confusing uniform and normal parameters: For uniform U[-a, a], the variance is a^2/3, not a^2. For normal, the standard deviation is \sigma, and the variance is \sigma^2. Match these to the target Var(W).
- Forgetting bias initialization: Biases are often initialized to zero or small constants; applying the same scaling as weights can introduce unintended shifts.
- Using integer RNGs or poor seeding: Weights should be floating-point, drawn from high-quality PRNGs. Seed deterministically for reproducibility when needed.
- Ignoring gradient-side considerations: Xavier balances forward and backward; choosing 2/fan_in without a ReLU-like nonlinearity may stabilize the forward pass but destabilize gradients.
- Reusing initializations after shape changes: If you change layer widths or kernel sizes, recompute fan_in/fan_out and reinitialize.
- Overlooking normalization layers: BatchNorm can mask poor initialization but not fix extreme scales; still choose a principled initializer.
Key Formulas
Uniform Variance
\operatorname{Var}(U[-a, a]) = a^2/3
Explanation: The variance of a uniform distribution from -a to a equals a^2 divided by 3. This lets us match a uniform initializer's variance to a target value by solving for a.
Glorot Normal Stddev
\sigma = \sqrt{2/(\text{fan_in} + \text{fan_out})}
Explanation: For Xavier/Glorot normal, choose a zero-mean normal with standard deviation \sqrt{2/(\text{fan_in} + \text{fan_out})}. This preserves variance for tanh/sigmoid-like activations.
Glorot Uniform Bound
a = \sqrt{6/(\text{fan_in} + \text{fan_out})}
Explanation: For Xavier/Glorot uniform, draw from U[-a, a] with a = \sqrt{6/(\text{fan_in} + \text{fan_out})}. This yields variance 2/(fan_in + fan_out).
He Normal Stddev
\sigma = \sqrt{2/\text{fan_in}}
Explanation: For He/Kaiming normal, choose a zero-mean normal with standard deviation \sqrt{2/\text{fan_in}}. This compensates for ReLU halving the variance.
He Uniform Bound
a = \sqrt{6/\text{fan_in}}
Explanation: For He/Kaiming uniform, draw from U[-a, a] with a = \sqrt{6/\text{fan_in}}. The uniform's variance a^2/3 then equals 2/fan_in.
Pre-activation Variance
\operatorname{Var}(z) = \text{fan_in} \cdot \operatorname{Var}(W) \cdot \operatorname{Var}(x)
Explanation: Assuming independence and zero mean, each pre-activation component accumulates fan_in independent contributions. This relationship motivates how we scale Var(W).
ReLU Variance Halving
E[\operatorname{ReLU}(z)^2] = v/2 \text{ for } z \sim \mathcal{N}(0, v)
Explanation: Under zero-mean Gaussian inputs, ReLU keeps only positive values, reducing the second moment by half on average. He initialization compensates for this effect.
Glorot Target Variance
\operatorname{Var}(W) = 2/(\text{fan_in} + \text{fan_out})
Explanation: This target variance balances forward and backward signal magnitudes across layers for symmetric activations.
He Target Variance
\operatorname{Var}(W) = 2/\text{fan_in}
Explanation: This target variance preserves forward activation variance for ReLU-like activations that zero out about half the inputs.
Leaky ReLU Gain
g = \sqrt{2/(1+\alpha^2)}
Explanation: The gain adjusts the initializer for leaky ReLU with negative slope \alpha. Use \sigma = g/\sqrt{\text{fan_in}} for normal, or set uniform bounds accordingly.
2D Convolution Fans
\text{fan_in} = \text{in_channels} \cdot k_h \cdot k_w, \quad \text{fan_out} = \text{out_channels} \cdot k_h \cdot k_w
Explanation: For 2D convolutions, include both channel counts and kernel area when computing fan_in and fan_out before applying Glorot/He formulas.
Complexity Analysis
Initialization visits each parameter exactly once: one RNG draw plus a constant amount of arithmetic per weight, so time is O(P) for P parameters (P = fan_in × fan_out for a dense layer; P = out_channels × in_channels × k_h × k_w for a 2D conv layer). Memory is likewise O(P), dominated by storing the weights themselves.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute sample mean and variance of a vector (Welford's algorithm)
pair<double,double> mean_var(const vector<double>& v) {
    double mean = 0.0; double M2 = 0.0; size_t n = 0;
    for (double x : v) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        double delta2 = x - mean;
        M2 += delta * delta2;
    }
    double var = (n > 1) ? M2 / (n - 1) : 0.0; // unbiased estimate
    return {mean, var};
}

// Initialize W with Glorot (Xavier) uniform: U[-a, a], a = sqrt(6/(fan_in+fan_out))
void glorot_uniform(vector<double>& W, int fan_in, int fan_out, mt19937& rng) {
    double a = sqrt(6.0 / (fan_in + fan_out));
    uniform_real_distribution<double> dist(-a, a);
    for (double &w : W) w = dist(rng);
}

// Initialize W with Glorot (Xavier) normal: N(0, sigma^2), sigma = sqrt(2/(fan_in+fan_out))
void glorot_normal(vector<double>& W, int fan_in, int fan_out, mt19937& rng) {
    double sigma = sqrt(2.0 / (fan_in + fan_out));
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}

// Initialize W with He (Kaiming) uniform: U[-a, a], a = sqrt(6/fan_in)
void he_uniform(vector<double>& W, int fan_in, mt19937& rng) {
    double a = sqrt(6.0 / fan_in);
    uniform_real_distribution<double> dist(-a, a);
    for (double &w : W) w = dist(rng);
}

// Initialize W with He (Kaiming) normal: N(0, sigma^2), sigma = sqrt(2/fan_in)
void he_normal(vector<double>& W, int fan_in, mt19937& rng) {
    double sigma = sqrt(2.0 / fan_in);
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}

int main() {
    // Example: Dense layer 100 inputs -> 100 outputs
    int fan_in = 100, fan_out = 100;
    size_t num_params = static_cast<size_t>(fan_in) * fan_out;

    // RNG
    random_device rd;
    mt19937 rng(rd());

    // Weight buffer
    vector<double> W(num_params);

    // Glorot uniform
    glorot_uniform(W, fan_in, fan_out, rng);
    auto [m1, v1] = mean_var(W);
    cout << "Glorot uniform: mean=" << m1 << ", var~=" << v1
         << " (theory=" << 2.0 / (fan_in + fan_out) << ")\n";

    // Glorot normal
    glorot_normal(W, fan_in, fan_out, rng);
    auto [m2, v2] = mean_var(W);
    cout << "Glorot normal: mean=" << m2 << ", var~=" << v2
         << " (theory=" << 2.0 / (fan_in + fan_out) << ")\n";

    // He uniform
    he_uniform(W, fan_in, rng);
    auto [m3, v3] = mean_var(W);
    cout << "He uniform: mean=" << m3 << ", var~=" << v3
         << " (theory=" << 2.0 / fan_in << ")\n";

    // He normal
    he_normal(W, fan_in, rng);
    auto [m4, v4] = mean_var(W);
    cout << "He normal: mean=" << m4 << ", var~=" << v4
         << " (theory=" << 2.0 / fan_in << ")\n";

    return 0;
}
```
This program implements Glorot and He initializers (both uniform and normal). It initializes a 100×100 dense layer four times and prints the sample mean/variance against the theoretical target. Minor deviations are expected due to finite sampling.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple activations
inline double relu(double x) { return x > 0.0 ? x : 0.0; }
inline double tanh_act(double x) { return tanh(x); }

typedef function<double(double)> ActFn;

// Matrix-vector multiply: y = W x (W is rows x cols)
void matvec(const vector<double>& W, int rows, int cols, const vector<double>& x, vector<double>& y) {
    fill(y.begin(), y.end(), 0.0);
    for (int r = 0; r < rows; ++r) {
        double sum = 0.0;
        const double* wrow = &W[static_cast<size_t>(r) * cols];
        for (int c = 0; c < cols; ++c) sum += wrow[c] * x[c];
        y[r] = sum;
    }
}

// Initializers
void glorot_normal(vector<double>& W, int fan_in, int fan_out, mt19937& rng) {
    double sigma = sqrt(2.0 / (fan_in + fan_out));
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}
void he_normal(vector<double>& W, int fan_in, mt19937& rng) {
    double sigma = sqrt(2.0 / fan_in);
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}

pair<double,double> mean_var(const vector<double>& v) {
    double mean = 0.0; double M2 = 0.0; size_t n = 0;
    for (double x : v) { n++; double d = x - mean; mean += d/n; M2 += d*(x-mean); }
    double var = (n>1) ? M2/(n-1) : 0.0;
    return {mean, var};
}

// Build an L-layer MLP with width H and test forward variance stability
void test_pipeline(int H, int L, ActFn act, bool use_he_for_relu, mt19937& rng) {
    vector<vector<double>> weights(L);
    // Square layers HxH for simplicity
    for (int l = 0; l < L; ++l) {
        weights[l].resize(static_cast<size_t>(H) * H);
        if (use_he_for_relu) he_normal(weights[l], H, rng);
        else glorot_normal(weights[l], H, H, rng);
    }

    // Random zero-mean unit-variance input
    vector<double> x(H), z(H);
    normal_distribution<double> dist_in(0.0, 1.0);
    for (int i = 0; i < H; ++i) x[i] = dist_in(rng);

    auto mv = mean_var(x);
    cout << "Layer 0 (input): mean=" << mv.first << ", var=" << mv.second << "\n";

    for (int l = 0; l < L; ++l) {
        matvec(weights[l], H, H, x, z); // pre-activation
        for (int i = 0; i < H; ++i) z[i] = act(z[i]); // apply activation
        auto st = mean_var(z);
        cout << "Layer " << (l+1) << ": mean=" << st.first << ", var=" << st.second << "\n";
        x.swap(z);
    }
}

int main() {
    random_device rd; mt19937 rng(rd());
    int H = 512; int L = 10;

    cout << "--- ReLU with He (expected stable variance) ---\n";
    test_pipeline(H, L, relu, /*use_he_for_relu=*/true, rng);

    cout << "\n--- ReLU with Glorot (variance may shrink) ---\n";
    test_pipeline(H, L, relu, /*use_he_for_relu=*/false, rng);

    cout << "\n--- tanh with Glorot (expected stable near origin) ---\n";
    test_pipeline(H, L, tanh_act, /*use_he_for_relu=*/false, rng);

    return 0;
}
```
This simulation builds a deep stack of square dense layers and compares how activation variance evolves when using matching vs mismatched initializations. ReLU paired with He shows roughly stable variance; ReLU with Glorot tends to shrink; tanh with Glorot is typically stable near the origin.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute sample mean and variance
pair<double,double> mean_var(const vector<double>& v) {
    double mean = 0.0; double M2 = 0.0; size_t n = 0;
    for (double x : v) { n++; double d = x - mean; mean += d/n; M2 += d*(x-mean); }
    double var = (n>1) ? M2/(n-1) : 0.0;
    return {mean, var};
}

// He uniform for Conv2D weights: shape [out_channels, in_channels, k_h, k_w]
void he_uniform_conv2d(vector<double>& W, int out_c, int in_c, int k_h, int k_w, mt19937& rng) {
    long long fan_in = 1LL * in_c * k_h * k_w;
    double a = sqrt(6.0 / static_cast<double>(fan_in));
    uniform_real_distribution<double> dist(-a, a);
    for (double &w : W) w = dist(rng);
}

int main() {
    int out_c = 64, in_c = 3, k_h = 3, k_w = 3;
    size_t num_params = static_cast<size_t>(out_c) * in_c * k_h * k_w;
    vector<double> W(num_params);

    random_device rd; mt19937 rng(rd());
    he_uniform_conv2d(W, out_c, in_c, k_h, k_w, rng);

    auto [m, v] = mean_var(W);
    double fan_in = static_cast<double>(in_c * k_h * k_w);
    cout << "Conv2D He uniform: mean=" << m << ", var~=" << v
         << ", theory=" << 2.0 / fan_in << "\n";

    return 0;
}
```
This example initializes a 2D convolutional kernel tensor using He uniform. It computes fan_in = in_channels × k_h × k_w and matches the variance to 2/fan_in. The printed statistics should be close to theory for sufficiently large tensors.