
Weight Initialization Strategies

Key Points

  • Weight initialization sets the starting values of neural network parameters so signals and gradients neither explode nor vanish as they pass through layers.
  • Xavier/Glorot initialization preserves variance for symmetric activations like tanh/sigmoid by using variance 2/(fan_in + fan_out).
  • He/Kaiming initialization preserves variance for ReLU-family activations by using variance 2/fan_in to counteract ReLU's 50% zeroing.
  • fan_in is the number of inputs to a neuron and fan_out is the number of outputs it feeds; both determine the correct scaling of initial weights.
  • Uniform and normal versions exist; their ranges/standard deviations are chosen so the weight variance matches the target formula.
  • Using an initialization mismatched to the activation (e.g., Xavier with ReLU) can lead to vanishing or exploding activations/gradients.
  • For convolutional layers, fan_in and fan_out must include kernel size and channel counts (e.g., in_channels × k_h × k_w).
  • Initialization cost is linear in the number of parameters, and memory use is dominated by storing the weights.

Prerequisites

  • Basic probability and variance — Understanding how sums of random variables combine variances is central to deriving initialization scales.
  • Linear algebra (vectors, matrices) — Layers compute z = W x + b; reasoning about dimensions and matrix operations requires matrix fundamentals.
  • Neural network activation functions — Initialization depends on how the activation changes variance (e.g., ReLU vs tanh).
  • Gradient-based optimization — Balanced initialization stabilizes gradient magnitudes, improving training with backpropagation.
  • Random number generation — Practical implementation uses uniform/normal RNGs with correct parameters and seeding.
  • Convolutional layers — Correctly computing fan_in/fan_out for conv kernels prevents scale errors in CNNs.

Detailed Explanation


01 Overview

Weight initialization strategies define how to choose the starting values of neural network parameters before training. Good initialization is crucial because neural networks are optimized with gradient-based methods that rely on stable signal propagation in both forward and backward passes. If initial weights are too large, activations and gradients can explode; if too small, they can vanish. Both issues slow learning or cause it to fail entirely.

Two widely used strategies are Xavier (also called Glorot) initialization and He (also called Kaiming) initialization. Xavier aims to keep the variance of activations and gradients roughly constant across layers for activations with outputs centered around zero and without hard truncation (e.g., tanh, logistic sigmoid at moderate scales). He initialization modifies the scaling to account for ReLU-like activations that zero out approximately half of their inputs, thereby halving variance unless compensated.

These strategies provide closed-form formulas for the variance of the weight distribution, expressed in terms of fan_in (number of inputs) and fan_out (number of outputs) of a layer. They can be implemented using either uniform or normal distributions by matching the distribution's variance to the target formula. The result is faster convergence, more stable training, and reduced sensitivity to architecture depth.

02 Intuition & Analogies

Imagine a network as a series of pipes carrying water (the signal). Each layer transforms and passes the flow onward. If the pipes suddenly narrow too much, flow dwindles to a trickle (vanishing activations/gradients). If they widen too much, water gushes uncontrollably (exploding activations/gradients). Proper weight initialization sets the initial pipe diameters so that flow is steady from the first layer to the last.

Now think statistically: inputs to a neuron are many small random contributions (weights times activations). Their sum's variance scales with how many terms you add and with the variance of each term. A layer with fan_in inputs adds up fan_in contributions. If each contribution has variance Var(W) × Var(input), then the sum's variance is roughly fan_in × Var(W) × Var(input). To keep the signal's magnitude stable across layers, we want the output variance to match the input variance. That determines how big Var(W) should be.

ReLU adds a twist: it clips negative values to zero. If pre-activations are roughly symmetric around zero, ReLU keeps only the positive halves. On average, that halves the variance going forward. He initialization compensates by doubling the pre-activation variance via choosing Var(W) = 2/fan_in, so after the ReLU's halving, the variance ends up close to what it started as. For tanh/sigmoid, which are roughly symmetric and don't zero out half the signal near the origin, the balanced choice is Var(W) = 2/(fan_in + fan_out), which keeps both forward activations and backward gradients in check. In short: choose the weight scale so the layer neither amplifies nor attenuates the flow of information.

03 Formal Definition

Consider a feedforward layer z = W x + b with x ∈ R^fan_in, z ∈ R^fan_out. Assume entries of W are i.i.d., zero-mean, and independent of x, which itself has zero mean and variance Var(x_i) = v_x. Then each pre-activation component has variance Var(z_j) = Σ_{i=1}^{fan_in} Var(W_ji x_i) = fan_in · Var(W) · v_x. After applying a nonlinearity a = φ(z), the variance changes according to the activation's effect (e.g., ReLU halves the second moment under Gaussian assumptions). Xavier/Glorot initialization sets the weight variance so that forward and backward variances are balanced: Var(W) = 2/(fan_in + fan_out). This can be realized using a uniform distribution U[-a, a] with a = sqrt(6/(fan_in + fan_out)) or a normal distribution with standard deviation σ = sqrt(2/(fan_in + fan_out)). He/Kaiming initialization targets ReLU-like activations by preserving forward variance after the nonlinearity: Var(W) = 2/fan_in. Implementations include a normal N(0, σ²) with σ = sqrt(2/fan_in) or uniform U[-a, a] with a = sqrt(6/fan_in). For leaky ReLU with negative slope α, a general gain g = sqrt(2/(1 + α²)) adjusts σ = g/sqrt(fan_in). For convolutional layers, fan counts include kernel size: fan_in = C_in · k_h · k_w and fan_out = C_out · k_h · k_w (2D case).

04 When to Use

Use Xavier/Glorot initialization when activations are approximately symmetric around zero and do not truncate half the distribution: tanh, linear, or modestly scaled sigmoid (often with batch normalization). It balances forward and backward variance through the term (fan_in + fan_out), which is helpful for deep fully connected networks with smooth nonlinearities. Use He/Kaiming initialization when activations are ReLU-like: ReLU, leaky ReLU, ELU, GELU (often treated similarly). These activations set many outputs to zero, so the weight variance needs to be larger to compensate; He initialization does exactly that with 2/fan_in. For leaky ReLU, use He with the gain factor g = sqrt(2/(1 + α²)); for SELU, prefer LeCun normal (Var(W) = 1/fan_in) to match its self-normalizing property. In convolutional networks, compute fan_in/fan_out carefully using kernel dimensions and channels. If batch normalization is present, initialization becomes slightly less critical but still matters for early training stability and speed. When experimenting with new activations, derive a gain that keeps Var(a_l) roughly constant by analyzing E[φ(z)] and E[φ(z)²] under z ~ N(0, v).

⚠️Common Mistakes

• Mismatched activation and initializer: Using Xavier with ReLU can cause shrinking activations; using He with tanh can over-amplify. Always align the initializer with the nonlinearity's variance effect.
• Ignoring fan counts for convolutions: Forgetting to multiply by kernel area (k_h × k_w) leads to dramatically wrong scales. For 3D/1D convs, use the product of all spatial kernel sizes.
• Confusing uniform and normal parameters: For uniform U[-a, a], the variance is a²/3, not a². For normal, the standard deviation is σ and the variance is σ². Match these to the target Var(W).
• Forgetting bias initialization: Biases are often initialized to zero or small constants; applying the same scaling as weights can introduce unintended shifts.
• Using integer RNGs or poor seeding: Weights should be floating-point, drawn from high-quality PRNGs. Seed deterministically for reproducibility when needed.
• Ignoring gradient-side considerations: Xavier balances forward and backward; choosing 2/fan_in without a ReLU-like nonlinearity may stabilize the forward pass but destabilize gradients.
• Reusing initializations after shape changes: If you change layer widths or kernel sizes, recompute fan_in/fan_out and reinitialize.
• Overlooking normalization layers: BatchNorm can mask poor initialization but not fix extreme scales; still choose a principled initializer.

Key Formulas

Uniform Variance

Var(U[-a, a]) = a²/3

Explanation: The variance of a uniform distribution from -a to a equals a² divided by 3. This lets us match a uniform initializer's variance to a target value by solving for a.

Glorot Normal Stddev

σ = sqrt(2/(fan_in + fan_out))

Explanation: For Xavier/Glorot normal, choose a zero-mean normal with standard deviation sqrt(2/(fan_in + fan_out)). This preserves variance for tanh/sigmoid-like activations.

Glorot Uniform Bound

a = sqrt(6/(fan_in + fan_out))

Explanation: For Xavier/Glorot uniform, draw from U[-a, a] with a = sqrt(6/(fan_in + fan_out)). This yields variance 2/(fan_in + fan_out).

He Normal Stddev

σ = sqrt(2/fan_in)

Explanation: For He/Kaiming normal, choose a zero-mean normal with standard deviation sqrt(2/fan_in). This compensates for ReLU halving the variance.

He Uniform Bound

a = sqrt(6/fan_in)

Explanation: For He/Kaiming uniform, draw from U[-a, a] with a = sqrt(6/fan_in). The uniform's variance a²/3 equals 2/fan_in.

Pre-activation Variance

Var(z_j) = Σ_{i=1}^{fan_in} Var(W_ji x_i) = fan_in · Var(W) · Var(x)

Explanation: Assuming independence and zero mean, each pre-activation component accumulates fan_in independent contributions. This relationship motivates how we scale Var(W).

ReLU Variance Halving

E[ReLU(z)²] = (1/2) Var(z)  (z ~ N(0, v))

Explanation: Under zero-mean Gaussian inputs, ReLU zeroes the negative half, so the second moment — the quantity tracked in the He derivation — is halved. (The variance proper is (1/2 − 1/(2π)) v, slightly less, because ReLU's output has nonzero mean.) He initialization compensates for this halving.

Glorot Target Variance

Var(W) = 2/(fan_in + fan_out)

Explanation: This target variance balances forward and backward signal magnitudes across layers for symmetric activations.

He Target Variance

Var(W) = 2/fan_in

Explanation: This target variance preserves forward activation variance for ReLU-like activations that zero out about half the inputs.

Leaky ReLU Gain

g_leaky ReLU = sqrt(2/(1 + α²))

Explanation: The gain adjusts the initializer for leaky ReLU with negative slope α. Use σ = g/sqrt(fan_in) for normal or set uniform bounds accordingly.

2D Convolution Fans

fan_in = C_in · k_h · k_w,  fan_out = C_out · k_h · k_w

Explanation: For 2D convolutions, include both channel counts and kernel area when computing fan_in and fan_out before applying Glorot/He formulas.

Complexity Analysis

Initializing weights draws one random number per parameter and writes it to memory. For a dense layer with dimensions (fan_out × fan_in), both Xavier and He initialization run in O(fan_in × fan_out) time, since each weight is independently sampled. For convolutional layers with weight tensor size C_out × C_in × k_h × k_w, time is O(C_out × C_in × k_h × k_w). The constants are small because sampling from std::uniform_real_distribution or std::normal_distribution is efficient.

Space complexity is dominated by storing the parameters themselves. For a dense matrix W of size M × N, memory is O(MN). The initializer uses O(1) additional space beyond the output array (plus negligible space for the RNG state). Buffers used for computing statistics (e.g., sample mean/variance) are also O(1) if computed in a single pass.

In practice, initialization is I/O-bound by memory writes for very large models. The computational cost of generating random numbers is usually smaller than the cost of touching every cache line in the parameter tensor. When comparing uniform vs normal initializers, normal sampling may be slightly slower due to transformation cost (e.g., Box–Muller or Ziggurat), but the asymptotic complexity remains linear in the number of parameters.

Code Examples

Glorot (Xavier) and He initializers for dense layers with sample variance check
#include <bits/stdc++.h>
using namespace std;

// Compute sample mean and variance of a vector
pair<double,double> mean_var(const vector<double>& v) {
    double mean = 0.0; double M2 = 0.0; size_t n = 0;
    for (double x : v) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        double delta2 = x - mean;
        M2 += delta * delta2;
    }
    double var = (n > 1) ? M2 / (n - 1) : 0.0; // unbiased estimate
    return {mean, var};
}

// Initialize W with Glorot (Xavier) uniform: U[-a, a], a = sqrt(6/(fan_in+fan_out))
void glorot_uniform(vector<double>& W, int fan_in, int fan_out, mt19937& rng) {
    double a = sqrt(6.0 / (fan_in + fan_out));
    uniform_real_distribution<double> dist(-a, a);
    for (double &w : W) w = dist(rng);
}

// Initialize W with Glorot (Xavier) normal: N(0, sigma^2), sigma = sqrt(2/(fan_in+fan_out))
void glorot_normal(vector<double>& W, int fan_in, int fan_out, mt19937& rng) {
    double sigma = sqrt(2.0 / (fan_in + fan_out));
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}

// Initialize W with He (Kaiming) uniform: U[-a, a], a = sqrt(6/fan_in)
void he_uniform(vector<double>& W, int fan_in, mt19937& rng) {
    double a = sqrt(6.0 / fan_in);
    uniform_real_distribution<double> dist(-a, a);
    for (double &w : W) w = dist(rng);
}

// Initialize W with He (Kaiming) normal: N(0, sigma^2), sigma = sqrt(2/fan_in)
void he_normal(vector<double>& W, int fan_in, mt19937& rng) {
    double sigma = sqrt(2.0 / fan_in);
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}

int main() {
    // Example: Dense layer 100 inputs -> 100 outputs
    int fan_in = 100, fan_out = 100;
    size_t num_params = static_cast<size_t>(fan_in) * fan_out;

    // RNG
    random_device rd;
    mt19937 rng(rd());

    // Buffers
    vector<double> W(num_params);

    // Glorot uniform
    glorot_uniform(W, fan_in, fan_out, rng);
    auto [m1, v1] = mean_var(W);
    cout << "Glorot uniform: mean=" << m1 << ", var~=" << v1
         << " (theory= " << 2.0 / (fan_in + fan_out) << ")\n";

    // Glorot normal
    glorot_normal(W, fan_in, fan_out, rng);
    auto [m2, v2] = mean_var(W);
    cout << "Glorot normal: mean=" << m2 << ", var~=" << v2
         << " (theory= " << 2.0 / (fan_in + fan_out) << ")\n";

    // He uniform
    he_uniform(W, fan_in, rng);
    auto [m3, v3] = mean_var(W);
    cout << "He uniform: mean=" << m3 << ", var~=" << v3
         << " (theory= " << 2.0 / fan_in << ")\n";

    // He normal
    he_normal(W, fan_in, rng);
    auto [m4, v4] = mean_var(W);
    cout << "He normal: mean=" << m4 << ", var~=" << v4
         << " (theory= " << 2.0 / fan_in << ")\n";

    return 0;
}

This program implements Glorot and He initializers (both uniform and normal). It initializes a 100×100 dense layer four times and prints the sample mean/variance against the theoretical target. Minor deviations are expected due to finite sampling.

Time: O(fan_in × fan_out), Space: O(fan_in × fan_out)
Variance propagation through many layers: ReLU vs tanh with matching initializations
#include <bits/stdc++.h>
using namespace std;

// Simple activations
inline double relu(double x) { return x > 0.0 ? x : 0.0; }
inline double tanh_act(double x) { return tanh(x); }

typedef function<double(double)> ActFn;

// Matrix-vector multiply: y = W x (W is rows x cols)
void matvec(const vector<double>& W, int rows, int cols, const vector<double>& x, vector<double>& y) {
    fill(y.begin(), y.end(), 0.0);
    for (int r = 0; r < rows; ++r) {
        double sum = 0.0;
        const double* wrow = &W[static_cast<size_t>(r) * cols];
        for (int c = 0; c < cols; ++c) sum += wrow[c] * x[c];
        y[r] = sum;
    }
}

// Initializers
void glorot_normal(vector<double>& W, int fan_in, int fan_out, mt19937& rng) {
    double sigma = sqrt(2.0 / (fan_in + fan_out));
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}
void he_normal(vector<double>& W, int fan_in, mt19937& rng) {
    double sigma = sqrt(2.0 / fan_in);
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}

pair<double,double> mean_var(const vector<double>& v) {
    double mean = 0.0; double M2 = 0.0; size_t n = 0;
    for (double x : v) { n++; double d = x - mean; mean += d/n; M2 += d*(x-mean); }
    double var = (n>1) ? M2/(n-1) : 0.0; return {mean, var};
}

// Build an L-layer MLP with width H and test forward variance stability
void test_pipeline(int H, int L, ActFn act, bool use_he_for_relu, mt19937& rng) {
    vector<vector<double>> weights(L);
    // Square layers HxH for simplicity
    for (int l = 0; l < L; ++l) {
        weights[l].resize(static_cast<size_t>(H) * H);
        if (use_he_for_relu) he_normal(weights[l], H, rng);
        else glorot_normal(weights[l], H, H, rng);
    }

    // Random zero-mean unit-variance input
    vector<double> x(H), z(H);
    normal_distribution<double> dist_in(0.0, 1.0);
    for (int i = 0; i < H; ++i) x[i] = dist_in(rng);

    auto mv = mean_var(x);
    cout << "Layer 0 (input): mean=" << mv.first << ", var=" << mv.second << "\n";

    for (int l = 0; l < L; ++l) {
        matvec(weights[l], H, H, x, z); // pre-activation
        // Apply activation
        for (int i = 0; i < H; ++i) z[i] = act(z[i]);
        auto st = mean_var(z);
        cout << "Layer " << (l+1) << ": mean=" << st.first << ", var=" << st.second << "\n";
        x.swap(z);
    }
}

int main() {
    random_device rd; mt19937 rng(rd());
    int H = 512; int L = 10;

    cout << "--- ReLU with He (expected stable variance) ---\n";
    test_pipeline(H, L, relu, /*use_he_for_relu=*/true, rng);

    cout << "\n--- ReLU with Glorot (variance may shrink) ---\n";
    test_pipeline(H, L, relu, /*use_he_for_relu=*/false, rng);

    cout << "\n--- tanh with Glorot (expected stable near origin) ---\n";
    test_pipeline(H, L, tanh_act, /*use_he_for_relu=*/false, rng);

    return 0;
}

This simulation builds a deep stack of square dense layers and compares how activation variance evolves when using matching vs mismatched initializations. ReLU paired with He shows roughly stable variance; ReLU with Glorot tends to shrink; tanh with Glorot is typically stable near the origin.

Time: O(L × H^2), Space: O(L × H^2)
He initializer for 2D convolutional kernels with correct fan counts
#include <bits/stdc++.h>
using namespace std;

// Compute sample mean and variance
pair<double,double> mean_var(const vector<double>& v) {
    double mean = 0.0; double M2 = 0.0; size_t n = 0;
    for (double x : v) { n++; double d = x - mean; mean += d/n; M2 += d*(x-mean); }
    double var = (n>1) ? M2/(n-1) : 0.0; return {mean, var};
}

// He uniform for Conv2D weights: shape [out_channels, in_channels, k_h, k_w]
void he_uniform_conv2d(vector<double>& W, int out_c, int in_c, int k_h, int k_w, mt19937& rng) {
    long long fan_in = 1LL * in_c * k_h * k_w;
    double a = sqrt(6.0 / static_cast<double>(fan_in));
    uniform_real_distribution<double> dist(-a, a);
    for (double &w : W) w = dist(rng);
}

int main() {
    int out_c = 64, in_c = 3, k_h = 3, k_w = 3;
    size_t num_params = static_cast<size_t>(out_c) * in_c * k_h * k_w;
    vector<double> W(num_params);

    random_device rd; mt19937 rng(rd());
    he_uniform_conv2d(W, out_c, in_c, k_h, k_w, rng);

    auto [m, v] = mean_var(W);
    double fan_in = static_cast<double>(in_c * k_h * k_w);
    cout << "Conv2D He uniform: mean=" << m << ", var~=" << v
         << ", theory=" << 2.0 / fan_in << "\n";

    return 0;
}

This example initializes a 2D convolutional kernel tensor using He uniform. It computes fan_in = in_channels × k_h × k_w and matches the variance to 2/fan_in. The printed statistics should be close to theory for sufficiently large tensors.

Time: O(out_c × in_c × k_h × k_w), Space: O(out_c × in_c × k_h × k_w)
#xavier #glorot #he #kaiming #weight initialization #fan_in #fan_out #relu #tanh #variance preservation #neural networks #convolution #normal distribution #uniform distribution #gradient stability