Weight Initialization Strategies
Key Points
- Weight initialization sets the starting values of neural network parameters so signals and gradients neither explode nor vanish as they pass through layers.
- Xavier/Glorot initialization preserves variance for symmetric activations like tanh/sigmoid by using variance 2/(fan_in + fan_out).
- He/Kaiming initialization preserves variance for ReLU-family activations by using variance 2/fan_in to counteract ReLU's 50% zeroing.
- fan_in is the number of inputs to a neuron and fan_out is the number of outputs it feeds; both determine the correct scaling of initial weights.
- Uniform and normal versions exist; their ranges/standard deviations are chosen so the weight variance matches the target formula.
- Using an initialization mismatched to the activation (e.g., Xavier with ReLU) can lead to vanishing or exploding activations/gradients.
- For convolutional layers, fan_in and fan_out must include kernel size and channel counts (e.g., in_channels × k_h × k_w).
- Initialization cost is linear in the number of parameters, and memory use is dominated by storing the weights.
Prerequisites
- Basic probability and variance — Understanding how sums of random variables combine variances is central to deriving initialization scales.
- Linear algebra (vectors, matrices) — Layers compute z = W x + b; reasoning about dimensions and matrix operations requires matrix fundamentals.
- Neural network activation functions — Initialization depends on how the activation changes variance (e.g., ReLU vs tanh).
- Gradient-based optimization — Balanced initialization stabilizes gradient magnitudes, improving training with backpropagation.
- Random number generation — Practical implementation uses uniform/normal RNGs with correct parameters and seeding.
- Convolutional layers — Correctly computing fan_in/fan_out for conv kernels prevents scale errors in CNNs.
Detailed Explanation
01 Overview
Weight initialization strategies define how to choose the starting values of neural network parameters before training. Good initialization is crucial because neural networks are optimized with gradient-based methods that rely on stable signal propagation in both forward and backward passes. If initial weights are too large, activations and gradients can explode; if too small, they can vanish. Both issues slow learning or cause it to fail entirely. Two widely used strategies are Xavier (also called Glorot) initialization and He (also called Kaiming) initialization. Xavier aims to keep the variance of activations and gradients roughly constant across layers for activations with outputs centered around zero and without hard truncation (e.g., tanh, logistic sigmoid at moderate scales). He initialization modifies the scaling to account for ReLU-like activations that zero out approximately half of their inputs, thereby halving variance unless compensated. These strategies provide closed-form formulas for the variance of the weight distribution, expressed in terms of fan_in (number of inputs) and fan_out (number of outputs) of a layer. They can be implemented using either uniform or normal distributions by matching the distribution's variance to the target formula. The result is faster convergence, more stable training, and reduced sensitivity to architecture depth.
02 Intuition & Analogies
Imagine a network as a series of pipes carrying water (the signal). Each layer transforms and passes the flow onward. If the pipes suddenly narrow too much, flow dwindles to a trickle (vanishing activations/gradients). If they widen too much, water gushes uncontrollably (exploding activations/gradients). Proper weight initialization sets the initial pipe diameters so that flow is steady from the first layer to the last. Now think statistically: inputs to a neuron are many small random contributions (weights times activations). Because variances of independent terms add, the sum's variance scales with how many terms you add and with the variance of each term. A layer with fan_in inputs adds up fan_in contributions. If each contribution has variance Var(W) × Var(input), then the sum's variance is roughly fan_in × Var(W) × Var(input). To keep the signal's magnitude stable across layers, we want the output variance to match the input variance. That determines how big Var(W) should be. ReLU adds a twist: it clips negative values to zero. If pre-activations are roughly symmetric around zero, ReLU keeps only positive halves. On average, that halves the variance going forward. He initialization compensates by doubling the pre-activation variance via choosing Var(W) = 2/fan_in, so after the ReLU's halving, the variance ends up close to what it started as. For tanh/sigmoid, which are roughly symmetric and don't zero out half the signal near the origin, the balanced choice is Var(W) = 2/(fan_in + fan_out), which keeps both forward activations and backward gradients in check. In short: choose the weight scale so the layer neither amplifies nor attenuates the flow of information.
03 Formal Definition
Consider a fully connected layer z = W x + b with fan_in inputs and fan_out outputs, where the entries of W are drawn i.i.d. with zero mean. Xavier/Glorot initialization sets \operatorname{Var}(W) = 2/(\text{fan_in} + \text{fan_out}): the normal variant draws W \sim \mathcal{N}(0, 2/(\text{fan_in} + \text{fan_out})), and the uniform variant draws W \sim U[-a, a] with a = \sqrt{6/(\text{fan_in} + \text{fan_out})}. He/Kaiming initialization sets \operatorname{Var}(W) = 2/\text{fan_in}: the normal variant uses \sigma = \sqrt{2/\text{fan_in}}, and the uniform variant uses a = \sqrt{6/\text{fan_in}}. Biases are typically initialized to zero.
04 When to Use
Use Xavier/Glorot initialization when activations are approximately symmetric around zero and do not truncate half the distribution: tanh, linear, or modestly scaled sigmoid (often with batch normalization). It balances forward and backward variance through the term (fan_in + fan_out), which is helpful for deep fully connected networks with smooth nonlinearities. Use He/Kaiming initialization when activations are ReLU-like: ReLU, leaky ReLU, ELU, GELU (often treated similarly). These activations set many outputs to zero, so the variance needs to be larger to compensate; He initialization does exactly that with 2/fan_in. For leaky ReLU, use He with the gain factor g = \sqrt{2/(1+\alpha^2)}; for SELU, prefer LeCun normal (\operatorname{Var}(W) = 1/\text{fan_in}) to match its self-normalizing property. In convolutional networks, compute fan_in/fan_out carefully using kernel dimensions and channels. If batch normalization is present, initialization becomes slightly less critical but still matters for early training stability and speed. When experimenting with new activations, derive a gain that keeps \operatorname{Var}(a_l) roughly constant by analyzing E[\phi(z)] and E[\phi(z)^2] under z \sim \mathcal{N}(0, v).
⚠️ Common Mistakes
- Mismatched activation and initializer: Using Xavier with ReLU can cause shrinking activations; using He with tanh can over-amplify. Always align the initializer with the nonlinearity's variance effect.
- Ignoring fan counts for convolutions: Forgetting to multiply by kernel area (k_h × k_w) leads to dramatically wrong scales. For 3D/1D convs, use the product of all spatial kernel sizes.
- Confusing uniform and normal parameters: For uniform U[-a, a], the variance is a^2/3, not a^2. For normal, the standard deviation is \sigma, and the variance is \sigma^2. Match these to the target Var(W).
- Forgetting bias initialization: Biases are often initialized to zero or small constants; applying the same scaling as weights can introduce unintended shifts.
- Using integer RNGs or poor seeding: Weights should be floating-point, drawn from high-quality PRNGs. Seed deterministically for reproducibility when needed.
- Ignoring gradient-side considerations: Xavier balances forward and backward; choosing 2/fan_in without a ReLU-like nonlinearity may stabilize the forward pass but destabilize gradients.
- Reusing initializations after shape changes: If you change layer widths or kernel sizes, recompute fan_in/fan_out and reinitialize.
- Overlooking normalization layers: BatchNorm can mask poor initialization but not fix extreme scales; still choose a principled initializer.
Key Formulas
Uniform Variance
\operatorname{Var}(U[-a, a]) = a^2/3
Explanation: The variance of a uniform distribution from -a to a equals a^2 divided by 3. This lets us match a uniform initializer's variance to a target value by solving for a.
Glorot Normal Stddev
\sigma = \sqrt{2/(\text{fan_in} + \text{fan_out})}
Explanation: For Xavier/Glorot normal, choose a zero-mean normal with standard deviation \sqrt{2/(\text{fan_in} + \text{fan_out})}. This preserves variance for tanh/sigmoid-like activations.
Glorot Uniform Bound
a = \sqrt{6/(\text{fan_in} + \text{fan_out})}
Explanation: For Xavier/Glorot uniform, draw from U[-a, a] with a = \sqrt{6/(\text{fan_in} + \text{fan_out})}. This yields variance 2/(fan_in + fan_out).
He Normal Stddev
\sigma = \sqrt{2/\text{fan_in}}
Explanation: For He/Kaiming normal, choose a zero-mean normal with standard deviation \sqrt{2/\text{fan_in}}. This compensates for ReLU halving the variance.
He Uniform Bound
a = \sqrt{6/\text{fan_in}}
Explanation: For He/Kaiming uniform, draw from U[-a, a] with a = \sqrt{6/\text{fan_in}}. The uniform's variance a^2/3 then equals 2/fan_in.
Pre-activation Variance
\operatorname{Var}(z) = \text{fan_in} \cdot \operatorname{Var}(W) \cdot \operatorname{Var}(x)
Explanation: Assuming independence and zero mean, each pre-activation component accumulates fan_in independent contributions. This relationship motivates how we scale Var(W).
ReLU Variance Halving
E[\operatorname{ReLU}(z)^2] = v/2 \text{ for } z \sim \mathcal{N}(0, v)
Explanation: Under zero-mean Gaussian inputs, ReLU keeps only positive values, reducing the second moment by half on average. He initialization compensates for this effect.
Glorot Target Variance
\operatorname{Var}(W) = 2/(\text{fan_in} + \text{fan_out})
Explanation: This target variance balances forward and backward signal magnitudes across layers for symmetric activations.
He Target Variance
\operatorname{Var}(W) = 2/\text{fan_in}
Explanation: This target variance preserves forward activation variance for ReLU-like activations that zero out about half the inputs.
Leaky ReLU Gain
g = \sqrt{2/(1+\alpha^2)}
Explanation: The gain adjusts the initializer for leaky ReLU with negative slope \alpha. Use \sigma = g/\sqrt{\text{fan_in}} for normal, or set uniform bounds accordingly.
2D Convolution Fans
\text{fan_in} = \text{in_channels} \cdot k_h \cdot k_w, \quad \text{fan_out} = \text{out_channels} \cdot k_h \cdot k_w
Explanation: For 2D convolutions, include both channel counts and kernel area when computing fan_in and fan_out before applying Glorot/He formulas.
Complexity Analysis
Initialization visits each parameter exactly once: one RNG draw plus a constant amount of arithmetic per weight, so time is O(P) for P parameters (P = fan_in × fan_out for a dense layer; P = out_channels × in_channels × k_h × k_w for a 2D conv layer). Memory is likewise O(P), dominated by storing the weights themselves.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute sample mean and variance of a vector (Welford's algorithm)
pair<double,double> mean_var(const vector<double>& v) {
    double mean = 0.0; double M2 = 0.0; size_t n = 0;
    for (double x : v) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        double delta2 = x - mean;
        M2 += delta * delta2;
    }
    double var = (n > 1) ? M2 / (n - 1) : 0.0; // unbiased estimate
    return {mean, var};
}

// Initialize W with Glorot (Xavier) uniform: U[-a, a], a = sqrt(6/(fan_in+fan_out))
void glorot_uniform(vector<double>& W, int fan_in, int fan_out, mt19937& rng) {
    double a = sqrt(6.0 / (fan_in + fan_out));
    uniform_real_distribution<double> dist(-a, a);
    for (double &w : W) w = dist(rng);
}

// Initialize W with Glorot (Xavier) normal: N(0, sigma^2), sigma = sqrt(2/(fan_in+fan_out))
void glorot_normal(vector<double>& W, int fan_in, int fan_out, mt19937& rng) {
    double sigma = sqrt(2.0 / (fan_in + fan_out));
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}

// Initialize W with He (Kaiming) uniform: U[-a, a], a = sqrt(6/fan_in)
void he_uniform(vector<double>& W, int fan_in, mt19937& rng) {
    double a = sqrt(6.0 / fan_in);
    uniform_real_distribution<double> dist(-a, a);
    for (double &w : W) w = dist(rng);
}

// Initialize W with He (Kaiming) normal: N(0, sigma^2), sigma = sqrt(2/fan_in)
void he_normal(vector<double>& W, int fan_in, mt19937& rng) {
    double sigma = sqrt(2.0 / fan_in);
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}

int main() {
    // Example: Dense layer 100 inputs -> 100 outputs
    int fan_in = 100, fan_out = 100;
    size_t num_params = static_cast<size_t>(fan_in) * fan_out;

    // RNG
    random_device rd;
    mt19937 rng(rd());

    // Weight buffer
    vector<double> W(num_params);

    // Glorot uniform
    glorot_uniform(W, fan_in, fan_out, rng);
    auto [m1, v1] = mean_var(W);
    cout << "Glorot uniform: mean=" << m1 << ", var~=" << v1
         << " (theory=" << 2.0 / (fan_in + fan_out) << ")\n";

    // Glorot normal
    glorot_normal(W, fan_in, fan_out, rng);
    auto [m2, v2] = mean_var(W);
    cout << "Glorot normal: mean=" << m2 << ", var~=" << v2
         << " (theory=" << 2.0 / (fan_in + fan_out) << ")\n";

    // He uniform
    he_uniform(W, fan_in, rng);
    auto [m3, v3] = mean_var(W);
    cout << "He uniform: mean=" << m3 << ", var~=" << v3
         << " (theory=" << 2.0 / fan_in << ")\n";

    // He normal
    he_normal(W, fan_in, rng);
    auto [m4, v4] = mean_var(W);
    cout << "He normal: mean=" << m4 << ", var~=" << v4
         << " (theory=" << 2.0 / fan_in << ")\n";

    return 0;
}
```
This program implements Glorot and He initializers (both uniform and normal). It initializes a 100×100 dense layer four times and prints the sample mean/variance against the theoretical target. Minor deviations are expected due to finite sampling.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple activations
inline double relu(double x) { return x > 0.0 ? x : 0.0; }
inline double tanh_act(double x) { return tanh(x); }

typedef function<double(double)> ActFn;

// Matrix-vector multiply: y = W x (W is rows x cols)
void matvec(const vector<double>& W, int rows, int cols, const vector<double>& x, vector<double>& y) {
    fill(y.begin(), y.end(), 0.0);
    for (int r = 0; r < rows; ++r) {
        double sum = 0.0;
        const double* wrow = &W[static_cast<size_t>(r) * cols];
        for (int c = 0; c < cols; ++c) sum += wrow[c] * x[c];
        y[r] = sum;
    }
}

// Initializers
void glorot_normal(vector<double>& W, int fan_in, int fan_out, mt19937& rng) {
    double sigma = sqrt(2.0 / (fan_in + fan_out));
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}
void he_normal(vector<double>& W, int fan_in, mt19937& rng) {
    double sigma = sqrt(2.0 / fan_in);
    normal_distribution<double> dist(0.0, sigma);
    for (double &w : W) w = dist(rng);
}

pair<double,double> mean_var(const vector<double>& v) {
    double mean = 0.0; double M2 = 0.0; size_t n = 0;
    for (double x : v) { n++; double d = x - mean; mean += d/n; M2 += d*(x-mean); }
    double var = (n>1) ? M2/(n-1) : 0.0;
    return {mean, var};
}

// Build an L-layer MLP with width H and test forward variance stability
void test_pipeline(int H, int L, ActFn act, bool use_he_for_relu, mt19937& rng) {
    vector<vector<double>> weights(L);
    // Square layers HxH for simplicity
    for (int l = 0; l < L; ++l) {
        weights[l].resize(static_cast<size_t>(H) * H);
        if (use_he_for_relu) he_normal(weights[l], H, rng);
        else glorot_normal(weights[l], H, H, rng);
    }

    // Random zero-mean unit-variance input
    vector<double> x(H), z(H);
    normal_distribution<double> dist_in(0.0, 1.0);
    for (int i = 0; i < H; ++i) x[i] = dist_in(rng);

    auto mv = mean_var(x);
    cout << "Layer 0 (input): mean=" << mv.first << ", var=" << mv.second << "\n";

    for (int l = 0; l < L; ++l) {
        matvec(weights[l], H, H, x, z); // pre-activation
        for (int i = 0; i < H; ++i) z[i] = act(z[i]); // apply activation
        auto st = mean_var(z);
        cout << "Layer " << (l+1) << ": mean=" << st.first << ", var=" << st.second << "\n";
        x.swap(z);
    }
}

int main() {
    random_device rd; mt19937 rng(rd());
    int H = 512; int L = 10;

    cout << "--- ReLU with He (expected stable variance) ---\n";
    test_pipeline(H, L, relu, /*use_he_for_relu=*/true, rng);

    cout << "\n--- ReLU with Glorot (variance may shrink) ---\n";
    test_pipeline(H, L, relu, /*use_he_for_relu=*/false, rng);

    cout << "\n--- tanh with Glorot (expected stable near origin) ---\n";
    test_pipeline(H, L, tanh_act, /*use_he_for_relu=*/false, rng);

    return 0;
}
```
This simulation builds a deep stack of square dense layers and compares how activation variance evolves when using matching vs mismatched initializations. ReLU paired with He shows roughly stable variance; ReLU with Glorot tends to shrink; tanh with Glorot is typically stable near the origin.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Compute sample mean and variance
pair<double,double> mean_var(const vector<double>& v) {
    double mean = 0.0; double M2 = 0.0; size_t n = 0;
    for (double x : v) { n++; double d = x - mean; mean += d/n; M2 += d*(x-mean); }
    double var = (n>1) ? M2/(n-1) : 0.0;
    return {mean, var};
}

// He uniform for Conv2D weights: shape [out_channels, in_channels, k_h, k_w]
void he_uniform_conv2d(vector<double>& W, int out_c, int in_c, int k_h, int k_w, mt19937& rng) {
    long long fan_in = 1LL * in_c * k_h * k_w;
    double a = sqrt(6.0 / static_cast<double>(fan_in));
    uniform_real_distribution<double> dist(-a, a);
    for (double &w : W) w = dist(rng);
}

int main() {
    int out_c = 64, in_c = 3, k_h = 3, k_w = 3;
    size_t num_params = static_cast<size_t>(out_c) * in_c * k_h * k_w;
    vector<double> W(num_params);

    random_device rd; mt19937 rng(rd());
    he_uniform_conv2d(W, out_c, in_c, k_h, k_w, rng);

    auto [m, v] = mean_var(W);
    double fan_in = static_cast<double>(in_c * k_h * k_w);
    cout << "Conv2D He uniform: mean=" << m << ", var~=" << v
         << ", theory=" << 2.0 / fan_in << "\n";

    return 0;
}
```
This example initializes a 2D convolutional kernel tensor using He uniform. It computes fan_in = in_channels × k_h × k_w and matches the variance to 2/fan_in. The printed statistics should be close to theory for sufficiently large tensors.