Spectral Normalization
Key Points
- Spectral normalization rescales a weight matrix so its largest singular value (spectral norm) is at most a target value, typically 1.
- It controls a layer's Lipschitz constant, helping stabilize training and prevent exploding activations or gradients.
- The spectral norm equals the maximum stretch a matrix applies to any vector, and it is the largest singular value of the matrix.
- A fast way to approximate it is power iteration, which alternates multiplying by W and W^T and normalizing the resulting vectors.
- In practice you keep a running estimate of the top singular vectors so the per-step update stays cheap during training.
- For convolutional kernels, reshape the 4D tensor to a 2D matrix (out_channels by in_channels × kernel_height × kernel_width) before normalizing; for example, a 64×128×3×3 kernel becomes a 64×1152 matrix.
- Spectral normalization differs from weight normalization and batch normalization; it constrains the operator norm, not per-weight magnitude or per-batch statistics.
- The time cost per update with k power iterations is O(kmn) for an m×n matrix, which is much cheaper than a full SVD.
Prerequisites
- Vectors and matrices — Understanding matrix–vector multiplication and basic linear operations is essential for spectral norms and power iteration.
- Matrix and vector norms — Spectral normalization constrains an operator norm; knowing different norms clarifies why the spectral norm specifically bounds worst-case stretch.
- Singular Value Decomposition (SVD) — The spectral norm equals the largest singular value; SVD provides the theoretical foundation and properties of singular vectors.
- Numerical linear algebra basics — Power iteration, convergence, and numerical stability require familiarity with iterative algorithms and floating-point issues.
- C++ programming basics — Implementing and using spectral normalization requires competence with arrays, loops, and simple class structures.
- Neural network layers and training — Motivation and correct placement of spectral normalization depend on how layers compose and how training updates weights.
Detailed Explanation
01 Overview
Spectral normalization is a technique that rescales a neural network layer’s weight matrix so that its spectral norm—the largest singular value—does not exceed a chosen bound (often 1). Intuitively, the spectral norm measures the maximum factor by which the matrix can stretch any input vector. By bounding this stretch, spectral normalization controls the Lipschitz constant of the layer, which in turn helps stabilize optimization, prevents exploding activations or gradients, and often improves generalization. Originally popularized for stabilizing Generative Adversarial Networks (GANs), spectral normalization applies broadly to deep networks that benefit from controlled layer sensitivity. A direct computation of the spectral norm via singular value decomposition (SVD) can be expensive, so practitioners typically approximate it efficiently with a few steps of power iteration, which only requires matrix–vector multiplications with W and its transpose W^T. The weight matrix is then divided by the estimated largest singular value to enforce the bound. This approach differs from techniques like weight decay or batch normalization. Weight decay penalizes sum-of-squares of parameters, and batch normalization normalizes activations using batch statistics, whereas spectral normalization directly constrains the operator norm of the linear transformation itself. The result is a principled cap on how much a layer can amplify inputs, yielding more predictable network behavior.
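As a small worked example with illustrative numbers, consider a diagonal 2×2 weight matrix: its spectral norm is simply its largest diagonal entry, and normalization divides every entry by that value.

W = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \sigma_{\max}(W) = 4, \qquad W_{\mathrm{SN}} = \frac{W}{\sigma_{\max}(W)} = \begin{pmatrix} 1 & 0 \\ 0 & 0.25 \end{pmatrix}

Before normalization, inputs along the first axis are amplified by a factor of 4; afterward no direction is stretched by more than 1, while the relative shape of the transformation (a 4:1 ratio of singular values) is preserved.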
02 Intuition & Analogies
Imagine a stretchy fabric grid drawn on a table. Each point on the fabric has coordinates (like a vector). When you apply a linear transformation (your weight matrix), you’re grabbing the fabric and pulling, twisting, and possibly skewing it. Some directions might barely move, while one special direction might get stretched the most. That maximum stretch factor is the spectral norm. If you let that stretch grow unchecked, tiny imperfections or noise can be amplified enormously, causing instability. Spectral normalization is like putting a rule on how much you’re allowed to stretch the fabric in any direction—say, no more than 1. The fabric can still rotate, flip, or slightly stretch, but there’s a hard limit on the worst-case expansion. In a neural network, this limit translates into a bound on how sensitive a layer’s output can be to changes in its input. The trick to finding how stretchy the fabric is in the worst direction is power iteration. Think of repeatedly pushing a stick (a vector) through the fabric transformation and straightening it. Each push aligns the stick closer to the most-stretched direction. After a few pushes, you also measure how much longer the stick gets—this approximates the largest singular value. Once you know that number, you simply scale the transformation down so the worst-case stretch is at your desired cap. By keeping that cap tight across layers, the whole network behaves more predictably, like a machine where every gear’s torque is bounded so the system can’t suddenly lurch out of control.
03 Formal Definition
Let W \in \mathbb{R}^{m \times n} be a layer's weight matrix. Its spectral norm is
\sigma(W) = \|W\|_2 = \max_{x \neq 0} \frac{\|Wx\|_2}{\|x\|_2} = \sigma_{\max}(W),
i.e., the largest singular value of W. Spectral normalization replaces the weights by
W_{\mathrm{SN}} = \frac{W}{\sigma(W)},
or, when a target bound c is used, by W / \max(1, \sigma(W)/c), so that \|W_{\mathrm{SN}}\|_2 \le 1 (respectively \le c). A linear map with spectral norm at most 1 is 1-Lipschitz, and composing such layers with 1-Lipschitz activations (e.g., ReLU) keeps the whole network 1-Lipschitz, which is exactly the bound spectral normalization is designed to enforce.
04 When to Use
Use spectral normalization when you need to explicitly control the Lipschitz constant of layers. This is particularly helpful in: (1) adversarial settings like GANs, where the discriminator benefits from stable gradients; (2) scenarios with potential gradient explosion (very deep networks or recurrent structures); (3) robust learning where bounded sensitivity to input perturbations is desired; and (4) theoretical contexts where proving generalization or robustness requires Lipschitz bounds. It is also useful when batch statistics are unreliable (tiny batches or highly non-stationary data) and batch normalization may hurt more than help. In such cases, spectral normalization provides a data-independent, parameter-level constraint that does not depend on batch moments. For convolutional layers, reshape the kernel to 2D and apply the same procedure; this keeps the per-layer Lipschitz bounds interpretable and consistent with fully connected layers. Avoid it if: you rely on exact parameter magnitudes for other regularizers that might conflict with rescaling; or your model already operates near the edge of capacity and the constraint would overly limit expressiveness. In some tasks, alternative regularizers (e.g., weight decay or orthogonal regularization) may suffice with less computational overhead.
⚠️ Common Mistakes
- Confusing norms: the Frobenius norm \|W\|_F is not the spectral norm \|W\|_2. Minimizing or constraining \|W\|_F does not cap the worst-case expansion; always use the largest singular value for spectral normalization (see the worked comparison after this list).
- Too few power iterations: using only one iteration with a poor initialization can badly underestimate \sigma_{\max}, causing under-normalization and instability. Persist the singular vector estimate across steps and use 1–3 iterations per update after a warm start.
- Forgetting to reshape conv kernels: spectral normalization for convolutions requires flattening to a 2D matrix of shape (out_channels, in_channels \times k_H \times k_W). Applying per-filter or per-weight normalization is not equivalent.
- Not handling near-zero vectors: numerical issues arise when \|Wv\| or \|W^{\top}u\| is extremely small. Add small epsilons during normalization to avoid division by zero and reinitialize vectors if they collapse.
- Rescaling biases or activations incorrectly: only the weight matrix needs spectral normalization. Do not scale the bias term or post-activation outputs separately as part of the normalization step.
- Ignoring update frequency: updating the spectral estimate too infrequently can let the actual norm drift above the target; updating too frequently wastes compute. Balance by running a few power-iteration steps per parameter update with a persistent u (or v).
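To make the first pitfall concrete, here is a small comparison with illustrative numbers:

W = \begin{pmatrix} 3 & 0 \\ 0 & 4 \end{pmatrix}, \qquad \|W\|_F = \sqrt{3^2 + 4^2} = 5, \qquad \|W\|_2 = \max(3, 4) = 4

A Frobenius-norm constraint such as \|W\|_F \le 5 is already satisfied here even though the matrix stretches some inputs by a factor of 4; only the spectral norm reports the worst-case amplification, and dividing W by 4 is what caps that amplification at 1.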
Key Formulas
Operator Norm (Euclidean)
\|W\|_2 = \max_{x \neq 0} \frac{\|Wx\|_2}{\|x\|_2}
Explanation: This defines the spectral norm as the maximum amplification factor of any input vector under W. It directly measures worst-case stretch.
Spectral Norm as Largest Singular Value
\|W\|_2 = \sigma_{\max}(W)
Explanation: The spectral norm equals the largest singular value from the SVD of W. This is the standard way to compute or reference it.
SVD
W = U \Sigma V^{\top}
Explanation: Any real matrix factors into orthogonal matrices U and V and a diagonal matrix Σ of singular values. The largest entry of Σ is the spectral norm.
Power Iteration Updates
v \leftarrow \frac{W^{\top} u}{\|W^{\top} u\|_2}, \qquad u \leftarrow \frac{W v}{\|W v\|_2}
Explanation: Alternating multiplications by W^{\top} and W with normalization steer u and v toward the top singular vectors. The Rayleigh quotient u^{\top} W v then estimates the top singular value.
Rayleigh Quotient for Singular Value
\sigma_{\max}(W) \approx u^{\top} W v
Explanation: Given approximately aligned u and v, this inner product gives an estimate of the largest singular value. It converges as power iteration proceeds.
Projection to Spectral-Norm Ball
W_{\mathrm{SN}} = \frac{W}{\max\left(1, \, \sigma_{\max}(W)/c\right)}
Explanation: Rescales W so its spectral norm does not exceed c. With c = 1, it reduces to dividing by the estimated top singular value when needed.
Network Lipschitz Bound
\mathrm{Lip}(f) \le \prod_{l=1}^{L} \|W_l\|_2
Explanation: The Lipschitz constant of a composition of linear layers and 1-Lipschitz activations is bounded by the product of their spectral norms. Spectral normalization caps each factor.
SVD Time Complexity
O\left(mn \min(m, n)\right)
Explanation: Exact SVD is generally cubic in the smaller dimension. This is often too slow for per-step training updates.
Power Iteration Time Complexity
O(kmn)
Explanation: Each of the k iterations performs two matrix–vector multiplies (W v and W^{\top} u). This is far cheaper than SVD when k is small.
Convergence Factor
\frac{\sigma_2}{\sigma_1}
Explanation: The ratio between the top two singular values controls the convergence speed of power iteration; the error after k iterations shrinks roughly like (\sigma_2/\sigma_1)^{2k}. Smaller ratios yield faster convergence.
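To see the power-iteration and Rayleigh-quotient formulas in action, here is a hand-computed trace on a tiny made-up matrix (numbers are purely illustrative):

W = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}, \qquad u_0 = \tfrac{1}{\sqrt{2}}(1, 1)^{\top}
v_1 = \frac{W^{\top} u_0}{\|W^{\top} u_0\|_2} = \tfrac{1}{\sqrt{10}}(3, 1)^{\top}, \qquad u_1 = \frac{W v_1}{\|W v_1\|_2} = \tfrac{1}{\sqrt{82}}(9, 1)^{\top}
\sigma \approx u_1^{\top} W v_1 = \frac{82}{\sqrt{820}} \approx 2.86

After a single iteration the estimate is already close to the true value \sigma_{\max} = 3, and with \sigma_2/\sigma_1 = 1/3 each further iteration shrinks the remaining error by roughly an order of magnitude.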
Complexity Analysis
Computing the spectral norm exactly via SVD costs O(mn·min(m, n)) for an m×n weight matrix, which is too expensive to repeat at every training step. Power iteration with k steps costs O(kmn), since each step performs one multiplication by W and one by W^T; with a persistent estimate of the top singular vectors, k = 1–3 per update is usually sufficient. The rescaling step W ← W/σ touches every entry once, adding O(mn). The extra memory is negligible: one persisted vector u of length m and a temporary v of length n.
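As a rough, order-of-magnitude illustration (the layer size is hypothetical), compare the two costs for a square 1024 × 1024 weight matrix:

\text{SVD: } O(mn\min(m,n)) = O(1024^3) \approx 10^9 \text{ operations}, \qquad \text{power iteration } (k = 1): O(kmn) = O(1024^2) \approx 10^6 \text{ operations}

That is roughly three orders of magnitude per update, which is why a few warm-started power-iteration steps are the standard choice during training.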
Code Examples
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    size_t rows, cols;
    vector<double> a; // row-major data
    Matrix(size_t r=0, size_t c=0): rows(r), cols(c), a(r*c, 0.0) {}
    double &operator()(size_t i, size_t j) { return a[i*cols + j]; }
    double operator()(size_t i, size_t j) const { return a[i*cols + j]; }
};

static double l2norm(const vector<double>& v) {
    double s = 0.0; for (double x : v) s += x*x; return sqrt(max(0.0, s));
}

static void normalize(vector<double>& v, double eps=1e-12) {
    double n = l2norm(v);
    if (n < eps) {
        // Reinitialize to a unit basis vector to avoid stagnation
        fill(v.begin(), v.end(), 0.0);
        if (!v.empty()) v[0] = 1.0;
    } else {
        for (double &x : v) x /= n;
    }
}

static void matvec(const Matrix& W, const vector<double>& x, vector<double>& y) {
    // y = W x
    size_t m = W.rows, n = W.cols;
    y.assign(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        const double* row = &W.a[i*n];
        for (size_t j = 0; j < n; ++j) s += row[j] * x[j];
        y[i] = s;
    }
}

static void matTvec(const Matrix& W, const vector<double>& x, vector<double>& y) {
    // y = W^T x
    size_t m = W.rows, n = W.cols;
    y.assign(n, 0.0);
    for (size_t i = 0; i < m; ++i) {
        const double* row = &W.a[i*n];
        double xi = x[i];
        for (size_t j = 0; j < n; ++j) y[j] += row[j] * xi;
    }
}

// Power iteration to approximate the largest singular value and vectors
struct PowerIterState {
    vector<double> u; // approximates left singular vector (size m)
};

static double spectral_norm_power_iter(const Matrix& W, PowerIterState& state, int iters=5, double eps=1e-12) {
    size_t m = W.rows, n = W.cols;
    if (state.u.size() != m) {
        state.u.assign(m, 0.0);
        // Random unit vector initialization
        std::mt19937 rng(42);
        std::normal_distribution<double> nd(0.0, 1.0);
        for (size_t i = 0; i < m; ++i) state.u[i] = nd(rng);
        normalize(state.u, eps);
    }
    vector<double> v(n, 0.0), Wv(m, 0.0);
    vector<double>& u = state.u;
    for (int t = 0; t < iters; ++t) {
        // v <- normalize(W^T u)
        matTvec(W, u, v);
        normalize(v, eps);
        // u <- normalize(W v)
        matvec(W, v, u);
        normalize(u, eps);
    }
    // Rayleigh quotient: sigma ≈ u^T W v
    matvec(W, v, Wv);
    double sigma = 0.0;
    for (size_t i = 0; i < m; ++i) sigma += u[i] * Wv[i];
    return fabs(sigma);
}

int main() {
    // Example: random 5x3 matrix
    size_t m = 5, n = 3;
    Matrix W(m, n);
    std::mt19937 rng(123);
    std::normal_distribution<double> nd(0.0, 1.0);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            W(i,j) = nd(rng);

    PowerIterState st; // persists u across calls
    for (int round = 0; round < 3; ++round) {
        double sigma = spectral_norm_power_iter(W, st, /*iters=*/10);
        cout << "Approximate spectral norm (round " << round << ") = " << sigma << "\n";
    }
}
This program defines a simple dense Matrix type and computes an approximation of the largest singular value via power iteration. It maintains a persistent left singular vector estimate u to accelerate convergence across repeated calls. The Rayleigh quotient u^T W v yields the spectral norm estimate.
#include <bits/stdc++.h>
using namespace std;

struct Matrix { // same as previous example (minimal redefinition for self-contained code)
    size_t rows, cols; vector<double> a;
    Matrix(size_t r=0, size_t c=0): rows(r), cols(c), a(r*c, 0.0) {}
    double &operator()(size_t i, size_t j) { return a[i*cols + j]; }
    double operator()(size_t i, size_t j) const { return a[i*cols + j]; }
};

static void matvec(const Matrix& W, const vector<double>& x, vector<double>& y) {
    size_t m=W.rows, n=W.cols; y.assign(m,0.0);
    for (size_t i=0;i<m;++i){ double s=0.0; const double* row=&W.a[i*n]; for(size_t j=0;j<n;++j) s+=row[j]*x[j]; y[i]=s; }
}
static void matTvec(const Matrix& W, const vector<double>& x, vector<double>& y) {
    size_t m=W.rows, n=W.cols; y.assign(n,0.0);
    for (size_t i=0;i<m;++i){ const double* row=&W.a[i*n]; double xi=x[i]; for(size_t j=0;j<n;++j) y[j]+=row[j]*xi; }
}
static double l2norm(const vector<double>& v){ double s=0; for(double x:v) s+=x*x; return sqrt(max(0.0,s)); }
static void normalize(vector<double>& v, double eps=1e-12){ double n=l2norm(v); if(n<eps){ fill(v.begin(),v.end(),0.0); if(!v.empty()) v[0]=1.0; } else for(double &x:v) x/=n; }

struct PowerIterState { vector<double> u; };

static double spectral_norm_power_iter(const Matrix& W, PowerIterState& st, int iters=1, double eps=1e-12){
    size_t m=W.rows, n=W.cols;
    if (st.u.size()!=m){ st.u.assign(m,0.0); mt19937 rng(7); normal_distribution<double> nd(0.0,1.0); for(size_t i=0;i<m;++i) st.u[i]=nd(rng); normalize(st.u,eps); }
    vector<double> v(n,0.0), u=st.u, tmp(m,0.0);
    for(int t=0;t<iters;++t){ matTvec(W,u,v); normalize(v,eps); matvec(W,v,u); normalize(u,eps); }
    st.u = u; // save u back to state
    matvec(W,v,tmp);
    double sigma=0.0; for(size_t i=0;i<m;++i) sigma+=u[i]*tmp[i];
    return fabs(sigma);
}

struct LinearSN {
    Matrix W;          // weights (out_features x in_features)
    vector<double> b;  // bias (out_features)
    double c = 1.0;    // target spectral norm bound
    PowerIterState st; // persistent state for power iteration

    LinearSN(size_t out_f, size_t in_f): W(out_f, in_f), b(out_f, 0.0) {}

    void apply_spectral_norm(int iters=1){
        double sigma = spectral_norm_power_iter(W, st, iters);
        double scale = (sigma > c ? sigma / c : 1.0);
        if (scale > 1.0){
            for (double &w : W.a) w /= scale; // rescale W so that ||W||_2 <= c
        }
    }

    vector<double> forward(const vector<double>& x){
        vector<double> y; matvec(W, x, y);
        for (size_t i=0;i<y.size();++i) y[i] += b[i]; // bias is NOT scaled by spectral norm
        return y;
    }
};

int main(){
    // Create a layer with 4 outputs and 3 inputs
    LinearSN layer(4,3);
    mt19937 rng(123); normal_distribution<double> nd(0.0,1.0);
    for (size_t i=0;i<layer.W.rows;++i)
        for (size_t j=0;j<layer.W.cols;++j)
            layer.W(i,j) = 3.0 * nd(rng); // intentionally large scale
    for (double &bi : layer.b) bi = 0.1 * nd(rng);

    // Warm up: a few iterations to get a good singular vector estimate
    layer.apply_spectral_norm(/*iters=*/5);

    // After warm start, do cheap per-step updates
    for (int step=0; step<3; ++step){
        layer.apply_spectral_norm(/*iters=*/1); // keep ||W||_2 <= 1
        // Forward on a sample input
        vector<double> x = {1.0, -2.0, 0.5};
        vector<double> y = layer.forward(x);
        cout << "Step " << step << ": output = ";
        for (double yi : y) cout << fixed << setprecision(4) << yi << ' ';
        cout << "\n";
    }
}
This example implements a linear layer that maintains a spectral norm constraint using a persistent power-iteration state. A warm start uses more iterations to align the singular vectors; subsequent updates use just one iteration, which is sufficient to maintain ||W||_2 ≤ c during training. The bias is not scaled.
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    size_t rows, cols; vector<double> a;
    Matrix(size_t r=0, size_t c=0): rows(r), cols(c), a(r*c, 0.0) {}
    double &operator()(size_t i, size_t j){ return a[i*cols+j]; }
    double operator()(size_t i, size_t j) const { return a[i*cols+j]; }
};
static void matvec(const Matrix& W, const vector<double>& x, vector<double>& y){
    size_t m=W.rows, n=W.cols; y.assign(m,0.0);
    for(size_t i=0;i<m;++i){ double s=0; const double* row=&W.a[i*n]; for(size_t j=0;j<n;++j) s+=row[j]*x[j]; y[i]=s; }
}
static void matTvec(const Matrix& W, const vector<double>& x, vector<double>& y){
    size_t m=W.rows, n=W.cols; y.assign(n,0.0);
    for(size_t i=0;i<m;++i){ const double* row=&W.a[i*n]; double xi=x[i]; for(size_t j=0;j<n;++j) y[j]+=row[j]*xi; }
}
static double l2norm(const vector<double>& v){ double s=0; for(double x:v) s+=x*x; return sqrt(max(0.0,s)); }
static void normalize(vector<double>& v, double eps=1e-12){ double n=l2norm(v); if(n<eps){ fill(v.begin(),v.end(),0.0); if(!v.empty()) v[0]=1.0; } else for(double &x:v) x/=n; }
struct PowerIterState{ vector<double> u; };
static double spectral_norm_power_iter(const Matrix& W, PowerIterState& st, int iters=1, double eps=1e-12){
    size_t m=W.rows, n=W.cols;
    if(st.u.size()!=m){ st.u.assign(m,0.0); mt19937 rng(17); normal_distribution<double> nd(0.0,1.0); for(size_t i=0;i<m;++i) st.u[i]=nd(rng); normalize(st.u,eps); }
    vector<double> v(n,0.0), u=st.u, tmp(m,0.0);
    for(int t=0;t<iters;++t){ matTvec(W,u,v); normalize(v,eps); matvec(W,v,u); normalize(u,eps); }
    st.u=u;
    matvec(W,v,tmp);
    double sigma=0.0; for(size_t i=0;i<m;++i) sigma+=u[i]*tmp[i];
    return fabs(sigma);
}

// Flatten conv kernel (out_c, in_c, kH, kW) -> Matrix(out_c, in_c*kH*kW)
static Matrix flatten_conv_kernel(const vector<double>& kernel, size_t out_c, size_t in_c, size_t kH, size_t kW){
    size_t rows = out_c; size_t cols = in_c * kH * kW;
    Matrix W(rows, cols);
    // kernel layout assumed: [out_c][in_c][kH][kW] in row-major order
    size_t idx = 0;
    for (size_t oc = 0; oc < out_c; ++oc){
        for (size_t ic = 0; ic < in_c; ++ic){
            for (size_t kh = 0; kh < kH; ++kh){
                for (size_t kw = 0; kw < kW; ++kw){
                    size_t col = ic*kH*kW + kh*kW + kw;
                    W(oc, col) = kernel[idx++];
                }
            }
        }
    }
    return W;
}

static void assign_back_conv_kernel(vector<double>& kernel, const Matrix& W, size_t out_c, size_t in_c, size_t kH, size_t kW){
    size_t idx = 0;
    for (size_t oc = 0; oc < out_c; ++oc){
        for (size_t ic = 0; ic < in_c; ++ic){
            for (size_t kh = 0; kh < kH; ++kh){
                for (size_t kw = 0; kw < kW; ++kw){
                    size_t col = ic*kH*kW + kh*kW + kw;
                    kernel[idx++] = W(oc, col);
                }
            }
        }
    }
}

int main(){
    size_t out_c=8, in_c=3, kH=3, kW=3;
    size_t total = out_c*in_c*kH*kW;
    vector<double> kernel(total, 0.0);
    mt19937 rng(999); normal_distribution<double> nd(0.0, 1.0);
    for (double &x : kernel) x = 2.5 * nd(rng); // deliberately large scale

    // Flatten, normalize spectrally, and write back
    Matrix W = flatten_conv_kernel(kernel, out_c, in_c, kH, kW);
    PowerIterState st; // persistent across training steps in real usage

    // Warm start then cheap maintenance
    double sigma0 = spectral_norm_power_iter(W, st, /*iters=*/5);
    double c = 1.0;
    double scale = (sigma0 > c ? sigma0 / c : 1.0);
    if (scale > 1.0) for (double &w : W.a) w /= scale;

    // Optionally maintain with one iteration per step
    double sigma1 = spectral_norm_power_iter(W, st, /*iters=*/1);
    (void)sigma1; // not used further here

    // Write back to 4D kernel layout
    assign_back_conv_kernel(kernel, W, out_c, in_c, kH, kW);

    cout << "Applied spectral normalization to conv kernel (out_c=" << out_c
         << ", in_c=" << in_c << ", kH=" << kH << ", kW=" << kW << ")\n";
}
Convolution kernels are reshaped to a 2D matrix of shape (out_channels, in_channels × kernel_height × kernel_width). Spectral normalization is applied to this matrix using power iteration, then the normalized weights are written back into the original 4D layout.