
Spectral Normalization

Key Points

  • Spectral normalization rescales a weight matrix so its largest singular value (spectral norm) is at most a target value, typically 1.
  • It controls a layer’s Lipschitz constant, helping stabilize training and prevent exploding activations or gradients.
  • The spectral norm equals the maximum stretch a matrix applies to any vector, and it is the largest singular value of the matrix.
  • A fast way to approximate it is power iteration, which alternates multiplying by W and Wᵀ and normalizing vectors.
  • In practice you keep a running estimate of the top singular vectors to make updates cheap during training.
  • For convolutional kernels, reshape the 4D tensor to a 2D matrix (out_channels by in_channels × kernel_size) before normalizing.
  • Spectral normalization differs from weight normalization and batch normalization; it constrains the operator norm, not per-weight magnitude or per-batch statistics.
  • The time cost per update using k power iterations is O(kmn) for an m×n matrix, which is much cheaper than a full SVD.

Prerequisites

  • →Vectors and matrices — Understanding matrix–vector multiplication and basic linear operations is essential for spectral norms and power iteration.
  • →Matrix and vector norms — Spectral normalization constrains an operator norm; knowing different norms clarifies why spectral norm specifically bounds worst-case stretch.
  • →Singular Value Decomposition (SVD) — The spectral norm equals the largest singular value; SVD provides the theoretical foundation and properties of singular vectors.
  • →Numerical linear algebra basics — Power iteration, convergence, and numerical stability require familiarity with iterative algorithms and floating-point issues.
  • →C++ programming basics — Implementing and using spectral normalization requires competence with arrays, loops, and simple class structures.
  • →Neural network layers and training — Motivation and correct placement of spectral normalization depend on how layers compose and how training updates weights.

Detailed Explanation


01Overview

Spectral normalization is a technique that rescales a neural network layer’s weight matrix so that its spectral norm—the largest singular value—does not exceed a chosen bound (often 1). Intuitively, the spectral norm measures the maximum factor by which the matrix can stretch any input vector. By bounding this stretch, spectral normalization controls the Lipschitz constant of the layer, which in turn helps stabilize optimization, prevents exploding activations or gradients, and often improves generalization. Originally popularized for stabilizing Generative Adversarial Networks (GANs), spectral normalization applies broadly to deep networks that benefit from controlled layer sensitivity.

A direct computation of the spectral norm via singular value decomposition (SVD) can be expensive, so practitioners typically approximate it efficiently with a few steps of power iteration, which only requires matrix–vector multiplications with W and its transpose Wᵀ. The weight matrix is then divided by the estimated largest singular value to enforce the bound.

This approach differs from techniques like weight decay or batch normalization. Weight decay penalizes the sum of squares of parameters, and batch normalization normalizes activations using batch statistics, whereas spectral normalization directly constrains the operator norm of the linear transformation itself. The result is a principled cap on how much a layer can amplify inputs, yielding more predictable network behavior.

02Intuition & Analogies

Imagine a stretchy fabric grid drawn on a table. Each point on the fabric has coordinates (like a vector). When you apply a linear transformation (your weight matrix), you’re grabbing the fabric and pulling, twisting, and possibly skewing it. Some directions might barely move, while one special direction might get stretched the most. That maximum stretch factor is the spectral norm. If you let that stretch grow unchecked, tiny imperfections or noise can be amplified enormously, causing instability. Spectral normalization is like putting a rule on how much you’re allowed to stretch the fabric in any direction—say, no more than 1. The fabric can still rotate, flip, or slightly stretch, but there’s a hard limit on the worst-case expansion. In a neural network, this limit translates into a bound on how sensitive a layer’s output can be to changes in its input.

The trick to finding how stretchy the fabric is in the worst direction is power iteration. Think of repeatedly pushing a stick (a vector) through the fabric transformation and straightening it. Each push aligns the stick closer to the most-stretched direction. After a few pushes, you also measure how much longer the stick gets—this approximates the largest singular value. Once you know that number, you simply scale the transformation down so the worst-case stretch is at your desired cap. By keeping that cap tight across layers, the whole network behaves more predictably, like a machine where every gear’s torque is bounded so the system can’t suddenly lurch out of control.

03Formal Definition

For a real matrix W ∈ ℝ^{m×n}, the spectral norm is defined as ‖W‖₂ = σ_max(W), the largest singular value of W. Equivalently, it is the induced operator norm of W with respect to the Euclidean norm: ‖W‖₂ = sup_{x≠0} ‖Wx‖₂ / ‖x‖₂. A linear layer f(x) = Wx + b is L-Lipschitz with L = ‖W‖₂. Spectral normalization enforces the constraint ‖W‖₂ ≤ c by rescaling W to Ŵ = W / max(1, σ_max(W)/c); the common choice c = 1 yields Ŵ = W / σ_max(W) when σ_max(W) > 1, otherwise Ŵ = W. Because computing σ_max(W) exactly via SVD is expensive, we approximate it with power iteration. If u ∈ ℝ^m and v ∈ ℝ^n are unit vectors, iterate v ← Wᵀu / ‖Wᵀu‖₂ and u ← Wv / ‖Wv‖₂; then σ ≈ uᵀWv approaches σ_max(W). Persisting u (or v) across training steps accelerates convergence.

04When to Use

Use spectral normalization when you need to explicitly control the Lipschitz constant of layers. This is particularly helpful in: (1) adversarial settings like GANs, where the discriminator benefits from stable gradients; (2) scenarios with potential gradient explosion (very deep networks or recurrent structures); (3) robust learning where bounded sensitivity to input perturbations is desired; and (4) theoretical contexts where proving generalization or robustness requires Lipschitz bounds. It is also useful when batch statistics are unreliable (tiny batches or highly non-stationary data) and batch normalization may hurt more than help. In such cases, spectral normalization provides a data-independent, parameter-level constraint that does not depend on batch moments. For convolutional layers, reshape the kernel to 2D and apply the same procedure; this keeps the per-layer Lipschitz bounds interpretable and consistent with fully connected layers. Avoid it if: you rely on exact parameter magnitudes for other regularizers that might conflict with rescaling; or your model already operates near the edge of capacity and the constraint would overly limit expressiveness. In some tasks, alternative regularizers (e.g., weight decay or orthogonal regularization) may suffice with less computational overhead.

⚠️Common Mistakes

• Confusing norms: The Frobenius norm ‖W‖_F is not the spectral norm ‖W‖₂. Minimizing or constraining ‖W‖_F does not cap the worst-case expansion; always use the largest singular value for spectral normalization.
• Too few power iterations: Using only one iteration with a poor initialization can badly underestimate σ_max, causing under-normalization and instability. Persist the singular-vector estimate across steps and use 1–3 iterations per update after a warm start.
• Forgetting to reshape conv kernels: Spectral normalization for convolutions requires flattening to a 2D matrix of shape (out_channels, in_channels × kH × kW). Applying per-filter or per-weight normalization is not equivalent.
• Not handling near-zero vectors: Numerical issues arise when ‖Wv‖ or ‖Wᵀu‖ is extremely small. Add a small epsilon during normalization to avoid division by zero and reinitialize vectors if they collapse.
• Rescaling biases or activations incorrectly: Only the weight matrix needs spectral normalization. Do not scale the bias term or post-activation outputs separately as part of the normalization step.
• Ignoring update frequency: Updating the spectral estimate too infrequently can let the actual norm drift above the target; too frequently wastes compute. Balance by running a few power-iteration steps per parameter update with a persistent u (or v).

Key Formulas

Operator Norm (Euclidean)

∥W∥2​=x=0sup​∥x∥2​∥Wx∥2​​

Explanation: This defines the spectral norm as the maximum amplification factor of any input vector under W. It directly measures worst-case stretch.

Spectral Norm as Largest Singular Value

‖W‖₂ = σ_max(W)

Explanation: The spectral norm equals the largest singular value from the SVD of W. This is the standard way to compute or reference it.

SVD

W = UΣVᵀ,  UᵀU = I,  VᵀV = I

Explanation: Any real matrix factors into orthogonal matrices U and V and a diagonal matrix Σ of singular values. The largest entry of Σ is the spectral norm.

Power Iteration Updates

v_{t+1} = Wᵀu_t / ‖Wᵀu_t‖₂,  u_{t+1} = W v_{t+1} / ‖W v_{t+1}‖₂

Explanation: Alternating multiplications by W and Wᵀ with normalization steer u and v toward the top singular vectors. The Rayleigh quotient uᵀWv estimates the top singular value.

Rayleigh Quotient for Singular Value

σ_t = u_tᵀ W v_t

Explanation: Given approximately aligned u and v, this inner product gives an estimate of the largest singular value. It converges as power iteration proceeds.

Projection to Spectral-Norm Ball

Ŵ = W / max{1, σ_max(W)/c}

Explanation: Rescales W so its spectral norm does not exceed c. With c = 1, it reduces to dividing by the estimated top singular value when needed.

Network Lipschitz Bound

L_network ≤ ∏_{ℓ=1}^{L} ‖W_ℓ‖₂

Explanation: The Lipschitz constant of a composition of linear layers and 1-Lipschitz activations is bounded by the product of their spectral norms. Spectral normalization caps each factor.

SVD Time Complexity

T_SVD(m, n) = O(min{mn², m²n})

Explanation: Exact SVD is generally cubic in the smaller dimension. This is often too slow for per-step training updates.

Power Iteration Time Complexity

T_power(m, n, k) = O(kmn)

Explanation: Each of k iterations performs two matrix–vector multiplies (Wv and Wᵀu). This is far cheaper than SVD when k is small.

Convergence Factor

gap = σ₂ / σ₁ < 1

Explanation: The ratio between the top two singular values controls the convergence speed of power iteration. Smaller ratios yield faster convergence.

Complexity Analysis

Let W be an m×n matrix. Exact computation of the spectral norm via SVD costs O(min{mn², m²n}) time and O(mn) space to store W, plus an additional O(min{m², n²}) for intermediate factors. This is typically prohibitive per training step in deep learning. Power iteration approximates the top singular value at substantially lower cost. Each iteration performs one multiplication by W and one by Wᵀ, each costing O(mn) for dense matrices (or O(nnz) for sparse matrices, where nnz is the number of nonzeros). With k iterations, the total time is O(kmn); in practice k ∈ {1, 2, 3} per update after warm-starting from the previous singular vector estimate. Memory overhead beyond storing W is O(m + n) to hold the working vectors u and v. Numerical stability is maintained by normalizing vectors at each step and adding a small epsilon to avoid division by zero. Applying spectral normalization by rescaling W to Ŵ = W / max(1, σ/c) takes O(mn) time (to scale all entries) and O(1) extra space beyond W.

For convolutional layers with kernel dimensions (out_c, in_c, kH, kW), we reshape to a matrix of size out_c × (in_c·kH·kW). The complexity then follows the same O(k·out_c·in_c·kH·kW) per update. Overall, compared to SVD, power iteration reduces per-step cost by orders of magnitude, making spectral normalization practical during training. The trade-off is approximation error, which can be mitigated by persistent vectors and occasional additional iterations.

Code Examples

Approximate spectral norm with power iteration for a dense matrix
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    size_t rows, cols;
    vector<double> a; // row-major data
    Matrix(size_t r=0, size_t c=0): rows(r), cols(c), a(r*c, 0.0) {}
    double &operator()(size_t i, size_t j) { return a[i*cols + j]; }
    double operator()(size_t i, size_t j) const { return a[i*cols + j]; }
};

static double l2norm(const vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x*x;
    return sqrt(max(0.0, s));
}

static void normalize(vector<double>& v, double eps=1e-12) {
    double n = l2norm(v);
    if (n < eps) {
        // Reinitialize to a unit basis vector to avoid stagnation
        fill(v.begin(), v.end(), 0.0);
        if (!v.empty()) v[0] = 1.0;
    } else {
        for (double &x : v) x /= n;
    }
}

static void matvec(const Matrix& W, const vector<double>& x, vector<double>& y) {
    // y = W x
    size_t m = W.rows, n = W.cols;
    y.assign(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        const double* row = &W.a[i*n];
        for (size_t j = 0; j < n; ++j) s += row[j] * x[j];
        y[i] = s;
    }
}

static void matTvec(const Matrix& W, const vector<double>& x, vector<double>& y) {
    // y = W^T x
    size_t m = W.rows, n = W.cols;
    y.assign(n, 0.0);
    for (size_t i = 0; i < m; ++i) {
        const double* row = &W.a[i*n];
        double xi = x[i];
        for (size_t j = 0; j < n; ++j) y[j] += row[j] * xi;
    }
}

// Power iteration to approximate the largest singular value and vectors
struct PowerIterState {
    vector<double> u; // approximates left singular vector (size m)
};

static double spectral_norm_power_iter(const Matrix& W, PowerIterState& state, int iters=5, double eps=1e-12) {
    size_t m = W.rows, n = W.cols;
    if (state.u.size() != m) {
        state.u.assign(m, 0.0);
        // Random unit vector initialization
        std::mt19937 rng(42);
        std::normal_distribution<double> nd(0.0, 1.0);
        for (size_t i = 0; i < m; ++i) state.u[i] = nd(rng);
        normalize(state.u, eps);
    }
    vector<double> v(n, 0.0), Wv(m, 0.0);
    vector<double>& u = state.u;
    for (int t = 0; t < iters; ++t) {
        // v <- normalize(W^T u)
        matTvec(W, u, v);
        normalize(v, eps);
        // u <- normalize(W v)
        matvec(W, v, u);
        normalize(u, eps);
    }
    // Rayleigh quotient sigma ≈ u^T W v
    matvec(W, v, Wv);
    double sigma = 0.0;
    for (size_t i = 0; i < m; ++i) sigma += u[i] * Wv[i];
    return fabs(sigma);
}

int main() {
    // Example: random 5x3 matrix
    size_t m = 5, n = 3;
    Matrix W(m, n);
    std::mt19937 rng(123);
    std::normal_distribution<double> nd(0.0, 1.0);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            W(i,j) = nd(rng);

    PowerIterState st; // persists u across calls
    for (int round = 0; round < 3; ++round) {
        double sigma = spectral_norm_power_iter(W, st, /*iters=*/10);
        cout << "Approximate spectral norm (round " << round << ") = " << sigma << "\n";
    }
}

This program defines a simple dense Matrix type and computes an approximation of the largest singular value via power iteration. It maintains a persistent left singular vector estimate u to accelerate convergence across repeated calls. The Rayleigh quotient u^T W v yields the spectral norm estimate.

Time: O(kmn) for k power iterations on an m×n dense matrix
Space: O(mn) to store W and O(m + n) extra for u, v, and work vectors
Apply spectral normalization to a linear layer with a persistent power-iteration state
#include <bits/stdc++.h>
using namespace std;

struct Matrix { // same as previous example (minimal redefinition for self-contained code)
    size_t rows, cols; vector<double> a;
    Matrix(size_t r=0, size_t c=0): rows(r), cols(c), a(r*c, 0.0) {}
    double &operator()(size_t i, size_t j) { return a[i*cols + j]; }
    double operator()(size_t i, size_t j) const { return a[i*cols + j]; }
};

static void matvec(const Matrix& W, const vector<double>& x, vector<double>& y) {
    size_t m=W.rows, n=W.cols; y.assign(m,0.0);
    for (size_t i=0;i<m;++i){ double s=0.0; const double* row=&W.a[i*n]; for(size_t j=0;j<n;++j) s+=row[j]*x[j]; y[i]=s; }
}
static void matTvec(const Matrix& W, const vector<double>& x, vector<double>& y) {
    size_t m=W.rows, n=W.cols; y.assign(n,0.0);
    for (size_t i=0;i<m;++i){ const double* row=&W.a[i*n]; double xi=x[i]; for(size_t j=0;j<n;++j) y[j]+=row[j]*xi; }
}
static double l2norm(const vector<double>& v){ double s=0; for(double x:v) s+=x*x; return sqrt(max(0.0,s)); }
static void normalize(vector<double>& v, double eps=1e-12){ double n=l2norm(v); if(n<eps){ fill(v.begin(),v.end(),0.0); if(!v.empty()) v[0]=1.0; } else for(double &x:v) x/=n; }

struct PowerIterState { vector<double> u; };

static double spectral_norm_power_iter(const Matrix& W, PowerIterState& st, int iters=1, double eps=1e-12){
    size_t m=W.rows, n=W.cols;
    if (st.u.size()!=m){
        st.u.assign(m,0.0);
        mt19937 rng(7); normal_distribution<double> nd(0.0,1.0);
        for(size_t i=0;i<m;++i) st.u[i]=nd(rng);
        normalize(st.u,eps);
    }
    vector<double> v(n,0.0), u=st.u, tmp(m,0.0);
    for(int t=0;t<iters;++t){ matTvec(W,u,v); normalize(v,eps); matvec(W,v,u); normalize(u,eps); }
    // Save u back to state
    st.u=u;
    matvec(W,v,tmp);
    double sigma=0.0;
    for(size_t i=0;i<m;++i) sigma+=u[i]*tmp[i];
    return fabs(sigma);
}

struct LinearSN {
    Matrix W;          // weights (out_features x in_features)
    vector<double> b;  // bias (out_features)
    double c = 1.0;    // target spectral norm bound
    PowerIterState st; // persistent state for power iteration

    LinearSN(size_t out_f, size_t in_f): W(out_f, in_f), b(out_f, 0.0) {}

    void apply_spectral_norm(int iters=1){
        double sigma = spectral_norm_power_iter(W, st, iters);
        double scale = (sigma > c ? sigma / c : 1.0);
        if (scale > 1.0){
            for (double &w : W.a) w /= scale; // rescale W so that ||W||_2 <= c
        }
    }

    vector<double> forward(const vector<double>& x){
        vector<double> y; matvec(W, x, y);
        for (size_t i=0;i<y.size();++i) y[i] += b[i]; // bias is NOT scaled by spectral norm
        return y;
    }
};

int main(){
    // Create a layer with 4 outputs and 3 inputs
    LinearSN layer(4,3);
    mt19937 rng(123); normal_distribution<double> nd(0.0,1.0);
    for (size_t i=0;i<layer.W.rows;++i)
        for (size_t j=0;j<layer.W.cols;++j)
            layer.W(i,j) = 3.0 * nd(rng); // intentionally large scale
    for (double &bi : layer.b) bi = 0.1 * nd(rng);

    // Warm up: a few iterations to get a good singular vector estimate
    layer.apply_spectral_norm(/*iters=*/5);

    // After warm start, do cheap per-step updates
    for (int step=0; step<3; ++step){
        layer.apply_spectral_norm(/*iters=*/1); // keep ||W||_2 <= 1
        // Forward on a sample input
        vector<double> x = {1.0, -2.0, 0.5};
        vector<double> y = layer.forward(x);
        cout << "Step " << step << ": output = ";
        for (double yi : y) cout << fixed << setprecision(4) << yi << ' ';
        cout << "\n";
    }
}

This example implements a linear layer that maintains a spectral norm constraint using a persistent power-iteration state. A warm start uses more iterations to align the singular vectors; subsequent updates use just one iteration, which is sufficient to maintain ||W||_2 ≤ c during training. The bias is not scaled.

Time: Each apply_spectral_norm call is O(kmn), where k is the number of power iterations
Space: O(mn) for weights plus O(m + n) for the persistent vectors
Spectral normalization for a convolutional kernel via reshape
#include <bits/stdc++.h>
using namespace std;

struct Matrix {
    size_t rows, cols; vector<double> a;
    Matrix(size_t r=0, size_t c=0): rows(r), cols(c), a(r*c, 0.0) {}
    double &operator()(size_t i, size_t j){ return a[i*cols+j]; }
    double operator()(size_t i, size_t j) const { return a[i*cols+j]; }
};
static void matvec(const Matrix& W, const vector<double>& x, vector<double>& y){
    size_t m=W.rows, n=W.cols; y.assign(m,0.0);
    for(size_t i=0;i<m;++i){ double s=0; const double* row=&W.a[i*n]; for(size_t j=0;j<n;++j) s+=row[j]*x[j]; y[i]=s; }
}
static void matTvec(const Matrix& W, const vector<double>& x, vector<double>& y){
    size_t m=W.rows, n=W.cols; y.assign(n,0.0);
    for(size_t i=0;i<m;++i){ const double* row=&W.a[i*n]; double xi=x[i]; for(size_t j=0;j<n;++j) y[j]+=row[j]*xi; }
}
static double l2norm(const vector<double>& v){ double s=0; for(double x:v) s+=x*x; return sqrt(max(0.0,s)); }
static void normalize(vector<double>& v, double eps=1e-12){ double n=l2norm(v); if(n<eps){ fill(v.begin(),v.end(),0.0); if(!v.empty()) v[0]=1.0; } else for(double &x:v) x/=n; }
struct PowerIterState{ vector<double> u; };
static double spectral_norm_power_iter(const Matrix& W, PowerIterState& st, int iters=1, double eps=1e-12){
    size_t m=W.rows, n=W.cols;
    if(st.u.size()!=m){
        st.u.assign(m,0.0);
        mt19937 rng(17); normal_distribution<double> nd(0.0,1.0);
        for(size_t i=0;i<m;++i) st.u[i]=nd(rng);
        normalize(st.u,eps);
    }
    vector<double> v(n,0.0), u=st.u, tmp(m,0.0);
    for(int t=0;t<iters;++t){ matTvec(W,u,v); normalize(v,eps); matvec(W,v,u); normalize(u,eps); }
    st.u=u;
    matvec(W,v,tmp);
    double sigma=0.0;
    for(size_t i=0;i<m;++i) sigma+=u[i]*tmp[i];
    return fabs(sigma);
}

// Flatten conv kernel (out_c, in_c, kH, kW) -> Matrix(out_c, in_c*kH*kW)
static Matrix flatten_conv_kernel(const vector<double>& kernel, size_t out_c, size_t in_c, size_t kH, size_t kW){
    Matrix W(out_c, in_c * kH * kW);
    // kernel layout assumed: [out_c][in_c][kH][kW] in row-major order
    size_t idx = 0;
    for (size_t oc = 0; oc < out_c; ++oc)
        for (size_t ic = 0; ic < in_c; ++ic)
            for (size_t kh = 0; kh < kH; ++kh)
                for (size_t kw = 0; kw < kW; ++kw){
                    size_t col = ic*kH*kW + kh*kW + kw;
                    W(oc, col) = kernel[idx++];
                }
    return W;
}

static void assign_back_conv_kernel(vector<double>& kernel, const Matrix& W, size_t out_c, size_t in_c, size_t kH, size_t kW){
    size_t idx = 0;
    for (size_t oc = 0; oc < out_c; ++oc)
        for (size_t ic = 0; ic < in_c; ++ic)
            for (size_t kh = 0; kh < kH; ++kh)
                for (size_t kw = 0; kw < kW; ++kw){
                    size_t col = ic*kH*kW + kh*kW + kw;
                    kernel[idx++] = W(oc, col);
                }
}

int main(){
    size_t out_c=8, in_c=3, kH=3, kW=3;
    size_t total = out_c*in_c*kH*kW;
    vector<double> kernel(total, 0.0);
    mt19937 rng(999); normal_distribution<double> nd(0.0, 1.0);
    for (double &x : kernel) x = 2.5 * nd(rng); // deliberately large scale

    // Flatten, normalize spectrally, and write back
    Matrix W = flatten_conv_kernel(kernel, out_c, in_c, kH, kW);
    PowerIterState st; // persistent across training steps in real usage
    // Warm start then cheap maintenance
    double sigma0 = spectral_norm_power_iter(W, st, /*iters=*/5);
    double c = 1.0;
    double scale = (sigma0 > c ? sigma0 / c : 1.0);
    if (scale > 1.0) for (double &w : W.a) w /= scale;

    // Optionally maintain with one iteration per step
    double sigma1 = spectral_norm_power_iter(W, st, /*iters=*/1);
    (void)sigma1; // not used further here

    // Write back to 4D kernel layout
    assign_back_conv_kernel(kernel, W, out_c, in_c, kH, kW);

    cout << "Applied spectral normalization to conv kernel (out_c=" << out_c
         << ", in_c=" << in_c << ", kH=" << kH << ", kW=" << kW << ")\n";
}

Convolution kernels are reshaped to a 2D matrix of shape (out_channels, in_channels × kernel_height × kernel_width). Spectral normalization is applied to this matrix using power iteration, then the normalized weights are written back into the original 4D layout.

Time: O(k · out_c · in_c · kH · kW) for k power iterations
Space: O(out_c · in_c · kH · kW) to store the kernel and O(out_c + in_c · kH · kW) extra for working vectors
#spectral normalization#spectral norm#singular value#power iteration#lipschitz constant#operator norm#svd#convolution kernel reshape#stability#regularization#neural networks#cpp implementation#matrix vector multiply#largest singular value#gan stabilization