How I Study AI
📚 Theory · Intermediate

Spectral Regularization

Key Points

  • Spectral regularization controls how much a weight matrix can stretch inputs by constraining its largest singular value (spectral norm).
  • It stabilizes training, limits exploding activations/gradients, and provides Lipschitz control for robustness and better generalization.
  • Two common approaches are adding a spectral-norm penalty to the loss or projecting/scaling weights to keep the spectral norm below a threshold.
  • The spectral norm can be efficiently approximated with power iteration, without computing a full SVD.
  • In practice, spectral normalization rescales W by its estimated top singular value at each step to enforce a target Lipschitz constant.
  • Compared with L2 weight decay (Frobenius norm), spectral regularization directly limits the worst-case amplification rather than average energy.
  • Gradients for a spectral penalty can be approximated using the top singular vectors u and v (a subgradient is u v^T).
  • Compute-accuracy trade-offs depend on the number of power iterations; a small fixed number (1–5) is often sufficient in large models.

Prerequisites

  • Linear algebra fundamentals — Understanding matrices, vectors, norms, and matrix–vector multiplication is essential to reason about spectral properties.
  • Eigenvalues, singular values, and SVD — Spectral regularization is defined in terms of singular values; knowing SVD clarifies the spectral norm and its vectors.
  • Gradient-based optimization — Implementing penalties or projections requires computing gradients and performing iterative updates.
  • Regularization in machine learning — Comparing spectral penalties with L2 weight decay and understanding bias–variance trade-offs helps in choosing methods.
  • Lipschitz continuity and robustness — The main motivation is bounding the Lipschitz constant; this concept explains stability and robustness benefits.
  • Iterative numerical methods — Power iteration is an iterative approximation; understanding convergence and warm starts is useful.

Detailed Explanation


01 Overview

Spectral regularization is a family of techniques that constrain the spectral properties of weight matrices, most notably their spectral norm (largest singular value). In simple terms, it limits how much a linear transformation W can stretch an input vector. This limit is directly tied to the Lipschitz constant of the transformation and therefore to training stability, robustness, and generalization. Spectral regularization can be implemented either as a penalty added to the loss function or as an explicit constraint via projection or normalization of weights. The penalty approach augments the objective with a term like \(\lambda \lVert W \rVert_2\) or \(\lambda \lVert W \rVert_2^2\), while the constraint approach rescales or projects W so its spectral norm does not exceed a target value c. Because computing exact singular values is expensive, practical implementations rely on power iteration to estimate the top singular value and its vectors efficiently. Spectral regularization is widely used in deep learning for stabilizing training of recurrent networks, controlling the Lipschitz constant in Wasserstein GAN critics, and improving robustness against adversarial perturbations, all by ensuring that layers do not excessively amplify inputs or gradients.

02 Intuition & Analogies

Imagine your model’s weight matrix as a stretchy fabric. Pull it in different directions and it stretches by different amounts. The direction where it stretches the most is the most dangerous: a tiny input change along that direction can become a big output change. Spectral regularization is like setting a maximum stretch limit on the fabric so no direction can get outrageously amplified. Another analogy is a volume knob on an amplifier. The spectral norm is the highest volume the amplifier can reach for any song (input vector). If that maximum is too high, some songs will sound distorted or blow the speakers; capping the knob keeps everything under control. In optimization terms, this cap is the Lipschitz constant: it bounds how much the output can change when the input changes. Penalty-based spectral regularization is like charging a fee the louder you go, encouraging the system to keep volumes moderate. Projection- or normalization-based methods are like an automatic limiter that turns the volume down whenever it exceeds a threshold. Compared to traditional L2 weight decay (which punishes overall weight energy), spectral regularization targets the single worst-case direction of amplification. This is why it’s particularly effective for preventing exploding activations in deep or recurrent networks and for enforcing smoothness in models that must be 1-Lipschitz, such as WGAN critics.

03 Formal Definition

For a matrix \(W \in \mathbb{R}^{m \times n}\), the spectral norm (operator 2-norm) is defined as \(\lVert W \rVert_2 = \max_{x \ne 0} \frac{\lVert Wx \rVert_2}{\lVert x \rVert_2} = \sigma_{\max}(W)\), where \(\sigma_{\max}(W)\) is the largest singular value of W. Equivalently, \(\lVert W \rVert_2 = \sqrt{\lambda_{\max}(W^\top W)}\). The Lipschitz constant of the linear map \(x \mapsto Wx\) under the Euclidean norm equals \(\lVert W \rVert_2\). Spectral regularization augments an empirical risk \(L_{\text{data}}(\theta)\) with a spectral penalty on each weight matrix \(W_\ell\) in the model: \(L(\theta) = L_{\text{data}}(\theta) + \sum_\ell \lambda_\ell R(W_\ell)\), where \(R(W)\) is commonly \(\lVert W \rVert_2\) or \(\lVert W \rVert_2^2\). A constraint-based variant enforces \(\lVert W \rVert_2 \le c\) by projection: after each update, replace W by \(\Pi_{\lVert \cdot \rVert_2 \le c}(W)\), which for \(\lVert W \rVert_2 > c\) equals \(\frac{c}{\lVert W \rVert_2} W\). In practice, \(\lVert W \rVert_2\) and the top singular vectors \(u, v\) are approximated via power iteration: repeatedly update \(u \leftarrow \frac{Wv}{\lVert Wv \rVert_2}\) and \(v \leftarrow \frac{W^\top u}{\lVert W^\top u \rVert_2}\). When the top singular value is simple, a subgradient of \(\lVert W \rVert_2\) is \(uv^\top\).

04 When to Use

Use spectral regularization when you need tight control of the worst-case amplification of a layer or model. Typical scenarios include: (1) Stabilizing deep or recurrent networks where activations or gradients can explode; bounding \(\lVert W \rVert_2\) directly limits the step-to-step growth. (2) Enforcing Lipschitz constraints for models such as WGAN critics, where 1-Lipschitzness is required; spectral normalization achieves this by rescaling weights. (3) Improving robustness to input noise and adversarial perturbations by limiting how sensitive outputs can be to small input changes. (4) Encouraging better generalization by reducing the capacity for sharp decision boundaries that depend on extreme amplifications. Choose a penalty when you want a soft preference for small spectral norms that the optimizer can trade off against data fit. Choose projection/normalization when you must obey a hard limit (e.g., \(\lVert W \rVert_2 \le c\)) at all times. In large-scale training, prefer approximate methods (1–5 power iterations per step) to keep overhead manageable, and consider reusing the previous singular vectors as warm starts to reduce iterations.

⚠️ Common Mistakes

Common pitfalls include: (1) Computing an exact SVD at every update, which is computationally prohibitive; use power iteration to approximate the top singular value and vectors. (2) Using too few or too many power iterations; too few yields a biased, under-estimated norm (weak constraint), while too many waste compute. Start with 1–3 and warm start from the previous iteration’s vectors. (3) Confusing Frobenius norm regularization with spectral norm regularization; L2 weight decay reduces total weight energy but does not bound worst-case amplification. (4) Forgetting to stop gradients through the normalization step in frameworks that require it; in pure C++ you must conceptually separate the rescaling from gradient computation or be aware that the normalization changes the effective parameterization. (5) Ignoring non-differentiability when the top singular value is not unique; the subgradient \(uv^\top\) is valid only when the top singular value is simple—otherwise, any convex combination of outer products of top singular vectors is a subgradient. (6) Applying the constraint inconsistently across layers with different scales; if you cap some layers but not others, activations may simply shift amplification. (7) Setting c too low and bottlenecking learning; monitor training loss and gradient norms to tune c or \(\lambda\). (8) Failing to renormalize after optimizer steps (for projection methods); always project after each update to maintain the constraint.

Key Formulas

Spectral Norm Definition

∥W∥2​=x=0max​∥x∥2​∥Wx∥2​​=σmax​(W)

Explanation: The spectral norm is the maximum amplification factor of W over all directions x. It equals the largest singular value of W.

Spectral Norm via Eigenvalues

\(\lVert W \rVert_2 = \sqrt{\lambda_{\max}(W^\top W)}\)

Explanation: The square of the spectral norm is the largest eigenvalue of \(W^\top W\). This connects singular values to eigenvalues of a symmetric matrix.

Lipschitz Constant

L(f)=x=ysup​∥x−y∥2​∥f(x)−f(y)∥2​​,L(x↦Wx)=∥W∥2​

Explanation: The Lipschitz constant bounds how much outputs can change per unit input change. For a linear map, it equals the spectral norm.

Regularized Objective

\(L(\theta) = L_{\text{data}}(\theta) + \sum_{\ell} \lambda_\ell R(W_\ell), \qquad R(W) \in \{\lVert W \rVert_2,\ \lVert W \rVert_2^2\}\)

Explanation: Training can penalize large spectral norms by adding a regularization term to the data loss. The penalty can be linear or quadratic in the spectral norm.

Spectral Normalization

\(W_{\text{norm}} = \dfrac{c}{\sigma_{\max}(W)}\, W\)

Explanation: Rescale W so that its spectral norm is exactly c. This enforces a desired Lipschitz bound per layer.

Power Iteration Updates

\(u_{t+1} = \dfrac{W v_t}{\lVert W v_t \rVert_2}, \qquad v_{t+1} = \dfrac{W^\top u_{t+1}}{\lVert W^\top u_{t+1} \rVert_2}\)

Explanation: Alternating between u and v converges to the top singular vectors. The estimate of the top singular value is \(u^\top W v\).

Subgradient of Spectral Norm

\(\partial \lVert W \rVert_2 \ni u^* (v^*)^\top\)

Explanation: When the largest singular value is unique, the outer product of its singular vectors is a valid subgradient. It guides gradient-based optimization with a spectral penalty.

Norm Inequalities

\(\lVert W \rVert_2 \le \lVert W \rVert_F \le \sqrt{\operatorname{rank}(W)}\, \lVert W \rVert_2\)

Explanation: The Frobenius norm is an upper bound on the spectral norm. These inequalities relate average energy to worst-case amplification.

Nuclear Norm

\(\lVert W \rVert_* = \sum_{i=1}^{\min(m,n)} \sigma_i(W)\)

Explanation: The nuclear norm sums all singular values. It is often used to encourage low rank, unlike spectral regularization which targets only the largest singular value.

Condition Number

\(\kappa_2(W) = \dfrac{\sigma_{\max}(W)}{\sigma_{\min}(W)}\)

Explanation: A high condition number indicates sensitivity to perturbations. Reducing \(\sigma_{\max}\) helps if \(\sigma_{\min}\) is fixed.

Projected Gradient Step

\(W^+ = \Pi_{\lVert \cdot \rVert_2 \le c}\big(W - \eta \nabla L(W)\big) = \begin{cases} W - \eta \nabla L(W), & \lVert W - \eta \nabla L(W) \rVert_2 \le c \\ \dfrac{c}{\lVert W - \eta \nabla L(W) \rVert_2}\,\big(W - \eta \nabla L(W)\big), & \text{otherwise} \end{cases}\)

Explanation: After a gradient step, project back to the spectral-norm ball of radius c. For this ball, projection is a simple rescaling if the norm is too large.

Singular Value Estimate

\(\hat{\sigma} = u^\top W v\)

Explanation: Given approximate top singular vectors u and v, the Rayleigh quotient \(u^\top W v\) estimates the largest singular value. It is cheap to compute once u and v are known.

Complexity Analysis

The dominant computational cost in spectral regularization is estimating the spectral norm of each weight matrix. Using k steps of power iteration on an m×n matrix requires O(k m n) time: each iteration performs one matrix–vector multiply (Wv) costing O(m n) and one transpose multiply (W^T u) costing O(m n). Memory usage is O(m n) to store W plus O(m + n) for the working vectors u and v; the extra space is negligible compared to the matrix itself. In a neural network with L layers, applying spectral penalties or projections to all layers multiplies this overhead by L. When training with batches, the spectral-norm computation is independent of the batch size, so the relative overhead is smaller for larger batches. Penalty-based methods add the cost of computing a subgradient, which is simply the outer product u v^T in O(m n) time and space to accumulate into the weight gradient. Projection/normalization methods add one rescaling of W, also O(m n). In practice, k is kept small (e.g., k = 1–5) and warm-started from the previous iteration’s vectors so that the incremental cost per training step remains modest. Exact SVD would cost O(min(m n^2, n m^2)) per layer, which is typically prohibitive for large matrices. Therefore, approximate methods are preferred for scalability, trading a small bias in the norm estimate for substantial speed. Overall, with small k, the amortized overhead per step is linear in the number of parameters of each constrained matrix.

Code Examples

Estimating the spectral norm with power iteration (u, v, and σmax)
#include <bits/stdc++.h>
using namespace std;

using Matrix = vector<vector<double>>;
using Vector = vector<double>;

// Multiply matrix W (m x n) by vector x (n)
Vector matVec(const Matrix& W, const Vector& x) {
    size_t m = W.size();
    size_t n = W[0].size();
    Vector y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

// Multiply W^T (n x m) by vector u (m)
Vector matTVec(const Matrix& W, const Vector& u) {
    size_t m = W.size();
    size_t n = W[0].size();
    Vector v(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < m; ++i) s += W[i][j] * u[i];
        v[j] = s;
    }
    return v;
}

double dot(const Vector& a, const Vector& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const Vector& x) {
    return sqrt(max(0.0, dot(x, x)));
}

void normalize(Vector& x) {
    double n = norm2(x);
    if (n < 1e-12) return; // avoid divide by zero
    for (double& xi : x) xi /= n;
}

// Power iteration to approximate the top singular value and vectors.
// Returns the estimated sigma; outputs u and v if provided.
double spectralNormPowerIteration(const Matrix& W, int iters, Vector* u_out, Vector* v_out) {
    size_t m = W.size();
    size_t n = W[0].size();
    // Initialize v with random values
    std::mt19937 rng(42);
    std::normal_distribution<double> dist(0.0, 1.0);
    Vector v(n);
    for (size_t j = 0; j < n; ++j) v[j] = dist(rng);
    normalize(v);

    Vector u(m, 0.0);
    for (int t = 0; t < iters; ++t) {
        u = matVec(W, v);
        normalize(u);
        v = matTVec(W, u);
        normalize(v);
    }
    // Rayleigh quotient estimate of sigma
    Vector Wv = matVec(W, v);
    double sigma = norm2(Wv); // equivalent to u^T W v when u ~ Wv/||Wv||

    if (u_out) *u_out = u;
    if (v_out) *v_out = v;
    return sigma;
}

int main() {
    // Example matrix (m x n)
    size_t m = 4, n = 3;
    Matrix W(m, vector<double>(n, 0.0));
    // Fill W with some values
    double val = 0.1;
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            W[i][j] = val += 0.17; // arbitrary nontrivial entries

    Vector u, v;
    double sigma = spectralNormPowerIteration(W, 20, &u, &v);

    cout << fixed << setprecision(6);
    cout << "Estimated spectral norm (sigma_max): " << sigma << "\n";
    cout << "u (top left singular vector): "; for (double x : u) cout << x << ' '; cout << "\n";
    cout << "v (top right singular vector): "; for (double x : v) cout << x << ' '; cout << "\n";
    return 0;
}

This program estimates the largest singular value (spectral norm) and its corresponding singular vectors using power iteration. It alternates normalizing Wv and W^T u. The final estimate of σ_max is ||W v||_2, which equals u^T W v when u is aligned with Wv. This avoids the cost of a full SVD.

Time: O(k m n) for k iterations on an m×n matrix
Space: O(m n) for W and O(m + n) extra for u, v, and temporaries
Linear regression with spectral-norm penalty using power-iteration subgradient
#include <bits/stdc++.h>
using namespace std;

using Matrix = vector<vector<double>>;
using Vector = vector<double>;

Vector matVec(const Matrix& W, const Vector& x) {
    size_t m = W.size(), n = W[0].size();
    Vector y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

Vector matTVec(const Matrix& W, const Vector& u) {
    size_t m = W.size(), n = W[0].size();
    Vector v(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < m; ++i) s += W[i][j] * u[i];
        v[j] = s;
    }
    return v;
}

Matrix outer(const Vector& a, const Vector& b) {
    size_t m = a.size(), n = b.size();
    Matrix M(m, vector<double>(n));
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) M[i][j] = a[i] * b[j];
    return M;
}

double dot(const Vector& a, const Vector& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const Vector& x) { return sqrt(max(0.0, dot(x, x))); }

void normalize(Vector& x) {
    double n = norm2(x);
    if (n < 1e-12) return;
    for (double& xi : x) xi /= n;
}

// Power iteration returning sigma and (u, v); v is warm-started when nonzero
double spectralNormUV(const Matrix& W, int iters, Vector& u, Vector& v) {
    size_t m = W.size(), n = W[0].size();
    if (v.size() != n) v.assign(n, 0.0);
    if (u.size() != m) u.assign(m, 0.0);
    static std::mt19937 rng(123);
    std::normal_distribution<double> dist(0.0, 1.0);
    if (all_of(v.begin(), v.end(), [](double x) { return x == 0.0; })) {
        for (size_t j = 0; j < n; ++j) v[j] = dist(rng);
        normalize(v);
    }
    for (int t = 0; t < iters; ++t) {
        u = matVec(W, v); normalize(u);
        v = matTVec(W, u); normalize(v);
    }
    Vector Wv = matVec(W, v);
    return norm2(Wv);
}

int main() {
    ios::sync_with_stdio(false);
    // Synthetic linear regression: y ~ W x + noise
    size_t in_dim = 5, out_dim = 3, n_samples = 200;
    Matrix X(n_samples, Vector(in_dim));
    Matrix Y(n_samples, Vector(out_dim));
    std::mt19937 rng(7);
    std::normal_distribution<double> nx(0.0, 1.0), nnoise(0.0, 0.1);

    // True weights
    Matrix W_true(out_dim, Vector(in_dim));
    for (size_t i = 0; i < out_dim; ++i)
        for (size_t j = 0; j < in_dim; ++j)
            W_true[i][j] = (i + 1) * 0.4 + (j + 1) * 0.2;

    // Generate data
    for (size_t s = 0; s < n_samples; ++s) {
        for (size_t j = 0; j < in_dim; ++j) X[s][j] = nx(rng);
        Vector y = matVec(W_true, X[s]);
        for (size_t i = 0; i < out_dim; ++i) Y[s][i] = y[i] + nnoise(rng);
    }

    // Model weights (to learn)
    Matrix W(out_dim, Vector(in_dim, 0.0));

    double lr = 0.05, lambda = 0.1;
    int epochs = 200, pow_iters = 2;          // small k with warm starts
    Vector u(out_dim, 0.0), v(in_dim, 0.0);   // warm-start vectors for power iteration

    for (int ep = 0; ep < epochs; ++ep) {
        // Compute MSE gradient over the full dataset
        Matrix grad(out_dim, Vector(in_dim, 0.0));
        double mse = 0.0;
        for (size_t s = 0; s < n_samples; ++s) {
            const Vector& x = X[s];
            const Vector& y = Y[s];
            Vector yhat = matVec(W, x);
            Vector err(out_dim);
            for (size_t i = 0; i < out_dim; ++i) { err[i] = yhat[i] - y[i]; mse += err[i] * err[i]; }
            Matrix g = outer(err, x); // d/dW of (1/2)||Wx - y||^2 = err * x^T
            for (size_t i = 0; i < out_dim; ++i)
                for (size_t j = 0; j < in_dim; ++j)
                    grad[i][j] += g[i][j];
        }
        mse /= n_samples; // average per sample (without the 1/2 factor, for simplicity)
        // Average the gradient with a factor 2, since d||e||^2 = 2 e de
        for (size_t i = 0; i < out_dim; ++i)
            for (size_t j = 0; j < in_dim; ++j)
                grad[i][j] *= (2.0 / n_samples);

        // Spectral penalty: lambda * ||W||_2, subgradient approx = lambda * u v^T
        double sigma = spectralNormUV(W, pow_iters, u, v);
        Matrix pen = outer(u, v);
        for (size_t i = 0; i < out_dim; ++i)
            for (size_t j = 0; j < in_dim; ++j)
                grad[i][j] += lambda * pen[i][j];

        // Gradient descent step
        for (size_t i = 0; i < out_dim; ++i)
            for (size_t j = 0; j < in_dim; ++j)
                W[i][j] -= lr * grad[i][j];

        if ((ep + 1) % 50 == 0 || ep == 0) {
            cout << "Epoch " << (ep + 1) << ": MSE=" << mse
                 << ", sigma_max(W)~" << sigma << "\n";
        }
    }

    // Report the final spectral norm
    Vector uu, vv;
    double final_sigma = spectralNormUV(W, 10, uu, vv);
    cout << fixed << setprecision(6) << "Final estimated sigma_max(W): " << final_sigma << "\n";
    return 0;
}

This example learns a linear map W for regression while penalizing its spectral norm. It estimates the top singular vectors via a few power iterations per epoch (warm-started for speed). The subgradient of the penalty is approximated as u v^T and added to the MSE gradient, encouraging a smaller spectral norm while fitting the data.

Time: per epoch, O(n_samples × in_dim × out_dim) for the MSE gradient plus O(k × out_dim × in_dim) for power iteration
Space: O(out_dim × in_dim) for W and gradients, plus O(out_dim + in_dim) for u and v
Projection: enforce ||W||_2 ≤ c by rescaling (spectral normalization as a hard constraint)
#include <bits/stdc++.h>
using namespace std;

using Matrix = vector<vector<double>>;
using Vector = vector<double>;

Vector matVec(const Matrix& W, const Vector& x) {
    size_t m = W.size(), n = W[0].size();
    Vector y(m, 0.0);
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j) s += W[i][j] * x[j];
        y[i] = s;
    }
    return y;
}

Vector matTVec(const Matrix& W, const Vector& u) {
    size_t m = W.size(), n = W[0].size();
    Vector v(n, 0.0);
    for (size_t j = 0; j < n; ++j) {
        double s = 0.0;
        for (size_t i = 0; i < m; ++i) s += W[i][j] * u[i];
        v[j] = s;
    }
    return v;
}

double dot(const Vector& a, const Vector& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

double norm2(const Vector& x) { return sqrt(max(0.0, dot(x, x))); }

void normalize(Vector& x) {
    double n = norm2(x);
    if (n < 1e-12) return;
    for (double& xi : x) xi /= n;
}

double spectralNorm(const Matrix& W, int iters) {
    size_t m = W.size(), n = W[0].size();
    Vector v(n, 0.0), u(m, 0.0);
    std::mt19937 rng(9);
    std::normal_distribution<double> d(0.0, 1.0);
    for (size_t j = 0; j < n; ++j) v[j] = d(rng);
    normalize(v);
    for (int t = 0; t < iters; ++t) {
        u = matVec(W, v); normalize(u);
        v = matTVec(W, u); normalize(v);
    }
    Vector Wv = matVec(W, v);
    return norm2(Wv);
}

// Project W onto the spectral-norm ball of radius c by rescaling
void projectSpectralNorm(Matrix& W, double c, int iters = 3) {
    double sigma = spectralNorm(W, iters);
    if (sigma > c && sigma > 1e-12) {
        double scale = c / sigma;
        for (auto& row : W)
            for (double& w : row) w *= scale;
    }
}

int main() {
    size_t m = 6, n = 4;
    Matrix W(m, vector<double>(n));
    std::mt19937 rng(1);
    std::normal_distribution<double> d(0.0, 1.0);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) W[i][j] = d(rng);
    double c = 1.5;

    double before = spectralNorm(W, 15);
    projectSpectralNorm(W, c, 5);
    double after = spectralNorm(W, 15);

    cout << fixed << setprecision(6)
         << "Before projection, sigma_max(W)~" << before << "\n"
         << "After projection, sigma_max(W)~" << after << " (target c=" << c << ")\n";
    return 0;
}

This snippet enforces a hard spectral-norm constraint by rescaling W if its estimated spectral norm exceeds c. It demonstrates the projection step commonly used in projected gradient methods and spectral normalization layers.

Time: O(k m n) per projection (k power iterations on an m×n matrix)
Space: O(m n) for W and O(m + n) extra for vectors
#spectral-regularization #spectral-norm #power-iteration #singular-value #lipschitz-constant #spectral-normalization #projected-gradient #frobenius-norm #nuclear-norm #subgradient #wgan #adversarial-robustness #operator-norm #condition-number