GAN Theory
Key Points
- Generative Adversarial Networks (GANs) set up a two-player game where a generator tries to make fake samples that look real while a discriminator tries to tell real from fake.
- The classic GAN objective is a minimax game whose optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)).
- With the optimal discriminator, training the generator is equivalent to minimizing the Jensen–Shannon (JS) divergence between the real and generated distributions.
- Training can be unstable due to vanishing gradients and mode collapse; the non-saturating generator loss and careful optimization help.
- Wasserstein GANs replace the JS divergence with the Earth Mover (Wasserstein-1) distance, leading to smoother gradients and better stability.
- WGANs need a 1-Lipschitz critic; this can be enforced via weight clipping, gradient penalty, or spectral normalization.
- Spectral normalization scales weight matrices by their largest singular value to control the Lipschitz constant.
- There is a tradeoff between generating very sharp samples (quality) and covering all modes of the data (coverage).
Prerequisites
- Probability distributions and expectations — GAN objectives are expectations over data and noise; understanding sampling and Monte Carlo estimation is essential.
- Gradient-based optimization (SGD/Adam) and chain rule — Training involves alternating gradient steps and backpropagation through networks.
- Neural network basics (layers, activations, loss functions) — GANs use neural networks for both generator and discriminator/critic.
- Divergences and distances (KL, JS, Wasserstein) — Theoretical understanding of what GANs minimize and why WGAN improves stability relies on these concepts.
- Lipschitz continuity and regularization — WGAN requires a 1-Lipschitz critic; techniques like spectral normalization enforce it.
- Linear algebra (singular values, spectral norm) — Spectral normalization depends on estimating the largest singular value of weight matrices.
- Minimax optimization and game dynamics — GAN training is a two-player game; dynamics and equilibria inform stability strategies.
- Monte Carlo methods — GAN losses are estimated by sampling minibatches from data and noise distributions.
Detailed Explanation
01 Overview
Generative Adversarial Networks (GANs) are a framework for learning to generate realistic data by pitting two neural networks against each other in a game. The generator G maps random noise to data-like samples, while the discriminator D judges whether an input is real (from the dataset) or fake (from the generator). The classical GAN objective is a minimax problem: the discriminator aims to maximize its classification accuracy, and the generator aims to minimize it by producing indistinguishable samples. At equilibrium, if both players have enough capacity and are trained to optimality, the generator distribution matches the real data distribution.

GANs are powerful for image, audio, and text synthesis, but training is notoriously delicate. The learning dynamics can oscillate, gradients can vanish (especially when the discriminator is too strong), and the generator may collapse to a few modes (mode collapse). Several improvements address these issues: the non-saturating generator loss provides stronger gradients; Wasserstein GANs (WGAN) replace the Jensen–Shannon divergence with the Earth Mover distance; and Lipschitz constraints on the critic using techniques like weight clipping, gradient penalty, or spectral normalization improve stability. Practical recipes—such as balanced update frequencies, careful learning rates, and architectural choices—are crucial for success.
02 Intuition & Analogies
Imagine a talented counterfeiter (the generator) and a skilled detective (the discriminator). The counterfeiter makes fake bills; the detective tries to spot fakes. After each round, both learn: the detective notes the tells of the fakes, and the counterfeiter improves based on what fooled or failed. Over time, the counterfeiter’s bills become more convincing, and the detective becomes sharper. The game keeps going until the fakes are so good that even the detective can’t reliably tell them apart.

In classic GANs, the detective assigns a probability D(x) that a bill is real. The counterfeiter learns to produce bills that raise that probability. If the detective becomes near-perfect too quickly, the counterfeiter gets little feedback (vanishing gradients)—like receiving the unhelpful comment “it’s obviously fake,” without clues why. The non-saturating loss is like giving the counterfeiter richer hints even when the detective is confident.

Switching to a Wasserstein GAN changes the metric of “how fake” something is. Instead of a yes/no probability, the critic scores inputs so that real samples have higher scores than fake ones. The difference in average scores approximates the minimal “work” to transform fake into real—the Earth Mover distance. This is smoother: even if fakes are far from real, the critic still provides useful gradients pointing the generator in the right direction. To keep this scoring meaningful, the critic must be 1-Lipschitz—think of it as limiting how steep the critic’s scoring function can be, preventing it from overreacting to small changes and thereby stabilizing training.
03 Formal Definition
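Let $p_{\text{data}}$ denote the real data distribution and $p_z$ a noise prior (e.g., a standard Gaussian). The generator $G$ maps noise $z \sim p_z$ to a sample $G(z)$, inducing a model distribution $p_g$; the discriminator $D(x) \in (0, 1)$ estimates the probability that $x$ is real. The classic GAN is the minimax problem

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))].$$

With unlimited capacity and exact optimization, the unique equilibrium is $p_g = p_{\text{data}}$, at which $D(x) = \tfrac{1}{2}$ everywhere.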
04 When to Use
Use GANs when you need high-fidelity sample generation or style transfer and have access to large datasets. Typical guidance:

- Image synthesis (faces, scenes), super-resolution, inpainting, data augmentation, and domain translation (e.g., horses↔zebras): GANs shine when you value visual or perceptual quality and can tolerate less direct likelihood modeling.
- Prefer the classic GAN loss when training is relatively stable (e.g., small models, well-curated datasets) and you can leverage tricks like the non-saturating loss, label smoothing, and balanced updates.
- Switch to WGAN (with gradient penalty or spectral normalization) if training collapses or gradients vanish—especially when real and generated distributions have limited overlap early in training.
- For applications where coverage matters (avoiding missing modes), monitor diversity metrics and consider techniques like minibatch discrimination, unrolled GANs, or curriculum/progressive growing to improve coverage.
- If enforcing Lipschitz continuity is critical (e.g., WGAN or theoretical guarantees), use spectral normalization—it is lightweight and effective for convolutional and linear layers.
- For very high-resolution image generation, progressive growing stabilizes training by starting with small images and incrementally adding layers as training proceeds.
- In resource-constrained or prototyping scenarios, start with a toy 1D or low-dimensional GAN to understand the dynamics (as in the C++ examples below), then scale to full neural networks using a deep learning library’s C++ API (e.g., LibTorch) once the principles are clear.
⚠️ Common Mistakes
- Using the saturating generator loss with a too-strong discriminator, leading to vanishing gradients. Prefer the non-saturating loss or WGAN to maintain useful gradients.
- Forgetting to enforce Lipschitz constraints in WGANs. Without clipping, gradient penalty, or spectral normalization, the critic can explode, invalidating the Wasserstein estimate and destabilizing training.
- Updating one player far more than the other or using mismatched learning rates, causing oscillations or collapse. Tune update ratios (e.g., 1–5 critic steps per generator step), monitor losses, and adjust schedules.
- Training the discriminator on generator outputs while letting gradients flow back during the discriminator update (i.e., not detaching). This leaks gradients in unintended ways and corrupts the alternating-optimization semantics. Ensure proper separation between updates.
- Misinterpreting GAN losses as absolute quality metrics. Discriminator/critic losses are not directly comparable across runs or architectures; use sample quality/diversity metrics and qualitative inspection.
- Over-regularizing the discriminator with batch normalization in WGAN-GP, which can conflict with the gradient penalty. Prefer layer norm, instance norm, or spectral normalization in the critic.
- Neglecting mode coverage: a generator that produces stunning but repetitive images is still failing. Use batch-wise diversity checks, minibatch features, or explicit penalties to detect and reduce mode collapse.
- Implementing spectral normalization incorrectly (e.g., too few power iterations or not updating the singular vectors each step), which weakens the Lipschitz control. Run at least 1–2 power iterations per update and cache the vectors between steps.
Key Formulas
GAN Minimax Objective

$$\min_G \max_D \, V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$

Explanation: The discriminator maximizes correct classification of real vs. fake; the generator minimizes it by producing confusable samples. This sets up the adversarial game.
Optimal Discriminator

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

Explanation: Given a fixed generator distribution $p_g$, the best discriminator outputs the ratio of real probability to the total of real and fake probabilities. This achieves the maximum for the discriminator.
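As a quick numeric check, here is a minimal sketch assuming the same 1D Gaussians as the toy code examples below (real ~ N(2, 1), generated ~ N(−2, 1)); the optimal discriminator is evaluated in closed form from the two densities:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Gaussian density N(mu, sigma^2)
double gauss_pdf(double x, double mu, double sigma) {
    static const double PI = 3.14159265358979323846;
    double d = (x - mu) / sigma;
    return exp(-0.5 * d * d) / (sigma * sqrt(2.0 * PI));
}

int main() {
    // Assumed toy setup: real ~ N(2,1), generated ~ N(-2,1)
    const double mu_real = 2.0, mu_g = -2.0, sigma = 1.0;
    cout << fixed << setprecision(4);
    for (double x = -4.0; x <= 4.0; x += 2.0) {
        double pr = gauss_pdf(x, mu_real, sigma);
        double pg = gauss_pdf(x, mu_g, sigma);
        // D*(x) = p_data(x) / (p_data(x) + p_g(x))
        cout << "x = " << setw(5) << x << "  D*(x) = " << pr / (pr + pg) << "\n";
    }
    return 0;
}
```

At x = 0 the densities are equal and D*(0) = 0.5; far to the right D* approaches 1, and far to the left it approaches 0, as the formula predicts.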
JS Divergence Connection

$$C(G) = \max_D V(D, G) = -\log 4 + 2 \, \mathrm{JSD}\!\left(p_{\text{data}} \,\|\, p_g\right)$$

Explanation: Plugging the optimal discriminator into the GAN objective reveals that training the generator minimizes the Jensen–Shannon divergence between real and generated distributions up to a constant.
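To make the divergence concrete, here is a small self-contained sketch computing JSD for discrete distributions (the example probabilities P and Q are arbitrary). Note that JSD saturates at log 2 ≈ 0.693 when the supports are disjoint, which is exactly the regime where the classic GAN loss provides vanishing gradients:

```cpp
#include <bits/stdc++.h>
using namespace std;

// KL(P || Q) for discrete distributions on the same support
double kl(const vector<double>& p, const vector<double>& q) {
    double s = 0.0;
    for (size_t i = 0; i < p.size(); ++i)
        if (p[i] > 0.0) s += p[i] * log(p[i] / max(q[i], 1e-12));
    return s;
}

// JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2
double jsd(const vector<double>& p, const vector<double>& q) {
    vector<double> m(p.size());
    for (size_t i = 0; i < p.size(); ++i) m[i] = 0.5 * (p[i] + q[i]);
    return 0.5 * kl(p, m) + 0.5 * kl(q, m);
}

int main() {
    vector<double> p = {0.7, 0.2, 0.1};   // arbitrary example distribution
    vector<double> q = {0.1, 0.2, 0.7};   // arbitrary example distribution
    cout << "JSD(P||Q) = " << jsd(p, q) << "\n";  // symmetric, bounded by log 2
    cout << "JSD(P||P) = " << jsd(p, p) << "\n";  // zero when distributions match
    return 0;
}
```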
Non-saturating Generator Loss

$$L_G = -\,\mathbb{E}_{z \sim p_z}\left[\log D(G(z))\right]$$

Explanation: This alternative generator objective avoids saturation and provides stronger gradients when the discriminator is strong, improving training stability.
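The stability benefit can be read directly off the derivatives of the two generator losses with respect to D(G(z)); this short sketch just tabulates them:

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    // Compare generator gradient magnitudes w.r.t. D(G(z)) for the two losses:
    //   saturating:      d/dD [ log(1 - D) ] = -1 / (1 - D)
    //   non-saturating:  d/dD [ -log D ]     = -1 / D
    cout << fixed << setprecision(4);
    for (double D : {0.01, 0.1, 0.5, 0.9}) {
        double g_sat = -1.0 / (1.0 - D);
        double g_ns  = -1.0 / D;
        cout << "D(G(z)) = " << D
             << " | saturating grad = " << g_sat
             << " | non-saturating grad = " << g_ns << "\n";
    }
    // When the discriminator confidently rejects fakes (D near 0), the
    // saturating gradient is small (about -1) while the non-saturating
    // gradient is large, giving the generator a strong learning signal.
    return 0;
}
```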
Primal Wasserstein-1 Distance

$$W_1(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]$$

Explanation: The Earth Mover distance is the minimum average cost to transport mass to turn Q into P, over all couplings γ with marginals P and Q. It reflects geometric closeness of distributions.
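In one dimension the optimal coupling simply matches samples in sorted order, so the empirical W1 between equal-size sample sets can be computed exactly by sorting. A minimal sketch, assuming the same Gaussians as the toy examples below:

```cpp
#include <bits/stdc++.h>
using namespace std;

// In 1D, the optimal transport plan between two equal-size empirical
// distributions matches samples in sorted order, so
//   W1 = (1/n) * sum_i |x_(i) - y_(i)|.
double w1_empirical_1d(vector<double> x, vector<double> y) {
    sort(x.begin(), x.end());
    sort(y.begin(), y.end());
    double s = 0.0;
    for (size_t i = 0; i < x.size(); ++i) s += fabs(x[i] - y[i]);
    return s / x.size();
}

int main() {
    mt19937 rng(1);
    normal_distribution<double> real(2.0, 1.0), fake(-2.0, 1.0);
    const int n = 100000;
    vector<double> xr(n), xf(n);
    for (int i = 0; i < n; ++i) { xr[i] = real(rng); xf[i] = fake(rng); }
    // For two Gaussians differing only by a mean shift, W1 equals the
    // shift: |2 - (-2)| = 4.
    cout << "Empirical W1 ≈ " << w1_empirical_1d(xr, xf) << "\n";
    return 0;
}
```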
Kantorovich–Rubinstein Duality

$$W_1(P, Q) = \sup_{\lVert f \rVert_L \le 1} \; \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)]$$

Explanation: The Wasserstein-1 distance equals the maximum difference in expectations over all 1-Lipschitz functions. WGANs approximate this by training a 1-Lipschitz critic.
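The dual form can be checked against the primal sketch above: for a pure mean shift, the 1-Lipschitz function f(x) = x attains the supremum, so a Monte Carlo estimate of E_P[f] − E_Q[f] should recover W1 = 4:

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    // For a pure mean shift, the identity f(x) = x is 1-Lipschitz and attains
    // the supremum in the dual, so E_P[f] - E_Q[f] recovers W1 directly.
    mt19937 rng(2);
    normal_distribution<double> real(2.0, 1.0), fake(-2.0, 1.0);
    const int n = 200000;
    double ep = 0.0, eq = 0.0;
    for (int i = 0; i < n; ++i) { ep += real(rng); eq += fake(rng); }
    ep /= n; eq /= n;
    cout << "Dual estimate E_P[f] - E_Q[f] ≈ " << (ep - eq)
         << " (primal W1 = 4 for this shift)\n";
    return 0;
}
```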
Gradient Penalty

$$L_{GP} = \lambda \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right], \quad \hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \; \epsilon \sim U[0, 1]$$

Explanation: This regularizer encourages the critic’s gradient norm to be 1 on samples interpolated between real and fake points, approximately enforcing the Lipschitz constraint.
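Here is a minimal sketch evaluating the penalty term itself for a hypothetical smooth critic f(x) = tanh(wx), whose input gradient is available in closed form; a real WGAN-GP would obtain this gradient by backpropagation through the critic network:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Hypothetical smooth critic for illustration: f(x) = tanh(w * x).
// Its input gradient f'(x) = w * (1 - tanh(w x)^2) is analytic, so no
// autodiff is needed for this sketch.
double critic_grad(double x, double w) { double t = tanh(w * x); return w * (1.0 - t * t); }

int main() {
    mt19937 rng(5);
    normal_distribution<double> real(2.0, 1.0), fake(-2.0, 1.0);
    uniform_real_distribution<double> unif(0.0, 1.0);

    const double w = 3.0;       // assumed critic parameter, for illustration
    const double lambda = 10.0; // penalty weight commonly used in WGAN-GP
    const int batch = 10000;

    double penalty = 0.0;
    for (int i = 0; i < batch; ++i) {
        double eps = unif(rng);
        // Interpolate between a real and a fake sample
        double x_hat = eps * real(rng) + (1.0 - eps) * fake(rng);
        double g = critic_grad(x_hat, w);
        penalty += (fabs(g) - 1.0) * (fabs(g) - 1.0);
    }
    penalty = lambda * penalty / batch;
    cout << "Gradient penalty term ≈ " << penalty << "\n";
    // In WGAN-GP this term is added to the critic loss, pushing the critic's
    // input-gradient norm toward 1 on points between the two distributions.
    return 0;
}
```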
Spectral Normalization

$$W_{SN} = \frac{W}{\sigma(W)}, \quad \sigma(W) = \max_{h \neq 0} \frac{\lVert W h \rVert_2}{\lVert h \rVert_2}$$

Explanation: Dividing a weight matrix by its largest singular value σ(W) controls the layer’s Lipschitz constant, aiding stability in WGANs and classic GANs alike.
f-divergence

$$D_f(P \,\|\, Q) = \int q(x) \, f\!\left(\frac{p(x)}{q(x)}\right) dx, \quad f \text{ convex}, \; f(1) = 0$$

Explanation: A general divergence family parameterized by a convex function f. Many GAN variants can be viewed as minimizing an f-divergence under appropriate discriminator choices.
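A minimal sketch showing how different choices of f recover familiar divergences for discrete distributions (the example probabilities are arbitrary):

```cpp
#include <bits/stdc++.h>
using namespace std;

// D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)) for discrete P, Q
double f_divergence(const vector<double>& p, const vector<double>& q,
                    const function<double(double)>& f) {
    double s = 0.0;
    for (size_t i = 0; i < p.size(); ++i)
        if (q[i] > 0.0) s += q[i] * f(p[i] / q[i]);
    return s;
}

int main() {
    vector<double> p = {0.7, 0.2, 0.1};   // arbitrary example distribution
    vector<double> q = {0.1, 0.2, 0.7};   // arbitrary example distribution
    // f(t) = t log t  ->  KL(P || Q);   f(t) = -log t  ->  KL(Q || P)
    auto kl_f  = [](double t) { return t > 0 ? t * log(t) : 0.0; };
    auto rkl_f = [](double t) { return -log(max(t, 1e-12)); };
    cout << "KL(P||Q) = " << f_divergence(p, q, kl_f)  << "\n";
    cout << "KL(Q||P) = " << f_divergence(p, q, rkl_f) << "\n";
    return 0;
}
```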
Complexity Analysis

For the toy examples below, each training step costs O(batch_size) arithmetic operations, so a full run is O(steps × batch_size); the WGAN variant multiplies the critic cost by critic_steps (5 here). Power iteration for spectral normalization costs O(iters × m × n) for an m×n weight matrix, i.e., a couple of matrix–vector products per iteration. In full-scale GANs, the per-minibatch cost is dominated by the forward and backward passes through the generator and discriminator/critic networks; WGAN’s multiple critic steps and the gradient penalty’s extra backward pass add constant factors, while spectral normalization adds only one or two matrix–vector products per layer per step.
Code Examples
```cpp
#include <bits/stdc++.h>
using namespace std;

// Sigmoid helper
static inline double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

// Sample from N(mean, std^2)
double sample_normal(double mean, double stddev, mt19937 &rng) {
    normal_distribution<double> dist(mean, stddev);
    return dist(rng);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Real data distribution: x ~ N(mu_real, sigma^2)
    const double mu_real = 2.0;
    const double sigma = 1.0;

    // Generator: y = mu_g + z, with z ~ N(0, 1). Parameter: mu_g.
    double mu_g = -2.0; // initialize far from real mean

    // Discriminator: D(x) = sigmoid(a * x + b)
    double a = 0.0, b = 0.0; // parameters for discriminator

    // Hyperparameters
    const int steps = 800;       // training iterations
    const int batch_size = 256;  // Monte Carlo batch
    const double lr_d = 1e-2;    // discriminator learning rate (ascent)
    const double lr_g = 5e-3;    // generator learning rate (ascent on non-saturating objective)

    mt19937 rng(42);

    for (int t = 1; t <= steps; ++t) {
        // 1) Update discriminator by ascending its objective:
        //    E_real[log D(x)] + E_fake[log(1 - D(y))].
        double grad_a = 0.0, grad_b = 0.0;
        for (int i = 0; i < batch_size; ++i) {
            double x = sample_normal(mu_real, sigma, rng);
            double y = mu_g + sample_normal(0.0, 1.0, rng);
            double Dx = sigmoid(a * x + b);
            double Dy = sigmoid(a * y + b);
            // Gradients derived analytically:
            // d/da log D(x) = (1 - D(x)) * x;  d/db log D(x) = (1 - D(x))
            grad_a += (1.0 - Dx) * x;
            grad_b += (1.0 - Dx);
            // d/da log(1 - D(y)) = -D(y) * y;  d/db log(1 - D(y)) = -D(y)
            grad_a += -Dy * y;
            grad_b += -Dy;
        }
        grad_a /= batch_size;
        grad_b /= batch_size;
        // Ascent step for discriminator
        a += lr_d * grad_a;
        b += lr_d * grad_b;

        // 2) Update generator using non-saturating loss: maximize E[log D(G(z))]
        double grad_mu = 0.0;
        for (int i = 0; i < batch_size; ++i) {
            double z = sample_normal(0.0, 1.0, rng);
            double y = mu_g + z;
            double Dy = sigmoid(a * y + b);
            // d/dmu log D(y) = a * (1 - D(y))
            grad_mu += a * (1.0 - Dy);
        }
        grad_mu /= batch_size;
        // Ascent on generator objective
        mu_g += lr_g * grad_mu;

        if (t % 50 == 0) {
            // Report training status: estimate objectives via Monte Carlo
            double d_obj = 0.0, g_obj = 0.0;
            int eval_bs = 1000;
            for (int i = 0; i < eval_bs; ++i) {
                double x = sample_normal(mu_real, sigma, rng);
                double y = mu_g + sample_normal(0.0, 1.0, rng);
                double Dx = sigmoid(a * x + b);
                double Dy = sigmoid(a * y + b);
                d_obj += log(max(1e-12, Dx)) + log(max(1e-12, 1.0 - Dy));
                g_obj += log(max(1e-12, Dy));
            }
            d_obj /= eval_bs;
            g_obj /= eval_bs;
            cout << "Step " << t
                 << " | mu_g= " << fixed << setprecision(4) << mu_g
                 << " | a= " << a << " b= " << b
                 << " | D-obj= " << d_obj << " | G-obj(ns)= " << g_obj << "\n";
        }
    }

    cout << "\nFinal generator mean mu_g ≈ " << fixed << setprecision(4) << mu_g
         << " (target mu_real = " << mu_real << ")\n";
    return 0;
}
```
This standalone C++ program trains a toy 1D GAN. The real data are N(2, 1). The generator is a shifted Gaussian with learnable mean μ_g. The discriminator is logistic regression D(x) = σ(ax + b). We alternate: ascend the discriminator objective and ascend the non-saturating generator objective E[log D(G(z))]. Analytical gradients avoid any deep learning library. You should see μ_g move toward the real mean, illustrating adversarial learning and the benefit of the non-saturating loss.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Sample from N(mean, std^2)
double sample_normal(double mean, double stddev, mt19937 &rng) {
    normal_distribution<double> dist(mean, stddev);
    return dist(rng);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Real distribution: x ~ N(mu_real, 1)
    const double mu_real = 3.0;

    // Generator: y = mu_g + z, z ~ N(0,1)
    double mu_g = -3.0;

    // Critic: f(x) = a * x (bias omitted since it cancels in expectation)
    double a = 0.0;

    // Hyperparameters
    const int steps = 600;
    const int critic_steps = 5;   // number of critic updates per generator update
    const int batch_size = 256;
    const double lr_c = 5e-3;     // critic learning rate (ascent)
    const double lr_g = 5e-3;     // generator learning rate (descent on -E[f(G(z))])
    const double clip = 1.0;      // weight clipping for Lipschitz (|a| <= clip)

    mt19937 rng(123);

    for (int t = 1; t <= steps; ++t) {
        for (int cs = 0; cs < critic_steps; ++cs) {
            // Maximize: E[f(x_real)] - E[f(x_fake)]
            double grad_a = 0.0;
            for (int i = 0; i < batch_size; ++i) {
                double xr = sample_normal(mu_real, 1.0, rng);
                double xf = mu_g + sample_normal(0.0, 1.0, rng);
                grad_a += xr - xf; // df/da accumulates x for real and -x for fake
            }
            grad_a /= batch_size;
            a += lr_c * grad_a; // ascent step
            // Enforce 1-Lipschitz by clipping
            a = max(-clip, min(clip, a));
        }

        // Generator update: minimize -E[f(G(z))]  ≡  maximize E[f(G(z))]
        double grad_mu = 0.0;
        for (int i = 0; i < batch_size; ++i) {
            double z = sample_normal(0.0, 1.0, rng);
            double y = mu_g + z;
            // f(y) = a * y, so d/dmu E[f(y)] = a * d/dmu E[y] = a
            grad_mu += a; // since dy/dmu = 1
        }
        grad_mu /= batch_size;
        // We maximize E[f(G(z))], so ascend mu_g by lr_g * grad_mu
        mu_g += lr_g * grad_mu;

        if (t % 30 == 0) {
            // Estimate Wasserstein gap E[f(real)] - E[f(fake)]
            int eval_bs = 2000;
            double erf = 0.0, eff = 0.0;
            for (int i = 0; i < eval_bs; ++i) {
                double xr = sample_normal(mu_real, 1.0, rng);
                double xf = mu_g + sample_normal(0.0, 1.0, rng);
                erf += a * xr;
                eff += a * xf;
            }
            erf /= eval_bs; eff /= eval_bs;
            cout << "Step " << t
                 << " | mu_g= " << fixed << setprecision(4) << mu_g
                 << " | a= " << a
                 << " | W_est ≈ " << (erf - eff) << "\n";
        }
    }

    cout << "\nFinal mu_g ≈ " << fixed << setprecision(4) << mu_g
         << " (target mu_real = " << 3.0 << ")\n";
    return 0;
}
```
This C++ program implements a 1D WGAN. The critic is linear f(x) = a x, trained with weight clipping to enforce the 1-Lipschitz constraint. The critic maximizes E[f(real)] − E[f(fake)], and the generator maximizes E[f(fake)] (equivalently minimizes −E[f(fake)]). The generator mean moves toward the real mean, and the estimated Wasserstein gap decreases as distributions align. This illustrates why WGAN provides stable gradients even when real and fake distributions are initially far apart.
```cpp
#include <bits/stdc++.h>
using namespace std;

// Simple matrix utilities for spectral norm estimation via power iteration
struct Matrix {
    int m, n;          // m rows, n cols
    vector<double> a;  // row-major, size m*n
    Matrix(int m_, int n_) : m(m_), n(n_), a(m_ * n_, 0.0) {}
    double &operator()(int i, int j) { return a[i * n + j]; }
    double operator()(int i, int j) const { return a[i * n + j]; }
};

vector<double> matvec(const Matrix &W, const vector<double> &x) {
    vector<double> y(W.m, 0.0);
    for (int i = 0; i < W.m; ++i) {
        double s = 0.0;
        for (int j = 0; j < W.n; ++j) s += W(i, j) * x[j];
        y[i] = s;
    }
    return y;
}

vector<double> matTvec(const Matrix &W, const vector<double> &x) {
    vector<double> y(W.n, 0.0);
    for (int j = 0; j < W.n; ++j) {
        double s = 0.0;
        for (int i = 0; i < W.m; ++i) s += W(i, j) * x[i];
        y[j] = s;
    }
    return y;
}

double norm2(const vector<double> &x) {
    double s = 0.0; for (double v : x) s += v * v; return sqrt(max(1e-18, s));
}

// Estimate largest singular value sigma_max(W) via power iteration
double spectral_norm(const Matrix &W, int iters = 20) {
    std::mt19937 rng(2024);
    std::uniform_real_distribution<double> unif(-1.0, 1.0);
    vector<double> u(W.m), v(W.n);
    for (int i = 0; i < W.m; ++i) u[i] = unif(rng);
    double nu = norm2(u); for (double &x : u) x /= nu;
    for (int t = 0; t < iters; ++t) {
        // v <- (W^T u) / ||W^T u||
        v = matTvec(W, u);
        double nv = norm2(v); for (double &x : v) x /= nv;
        // u <- (W v) / ||W v||
        u = matvec(W, v);
        nu = norm2(u); for (double &x : u) x /= nu;
    }
    // Rayleigh quotient approximation: sigma ≈ ||W v||
    vector<double> Wv = matvec(W, v);
    return norm2(Wv);
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create a random 3x3 weight matrix
    Matrix W(3, 3);
    mt19937 rng(7);
    normal_distribution<double> nd(0.0, 1.0);
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            W(i, j) = nd(rng);

    double sigma = spectral_norm(W, 30);
    cout << "Estimated spectral norm (sigma_max) = " << fixed << setprecision(6) << sigma << "\n";

    // Apply spectral normalization: W_sn = W / sigma
    Matrix W_sn = W;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            W_sn(i, j) /= sigma;

    // Verify: spectral norm of W_sn should be close to 1
    double sigma_sn = spectral_norm(W_sn, 30);
    cout << "After spectral normalization, sigma_max ≈ " << sigma_sn << "\n";

    return 0;
}
```
This program estimates the largest singular value of a weight matrix using power iteration and applies spectral normalization (divide by σ_max). Spectral normalization controls a layer’s Lipschitz constant, which is crucial in WGANs and generally stabilizes GAN training by preventing excessively steep discriminators/critics.