How I Study AI - Learn AI Papers & Lectures the Easy Way
Theory Ā· Intermediate

Loss Landscape Analysis

Key Points

  • A loss landscape is the ā€œterrainā€ of a model’s loss as you move through parameter space; valleys are good solutions and peaks are bad ones.
  • We analyze landscapes using low-dimensional (1D and 2D) slices around a trained point to visualize curvature, flatness, and saddles.
  • Flat minima often correlate with better generalization; sharp minima can fit training data but fail on unseen data.
  • Practical analysis includes grid evaluations, line searches, gradient/Hessian-based curvature estimates, and random-direction projections.
  • Correct scaling and normalization of directions are crucial so that axes are comparable and plots are meaningful.
  • Loss landscapes are high-dimensional, so any 2D plot is only a projection and can hide or exaggerate features.
  • You can prototype analysis in C++ by evaluating loss on grids and along directions and by estimating top Hessian eigenvalues.
  • Compute and export CSV data from C++ and plot with external tools (Python/Matplotlib, gnuplot) for clear visualizations.

Prerequisites

  • Multivariable calculus (gradients and Hessians) — Loss landscape geometry relies on understanding derivatives and curvature.
  • Linear algebra (vectors, matrices, eigenvalues) — Directions, orthonormalization, and Hessian eigen-analysis require these tools.
  • Optimization basics (gradient descent) — Landscape analysis interprets and diagnoses the behavior of optimizers.
  • Probability and statistics — Loss functions are empirical risks; noise and generalization require statistical thinking.
  • Logistic and linear regression — Provide simple, convex examples where landscapes can be computed exactly.
  • Numerical stability and scaling — Normalization of directions and safe evaluations prevent misleading plots.
  • Data handling and file I/O in C++ — We export CSV grids from C++ for visualization.
  • Plotting tools (external) — Visualizing the exported CSV with heatmaps/contours completes the analysis.

Detailed Explanation


01 Overview

Hook: Imagine hiking in fog across unknown mountains. Your goal is to find the lowest valley, but you can only feel the slope beneath your feet. That’s what optimization feels like when training machine learning models. Concept: The loss landscape is the function that maps every possible setting of a model’s parameters to a number measuring how badly the model performs. Visualizing or analyzing this surface helps us understand why optimization methods succeed or get stuck, and why some solutions generalize better than others. Because modern models have millions of parameters, the full surface is impossible to draw, but we can still study meaningful 1D and 2D slices, curvature, and local geometry. Example: After training a logistic regression classifier, we can pick two random directions in parameter space and evaluate the loss on a small grid around the found parameters. The resulting heatmap might reveal a wide, flat basin (good) or a sharp pit (risky for generalization).

02 Intuition & Analogies

Hook: Picture a drone scanning a landscape: smooth rolling hills are easy to traverse, while jagged cliffs and narrow ravines are dangerous and unpredictable. Training a model is like flying that drone using only local slope (the gradient) information. Concept: Flat valleys in the terrain mean the loss doesn’t change much if you move a little—so the solution is robust to small perturbations, often implying better generalization. Sharp valleys (or spikes) mean tiny moves can worsen performance drastically; they usually arise from overfitting or overly aggressive training. Saddles are like passes between mountains: the slope is zero, but you can go down in some directions and up in others, confusing simple optimizers. Because we can’t see in 1,000,000 dimensions, we use projections: walk along one direction (1D line) or a plane spanned by two directions (2D) and record the loss. Normalizing these directions is like choosing consistent step sizes so that the map scale is fair. Example: Take parameters Īø at the end of training. Choose a random unit vector d and plot loss versus α in L(Īø + αd). If the curve is shallow near α=0, the minimum is flat; if it spikes quickly, the minimum is sharp. Repeat with two orthonormal directions d1 and d2 to get a 2D heatmap that looks like a basin (flat) or a crater (sharp).

03 Formal Definition

Hook: We can move from pictures to precise math by defining the landscape and its local geometry. Concept: Let L: ℝ^p → ℝ map parameters Īø ∈ ℝ^p to a scalar loss. The gradient āˆ‡L(Īø) indicates the steepest ascent, and the Hessian H(Īø) = āˆ‡Ā²L(Īø) describes local curvature. Critical points satisfy āˆ‡L(Īø*) = 0 and are categorized by Hessian eigenvalues: all positive (local minimum), all negative (local maximum), or mixed signs (saddle). For visualization, define 1D and 2D slices: f(α) = L(Īøā‚€ + α d) and g(α, β) = L(Īøā‚€ + α d₁ + β dā‚‚), where d, d₁, dā‚‚ are normalized directions (often orthonormal). Example: Using a second-order Taylor expansion, L(Īøā‚€ + Ī“) ā‰ˆ L(Īøā‚€) + āˆ‡L(Īøā‚€)^⊤ Ī“ + (1/2) Γ^⊤ H(Īøā‚€) Ī“. Near a minimum where āˆ‡L ā‰ˆ 0, curvature is dominated by the quadratic term and governed by the Hessian eigenvalues; the largest eigenvalue approximates the sharpest ascent of the loss.

04 When to Use

Hook: If training stalls, overfits, or behaves erratically, the shape of the landscape can reveal why. Concept: Use loss landscape analysis to diagnose optimization issues (plateaus, saddles), compare optimizers (SGD vs. Adam), tune regularization and learning rates, and assess robustness. It’s also useful in research to understand why certain architectures (e.g., residual networks) are easier to optimize and why flat minima often generalize better. Visual patterns—wide basins vs. needle-like pits—inform model and hyperparameter choices. Example:
  • During hyperparameter tuning, plot 1D loss curves along random directions around the final solution; if curves are spiky, lower the learning rate or increase weight decay.
  • When comparing two checkpoints, draw a linear interpolation curve L((1āˆ’t)Īø_A + tĪø_B) to see if the path is barrier-free (compatible minima) or has a peak (different basins).
  • For a small logistic regression or MLP, compute the top Hessian eigenvalue to quantify sharpness numerically.

āš ļøCommon Mistakes

Hook: Pretty plots can mislead if produced carelessly. Concept: Common pitfalls include (1) unnormalized directions, which distort axes and exaggerate curvature; (2) grids too coarse to capture narrow structures; (3) relying solely on training loss and ignoring validation loss; (4) interpreting a single 2D slice as the whole story; (5) stochastic noise from mini-batches masking true geometry; and (6) parameter symmetries (like neuron permutations or scale invariances) that create deceptive flatness or ridges. Example:
  • If you plot g(α, β) without orthonormalizing d₁ and dā‚‚, the heatmap may look skewed, falsely suggesting anisotropy.
  • Using a small batch for evaluation adds noise; recompute loss on the full dataset for clean plots.
  • Two networks with BatchNorm have scale invariances; compare landscapes only after applying appropriate normalization so axes correspond to meaningful perturbation sizes.

Key Formulas

Empirical risk

L(Īø) = (1/n) Ī£_{i=1}^{n} ℓ(f_Īø(x_i), y_i)

Explanation: This defines the loss as the average of per-example losses over the training set. It is the primary surface we study during optimization.

Gradient and Hessian

āˆ‡L(Īø) = [āˆ‚L/āˆ‚Īø_1, …, āˆ‚L/āˆ‚Īø_p]^⊤,  H(Īø) = āˆ‡Ā²L(Īø)

Explanation: The gradient gives the direction of steepest increase of loss, while the Hessian encodes curvature in all directions. They determine local geometry near any point.

Second-order Taylor approximation

L(Īø + Ī“) ā‰ˆ L(Īø) + āˆ‡L(Īø)^⊤ Ī“ + (1/2) Γ^⊤ H(Īø) Ī“

Explanation: Near a point, loss changes are approximated by a linear term plus a quadratic curvature term. At minima where the gradient is small, curvature dominates local behavior.

1D/2D slices

f(α) = L(Īøā‚€ + α d),  g(α, β) = L(Īøā‚€ + α d₁ + β dā‚‚)

Explanation: These restrict the high-dimensional surface to a line or plane using normalized directions. They make visualization and interpretation feasible.

Spectral sharpness

Ī»_max(H(Īø)) = max_{∄u∄₂ = 1} u^⊤ H(Īø) u

Explanation: The largest eigenvalue of the Hessian equals the maximum quadratic curvature along any unit direction. It is a principled measure of sharpness.

ε-sharpness

S_ε(Īø) = max_{∄Γ∄₂ ≤ ε} (L(Īø + Ī“) āˆ’ L(Īø)) ā‰ˆ (1/2) ε² Ī»_max(H(Īø))

Explanation: The worst-case loss increase within a small ball scales with the top Hessian eigenvalue. This connects curvature to robustness of a solution.

Gram–Schmidt for directions

dĢƒā‚ = d₁ / ∄dā‚ā€–ā‚‚,  dĢƒā‚‚ = (dā‚‚ āˆ’ (dĢƒā‚^⊤ dā‚‚) dĢƒā‚) / ∄dā‚‚ āˆ’ (dĢƒā‚^⊤ dā‚‚) dĢƒā‚ā€–ā‚‚

Explanation: This orthonormalizes two random directions so that axes in the 2D slice are perpendicular and scaled equally. It avoids distortions in the heatmap.

Linear regression MSE

ℓ_MSE(Īø) = (1/n) Ī£_{i=1}^{n} (y_i āˆ’ (w x_i + b))²

Explanation: For 1D inputs, the loss is the mean squared error between true targets and line predictions. Its landscape over slope and intercept is a convex bowl.

Logistic loss

ℓ_logistic(Īø) = (1/n) Ī£_{i=1}^{n} [āˆ’y_i log σ(z_i) āˆ’ (1 āˆ’ y_i) log(1 āˆ’ σ(z_i))],  z_i = w^⊤ x_i + b

Explanation: Binary cross-entropy for logistic regression defines a convex surface in parameters. It is useful for illustrating slices and curvature analytically.

Logistic regression Hessian

H_logistic(Īø) = (1/n) Ī£_{i=1}^{n} σ(z_i)(1 āˆ’ σ(z_i)) x_i x_i^⊤

Explanation: The Hessian equals a data-weighted covariance where weights are p(1-p). It is positive semi-definite, making the loss convex.

Complexity Analysis

Evaluating a loss landscape typically multiplies the base cost of computing the loss by the number of probe points. If computing the loss on n examples with p parameters costs O(nĀ·p) (common for linear and logistic models with dense features), then a 2D grid with G Ɨ G points costs O(G² Ā· n Ā· p). Memory is usually dominated by storing the data, O(nĀ·p_data), and the grid of results, O(G²). When p is large, even forming gradients or Hessians exactly may be expensive: gradients are O(nĀ·p), while full Hessians are O(nĀ·p²) to compute and O(p²) to store, which is infeasible for modern deep networks. In such cases, we prefer Hessian–vector products or finite-difference approximations that avoid materializing H explicitly, reducing the cost per product to roughly the cost of a gradient, O(nĀ·p). For 1D slices with R sample points, the cost is O(R Ā· n Ā· p), which scales linearly in the number of evaluations and is thus tractable for moderate R (e.g., 100–200). Power iteration to estimate the largest Hessian eigenvalue requires K iterations, each dominated by an HĀ·v computation. For logistic regression with an explicit Hessian, this is O(nĀ·p + p²) per iteration (forming weighted design products), yielding O(KĀ·(nĀ·p + p²)) in total; with implicit Hessian–vector products, it becomes O(KĀ·nĀ·p) with only O(p) memory. In practice, choose grid sizes and iteration counts to balance resolution and runtime (e.g., G in [51, 101], K in [20, 100]).

Code Examples

2D loss surface for linear regression (slope vs. intercept) and CSV export
#include <bits/stdc++.h>
using namespace std;

// Compute Mean Squared Error for y = w*x + b on dataset (x[i], y[i]).
double mse_loss(const vector<double>& x, const vector<double>& y, double w, double b) {
    const int n = (int)x.size();
    double sumsq = 0.0;
    for (int i = 0; i < n; ++i) {
        double pred = w * x[i] + b;
        double diff = y[i] - pred;
        sumsq += diff * diff;
    }
    return sumsq / n;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Generate synthetic linear data y = 3x + 2 + noise
    int n = 200;
    vector<double> x(n), y(n);
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> ux(-2.0, 2.0);
    std::normal_distribution<double> noise(0.0, 0.4);
    for (int i = 0; i < n; ++i) {
        x[i] = ux(rng);
        y[i] = 3.0 * x[i] + 2.0 + noise(rng);
    }

    // 2) Define a grid over (w, b)
    int G = 101;                     // grid resolution per axis
    double w_min = 0.0, w_max = 6.0;
    double b_min = -1.0, b_max = 5.0;

    // 3) Evaluate loss on grid and write CSV for plotting
    ofstream out("linear_surface.csv");
    out << "w,b,loss\n";

    double best_w = 0, best_b = 0, best_loss = numeric_limits<double>::infinity();

    for (int i = 0; i < G; ++i) {
        double w = w_min + (w_max - w_min) * i / (G - 1);
        for (int j = 0; j < G; ++j) {
            double b = b_min + (b_max - b_min) * j / (G - 1);
            double L = mse_loss(x, y, w, b);
            out << w << "," << b << "," << L << "\n";
            if (L < best_loss) {
                best_loss = L; best_w = w; best_b = b;
            }
        }
    }
    out.close();

    cerr << fixed << setprecision(6);
    cerr << "Best on grid: w=" << best_w << ", b=" << best_b << ", loss=" << best_loss << "\n";
    cerr << "CSV written to linear_surface.csv (columns: w,b,loss). Plot with your preferred tool." << "\n";

    return 0;
}

This program creates a simple 1D regression dataset and evaluates the mean squared error on a 2D grid of slope (w) and intercept (b). It writes a CSV file for external visualization, revealing a convex bowl-shaped surface. The grid minimum approximates the optimal least-squares parameters.

Time: O(G² Ā· n) for grid evaluation (each loss is O(n)). Space: O(n) to store the dataset plus O(1) for streaming CSV (no grid kept in memory).
2D slice of logistic regression loss around a trained solution using orthonormal random directions
#include <bits/stdc++.h>
using namespace std;

struct Example { vector<double> x; int y; };

// Sigmoid function
static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

// Compute logistic (binary cross-entropy) loss given parameters theta = [w1, w2, ..., b]
double logistic_loss(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1; // last element is the bias b
    double loss = 0.0;
    for (const auto& e : data) {
        double z = theta.back(); // bias
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        p1 = min(max(p1, 1e-12), 1.0 - 1e-12); // numeric safety
        loss += -(e.y ? log(p1) : log(1.0 - p1));
    }
    return loss / data.size();
}

// Gradient of logistic loss
vector<double> logistic_grad(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1;
    vector<double> g(p + 1, 0.0);
    for (const auto& e : data) {
        double z = theta.back();
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        double diff = p1 - e.y; // derivative of BCE wrt z
        for (int j = 0; j < p; ++j) g[j] += diff * e.x[j];
        g.back() += diff; // bias term
    }
    for (double& v : g) v /= data.size();
    return g;
}

// Train logistic regression via gradient descent
vector<double> train_logreg(const vector<Example>& data, int p, int iters = 2000, double lr = 0.5) {
    vector<double> theta(p + 1, 0.0); // initialize to zeros
    for (int t = 0; t < iters; ++t) {
        vector<double> g = logistic_grad(data, theta);
        double eta = lr / sqrt(1.0 + t * 0.01); // mild decay
        for (int j = 0; j <= p; ++j) theta[j] -= eta * g[j];
    }
    return theta;
}

// Orthonormalize two random directions using Gram-Schmidt
pair<vector<double>, vector<double>> two_orthonormal_directions(int dim, std::mt19937& rng) {
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> d1(dim), d2(dim);
    for (int i = 0; i < dim; ++i) { d1[i] = N(rng); d2[i] = N(rng); }
    auto norm = [](const vector<double>& v) { double s = 0; for (double x : v) s += x * x; return sqrt(max(1e-18, s)); };
    auto dot = [](const vector<double>& a, const vector<double>& b) { double s = 0; for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i]; return s; };

    // Normalize d1
    double n1 = norm(d1);
    for (double& x : d1) x /= n1;
    // Make d2 orthogonal to d1
    double proj = dot(d2, d1);
    for (int i = 0; i < dim; ++i) d2[i] -= proj * d1[i];
    // Normalize d2
    double n2 = norm(d2);
    for (double& x : d2) x /= n2;
    return {d1, d2};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // 1) Create a separable 2D dataset (two Gaussian blobs)
    std::mt19937 rng(123);
    int n_per_class = 200;
    vector<Example> data;
    data.reserve(2 * n_per_class);
    normal_distribution<double> A1( 2.0, 1.0), A2( 2.0, 1.0);
    normal_distribution<double> B1(-2.0, 1.0), B2(-2.0, 1.0);
    for (int i = 0; i < n_per_class; ++i) {
        data.push_back({{A1(rng), A2(rng)}, 1});
        data.push_back({{B1(rng), B2(rng)}, 0});
    }

    // 2) Train logistic regression (parameters: w1, w2, b)
    int p = 2;
    vector<double> theta = train_logreg(data, p, 1500, 0.4);
    cerr << fixed << setprecision(6);
    cerr << "Trained loss = " << logistic_loss(data, theta) << "\n";

    // 3) Build two orthonormal directions in R^3 around theta
    auto [d1, d2] = two_orthonormal_directions(p + 1, rng);

    // 4) Evaluate 2D slice g(alpha, beta) = L(theta + alpha*d1 + beta*d2)
    int G = 101;    // grid resolution per axis
    double r = 1.5; // radius along each direction
    ofstream out("logreg_slice.csv");
    out << "alpha,beta,loss\n";
    for (int i = 0; i < G; ++i) {
        double alpha = -r + 2 * r * i / (G - 1);
        for (int j = 0; j < G; ++j) {
            double beta = -r + 2 * r * j / (G - 1);
            vector<double> th = theta;
            for (int k = 0; k <= p; ++k) th[k] += alpha * d1[k] + beta * d2[k];
            double L = logistic_loss(data, th);
            out << alpha << "," << beta << "," << L << "\n";
        }
    }
    out.close();
    cerr << "2D slice written to logreg_slice.csv (columns: alpha,beta,loss)." << "\n";

    return 0;
}

We generate two Gaussian clusters, train a 2D logistic regression, and then probe the landscape by evaluating a 2D slice around the trained parameters along two orthonormal directions. The resulting CSV can be plotted as a heatmap or contour plot, revealing local flatness or sharpness of the found solution.

Time: Training O(iters Ā· n Ā· p); slice O(G² Ā· n Ā· p). Space: O(n Ā· p) to store data, O(p) for parameters and directions.
Estimating sharpness via top Hessian eigenvalue (logistic regression) using power iteration
#include <bits/stdc++.h>
using namespace std;

struct Example { vector<double> x; int y; };

static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

double logistic_loss(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1;
    double loss = 0.0;
    for (const auto& e : data) {
        double z = theta.back();
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        p1 = min(max(p1, 1e-12), 1.0 - 1e-12);
        loss += -(e.y ? log(p1) : log(1.0 - p1));
    }
    return loss / data.size();
}

// Compute the Hessian for logistic regression explicitly:
// H = (1/n) sum p(1-p) a a^T, where a = [x; 1] includes the bias as an extra feature.
vector<vector<double>> logistic_hessian(const vector<Example>& data, const vector<double>& theta) {
    int p = (int)theta.size() - 1; // number of features (bias excluded)
    int d = p + 1;                 // augmented dimension including bias
    vector<vector<double>> H(d, vector<double>(d, 0.0));
    for (const auto& e : data) {
        // augmented feature a = [x; 1]
        vector<double> a(d);
        for (int j = 0; j < p; ++j) a[j] = e.x[j];
        a[p] = 1.0;
        double z = theta[p];
        for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
        double p1 = sigmoid(z);
        double w = p1 * (1.0 - p1);
        for (int i = 0; i < d; ++i)
            for (int j = 0; j < d; ++j)
                H[i][j] += w * a[i] * a[j];
    }
    double invn = 1.0 / data.size();
    for (auto& row : H) for (double& x : row) x *= invn;
    return H;
}

// Power iteration to estimate the largest eigenvalue of a symmetric matrix H
pair<double, vector<double>> top_eigenpair(const vector<vector<double>>& H, int iters = 100) {
    int d = (int)H.size();
    std::mt19937 rng(777);
    std::normal_distribution<double> N(0.0, 1.0);
    vector<double> v(d);
    for (int i = 0; i < d; ++i) v[i] = N(rng);
    auto norm2 = [](const vector<double>& x) { double s = 0; for (double v : x) s += v * v; return sqrt(max(1e-18, s)); };

    for (int t = 0; t < iters; ++t) {
        vector<double> Hv(d, 0.0);
        for (int i = 0; i < d; ++i)
            for (int j = 0; j < d; ++j) Hv[i] += H[i][j] * v[j];
        double nrm = norm2(Hv);
        for (int i = 0; i < d; ++i) v[i] = Hv[i] / nrm;
    }
    // Rayleigh quotient as eigenvalue estimate
    double num = 0.0, den = 0.0;
    vector<double> Hv(d, 0.0);
    for (int i = 0; i < d; ++i)
        for (int j = 0; j < d; ++j) Hv[i] += H[i][j] * v[j];
    for (int i = 0; i < d; ++i) { num += v[i] * Hv[i]; den += v[i] * v[i]; }
    double lambda = num / den;
    return {lambda, v};
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create a simple two-blob dataset
    std::mt19937 rng(42);
    int n = 400;
    normal_distribution<double> A1( 2.0, 1.0), A2( 2.0, 1.0);
    normal_distribution<double> B1(-2.0, 1.0), B2(-2.0, 1.0);
    vector<Example> data;
    data.reserve(n);
    for (int i = 0; i < n / 2; ++i) data.push_back({{A1(rng), A2(rng)}, 1});
    for (int i = 0; i < n / 2; ++i) data.push_back({{B1(rng), B2(rng)}, 0});

    // Train logistic regression quickly (a few hundred iterations suffice)
    auto grad = [&](const vector<double>& theta) {
        int p = (int)theta.size() - 1;
        vector<double> g(p + 1, 0.0);
        for (const auto& e : data) {
            double z = theta[p];
            for (int j = 0; j < p; ++j) z += theta[j] * e.x[j];
            double p1 = 1.0 / (1.0 + exp(-z));
            double d = p1 - e.y;
            for (int j = 0; j < p; ++j) g[j] += d * e.x[j];
            g[p] += d;
        }
        for (double& v : g) v /= data.size();
        return g;
    };

    int p = 2;
    vector<double> theta(p + 1, 0.0);
    double lr = 0.4;
    int iters = 800;
    for (int t = 0; t < iters; ++t) {
        auto g = grad(theta);
        double eta = lr / sqrt(1.0 + t * 0.01);
        for (int j = 0; j <= p; ++j) theta[j] -= eta * g[j];
    }

    // Compute the Hessian and estimate sharpness
    auto H = logistic_hessian(data, theta);
    auto [lambda_max, v] = top_eigenpair(H, 100);

    cerr << fixed << setprecision(6);
    cerr << "Trained loss = " << logistic_loss(data, theta) << "\n";
    cerr << "Estimated top Hessian eigenvalue (sharpness) = " << lambda_max << "\n";

    // Optional: predict epsilon-sharpness via the quadratic approximation
    double eps = 0.5;
    double s_eps = 0.5 * eps * eps * lambda_max;
    cerr << "Approx. epsilon-sharpness for eps=0.5 is " << s_eps << "\n";

    return 0;
}

This program trains a small logistic regression model, forms its exact Hessian (feasible at small dimension), and estimates the largest eigenvalue using power iteration. The top eigenvalue quantifies local curvature (sharpness) and predicts the worst-case loss increase within a small radius via the quadratic approximation.

Time: Training O(iters Ā· n Ā· p); Hessian O(n Ā· p²); power iteration O(K Ā· p²). Space: O(n Ā· p) for data, O(p²) to store the Hessian.
Tags: loss landscape Ā· sharpness Ā· Hessian eigenvalues Ā· gradient descent Ā· logistic regression Ā· mean squared error Ā· 2D slice Ā· orthonormal directions Ā· Gram–Schmidt Ā· power iteration Ā· generalization Ā· flat minima Ā· saddle points Ā· CSV export Ā· visualization