Hessian Matrix
Key Points
- The Hessian matrix collects all second-order partial derivatives of a scalar function and measures local curvature.
- For twice continuously differentiable functions, the Hessian is symmetric by Clairaut's theorem.
- The quadratic term p^T H p in the second-order Taylor expansion predicts how fast the function bends along direction p.
- Positive definite Hessians indicate local minima, negative definite indicate local maxima, and indefinite indicate saddle points.
- In optimization, Newton's method uses the Hessian to choose curvature-aware steps that often converge very fast near the optimum.
- Numerically approximating a full n×n Hessian with central differences typically requires O(n^2) function evaluations.
- Eigenvalues of the Hessian describe principal curvatures; signs tell you the type of critical point and magnitudes tell you how sharp it is.
- Hessian-vector products can be computed without forming H explicitly, enabling scalable second-order methods.
Prerequisites
- Multivariable calculus (limits, partial derivatives, chain rule): The Hessian is built from second-order partial derivatives; understanding derivatives and continuity is essential.
- Linear algebra (vectors, matrices, eigenvalues): The Hessian is a symmetric matrix; interpreting curvature requires eigenvalues, quadratic forms, and solving linear systems.
- Numerical analysis and floating-point arithmetic: Finite-difference approximation accuracy and stability depend on step sizes and round-off behavior.
- Unconstrained optimization basics: Using Hessians in Newton/trust-region methods requires understanding descent directions and line search/trust-region globalization.
Detailed Explanation
01 Overview
The Hessian matrix is a square matrix of all second-order partial derivatives of a real-valued function. If you have a function f that maps points from n-dimensional space to real numbers, the Hessian summarizes how the slope (gradient) changes as you move in different directions. While the gradient tells you the direction of steepest ascent, the Hessian tells you how that direction itself curves: how the function bends. This makes the Hessian central to understanding local shape: whether you are in a valley (minimum), on a hilltop (maximum), or on a saddle. In optimization and machine learning, the Hessian guides powerful algorithms like Newton's method and quasi-Newton methods that exploit curvature to make faster progress than methods that only use gradients. In numerical analysis, one often approximates Hessians with finite differences when analytic derivatives are not available. The Hessian also appears in physics (stability analysis), computer vision (blob/ridge detection), and statistics (Fisher information approximations). Importantly, when f is smooth enough (C^2), the Hessian is symmetric, enabling efficient linear algebra and meaningful eigenvalue interpretations that connect directly to convexity, stability, and step-size selection.
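To make the overview concrete, here is a small worked example (the function is chosen purely for illustration):

```latex
f(x, y) = x^3 y + y^2,
\qquad
\nabla f(x, y) = \begin{pmatrix} 3x^2 y \\ x^3 + 2y \end{pmatrix},
\qquad
H(x, y) = \begin{pmatrix} 6xy & 3x^2 \\ 3x^2 & 2 \end{pmatrix}
```

The two mixed partials \partial^2 f / \partial x \partial y and \partial^2 f / \partial y \partial x both equal 3x^2, so H is symmetric, exactly as expected for a C^2 function.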
02 Intuition & Analogies
Imagine hiking on a landscape described by a height function f(x, y). The gradient tells you which way is uphill right now. But the gradient alone cannot tell you whether the terrain is gently sloping or sharply curved. The Hessian fills that gap: it's like a curvature compass that says how the slope changes as you step in any direction. If you step along some direction p, the term p^T H p tells you how quickly the slope increases or decreases: positive values mean the ground curves upward like a bowl, negative means it curves downward like a dome, and values of different signs in different directions indicate a saddle. Another analogy: think of driving on a road where the steering wheel angle is the gradient. The Hessian measures how much you need to turn the steering wheel more or less as you move, how rapidly the steering demand changes. In finance, consider the price of an option as a function of multiple risks; the Hessian entries quantify how sensitive the sensitivity is: the curvature of price with respect to pairs of inputs. In machine learning, loss functions over parameters form a high-dimensional landscape; the Hessian's eigenvalues show which directions are flat (small eigenvalues) or stiff (large ones). This helps decide step sizes and whether you might oscillate (negative curvature) if you move too aggressively. In short, while the gradient answers "which way?", the Hessian answers "how does that way change as you go?", letting you predict and control motion more precisely.
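The saddle in this analogy has a two-line algebraic counterpart (the function is an illustrative choice):

```latex
f(x, y) = x^2 - y^2, \qquad H = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}
p = (1, 0):\; p^T H p = 2 \;(\text{curves up}), \qquad
p = (0, 1):\; p^T H p = -2 \;(\text{curves down})
```

Because different directions give curvature of different signs, the origin is a saddle point; the quadratic form p^T H p is exactly the "curvature compass" reading along p.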
03 Formal Definition
Let f: R^n -> R be twice differentiable at an interior point x of its domain. The Hessian of f at x is the n×n matrix H(x) whose (i, j) entry is the second-order partial derivative \partial^2 f / (\partial x_i \partial x_j), often written H(x) = \nabla^2 f(x). When f is C^2 near x, Clairaut's theorem gives H(x) = H(x)^T, so the Hessian is symmetric and admits a real eigen-decomposition.
04 When to Use
Use the Hessian when you need to understand or exploit curvature. In unconstrained optimization, Newton's method and its variants leverage H to achieve quadratic convergence near a solution, drastically reducing iterations compared to gradient descent. In machine learning, analyzing Hessian eigenvalues helps diagnose sharp vs. flat minima, choose trust-region sizes, or precondition gradient methods. In statistics and econometrics, the observed Hessian (or its expectation, Fisher information) provides standard errors via curvature of the log-likelihood. In numerical PDEs and physics, Hessians arise in stability analysis: positive definiteness of an energy functional's Hessian signals stable equilibria. In computer vision, the Hessian of an image intensity function is used to detect ridges and blobs (e.g., via determinant or eigenvalues). Practically, compute or approximate Hessians when: (1) you need fast local convergence near a minimizer, (2) you must classify a critical point, (3) you require uncertainty quantification from curvature, or (4) you want to approximate function behavior by a quadratic model. When n is large, prefer Hessian-vector products (avoiding explicit H) or quasi-Newton updates (e.g., BFGS) that approximate curvature with lower cost.
Common Mistakes
- Ignoring symmetry: For C^2 functions, H should be symmetric; numerical approximations can break symmetry due to round-off. Symmetrize by averaging with its transpose.
- Bad finite-difference step sizes: Steps that are too small amplify floating-point noise; too large induce truncation error. A common heuristic is h_i = \sqrt{\epsilon} \max(1, |x_i|).
- Misclassifying critical points: Checking only the determinant in 2D is not enough when det = 0; you must consider eigenvalues or higher-order terms.
- Forgetting domain issues: Hessians assume interior points where partials exist; at kinks or constraints, second derivatives may not exist or be irrelevant (use subgradients or KKT conditions).
- Solving Newton steps without safeguards: If H is not positive definite, pure Newton can diverge. Use line search, damping, or modify H (e.g., add \lambda I) to ensure descent.
- Mixing units/scales: Highly scaled variables can cause ill-conditioned Hessians and unstable linear solves; rescale features or precondition.
- Assuming convexity from a single Hessian: Positive definiteness at one point does not imply global convexity unless H(x) \succeq 0 for all x.
- Overlooking Hessian-vector tricks: Forming H explicitly is O(n^2) memory; use Hv products with iterative solvers when n is large.
Key Formulas
Hessian Definition
H(x)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x), \quad i, j = 1, \dots, n
Explanation: Each entry is a second-order partial derivative. Arranging these entries yields the Hessian matrix H(x) that captures curvature.
Symmetry (Clairaut-Schwarz)
\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}
Explanation: If mixed partials are continuous, their order does not matter. This makes the Hessian symmetric and enables eigen-decomposition.
Second-Order Taylor Expansion
f(x + p) \approx f(x) + \nabla f(x)^T p + \frac{1}{2} p^T H(x) p
Explanation: This quadratic approximation predicts how f changes for small steps p. The Hessian's quadratic form is the curvature term.
Quadratic Form
p^T H p = \sum_{i=1}^{n} \sum_{j=1}^{n} H_{ij} p_i p_j
Explanation: This expands curvature along direction p in coordinates. The sign and magnitude reveal bowl/dome/flatness along p.
Convexity via Hessian
f \text{ convex} \iff H(x) \succeq 0 \text{ for all } x \iff p^T H(x) p \ge 0 \text{ for all } p \text{ and } x
Explanation: A twice-differentiable function is convex exactly when its Hessian is positive semidefinite everywhere. Nonnegativity of p^T H p for all p encodes this.
Newton Step
p = -H(x)^{-1} \nabla f(x), \quad \text{i.e. solve } H(x)\, p = -\nabla f(x)
Explanation: The Newton direction solves H p = -g. When H is positive definite near a minimizer, this step often gives rapid (quadratic) convergence.
Second Derivative (Diagonal) Finite Difference
\frac{\partial^2 f}{\partial x_i^2} \approx \frac{f(x + h e_i) - 2 f(x) + f(x - h e_i)}{h^2}
Explanation: A central difference approximation for the second derivative along axis i. Accuracy is second order in h.
Mixed Partial Finite Difference
\frac{\partial^2 f}{\partial x_i \partial x_j} \approx \frac{f(x + h_i e_i + h_j e_j) - f(x + h_i e_i - h_j e_j) - f(x - h_i e_i + h_j e_j) + f(x - h_i e_i - h_j e_j)}{4 h_i h_j}
Explanation: A symmetric stencil to approximate cross derivatives. Requires four function evaluations per (i,j) pair, yielding second-order accuracy.
2×2 Hessian Eigenvalues
\lambda_{1,2} = \frac{\mathrm{tr}(H) \pm \sqrt{\mathrm{tr}(H)^2 - 4 \det(H)}}{2}
Explanation: For 2D, eigenvalues depend on the trace and determinant. Their signs classify the critical point type.
Hessian-Vector Product
H v = \lim_{\epsilon \to 0} \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon} = \nabla \left( \nabla f(x)^T v \right)
Explanation: The directional derivative of the gradient equals H times v. This identity enables computing Hv without forming H explicitly.
Complexity Analysis
A central-difference Hessian costs 2n function evaluations for the n diagonal entries, 4 \cdot n(n-1)/2 = 2n(n-1) for the off-diagonal pairs, plus one baseline evaluation: 2n^2 + 1 evaluations in total, i.e. O(n^2). Storing H takes O(n^2) memory, and solving the Newton system H p = -g with dense Gaussian elimination takes O(n^3) time. A single Hessian-vector product, by contrast, needs only two gradient evaluations and O(n) extra memory, which is why iterative solvers built on Hv products scale to much larger n.
Code Examples
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>
#include <limits>
#include <iomanip>
#include <algorithm>

// Compute central-difference gradient
std::vector<double> numerical_gradient(
    const std::function<double(const std::vector<double>&)>& f,
    const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    int n = (int)x.size();
    std::vector<double> g(n, 0.0);
    std::vector<double> xp = x, xm = x;
    for (int i = 0; i < n; ++i) {
        double hi = eps * std::max(1.0, std::abs(x[i]));
        xp[i] = x[i] + hi;
        xm[i] = x[i] - hi;
        double fp = f(xp);
        double fm = f(xm);
        g[i] = (fp - fm) / (2.0 * hi);
        xp[i] = xm[i] = x[i];
    }
    return g;
}

// Compute central-difference Hessian (symmetric) with O(n^2) evaluations
std::vector<std::vector<double>> numerical_hessian(
    const std::function<double(const std::vector<double>&)>& f,
    const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    int n = (int)x.size();
    std::vector<std::vector<double>> H(n, std::vector<double>(n, 0.0));
    std::vector<double> xp = x, xm = x, xpp = x, xpm = x, xmp = x, xmm = x;

    double fx = f(x); // baseline value

    // Diagonal second derivatives: f(x+h e_i) - 2 f(x) + f(x-h e_i)
    for (int i = 0; i < n; ++i) {
        double hi = eps * std::max(1.0, std::abs(x[i]));
        xp[i] = x[i] + hi;
        xm[i] = x[i] - hi;
        double fp = f(xp);
        double fm = f(xm);
        H[i][i] = (fp - 2.0 * fx + fm) / (hi * hi);
        xp[i] = xm[i] = x[i];
    }

    // Off-diagonal mixed partials using 4-point stencil (i < j)
    for (int i = 0; i < n; ++i) {
        double hi = eps * std::max(1.0, std::abs(x[i]));
        for (int j = i + 1; j < n; ++j) {
            double hj = eps * std::max(1.0, std::abs(x[j]));
            // x + hi e_i + hj e_j
            xpp[i] = x[i] + hi; xpp[j] = x[j] + hj;
            // x + hi e_i - hj e_j
            xpm[i] = x[i] + hi; xpm[j] = x[j] - hj;
            // x - hi e_i + hj e_j
            xmp[i] = x[i] - hi; xmp[j] = x[j] + hj;
            // x - hi e_i - hj e_j
            xmm[i] = x[i] - hi; xmm[j] = x[j] - hj;

            double fpp = f(xpp);
            double fpm = f(xpm);
            double fmp = f(xmp);
            double fmm = f(xmm);

            double Hij = (fpp - fpm - fmp + fmm) / (4.0 * hi * hj);
            H[i][j] = H[j][i] = Hij; // enforce symmetry

            // reset perturbed components
            xpp[i] = xpm[i] = xmp[i] = xmm[i] = x[i];
            xpp[j] = xpm[j] = xmp[j] = xmm[j] = x[j];
        }
    }

    return H;
}

// Example: 2D Rosenbrock function (a classic nonconvex test problem)
double rosenbrock(const std::vector<double>& x) {
    double a = 1.0, b = 100.0;
    double x0 = x[0], x1 = x[1];
    return (a - x0) * (a - x0) + b * (x1 - x0 * x0) * (x1 - x0 * x0);
}

int main() {
    std::vector<double> x = {-1.2, 1.0};

    auto g = numerical_gradient(rosenbrock, x);
    auto H = numerical_hessian(rosenbrock, x);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "x = [" << x[0] << ", " << x[1] << "]\n";
    std::cout << "Gradient: [" << g[0] << ", " << g[1] << "]\n";
    std::cout << "Hessian:\n";
    for (auto& row : H) {
        for (double v : row) std::cout << std::setw(12) << v << ' ';
        std::cout << '\n';
    }
    return 0;
}
This program implements central-difference formulas to approximate the gradient and Hessian of a general f: R^n -> R. It then computes these at a point for the 2D Rosenbrock function, which has a curved valley and a unique minimizer at (1,1). The Hessian reveals strong anisotropic curvature: one eigenvalue is large (stiff direction), the other much smaller (flat direction).
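As a sanity check, the same derivatives can be computed analytically; the finite-difference output at (-1.2, 1.0) should match these values to several digits:

```latex
f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2
\nabla f = \begin{pmatrix} -2(1 - x) - 400\,x\,(y - x^2) \\ 200\,(y - x^2) \end{pmatrix},
\qquad
H = \begin{pmatrix} 2 - 400y + 1200x^2 & -400x \\ -400x & 200 \end{pmatrix}
\text{At } (-1.2,\ 1.0):\quad
\nabla f = \begin{pmatrix} -215.6 \\ -88 \end{pmatrix},
\qquad
H = \begin{pmatrix} 1330 & 480 \\ 480 & 200 \end{pmatrix}
```

The eigenvalues of this H differ by roughly a factor of 25, which is the anisotropy the surrounding text describes.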
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>
#include <limits>
#include <string>
#include <algorithm> // for std::max

// Reuse numerical Hessian for 2D (simplified inline for clarity)
std::vector<std::vector<double>> hessian2D(
    const std::function<double(const std::vector<double>&)>& f,
    const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    double h0 = eps * std::max(1.0, std::abs(x[0]));
    double h1 = eps * std::max(1.0, std::abs(x[1]));

    std::vector<std::vector<double>> H(2, std::vector<double>(2, 0.0));
    double fx = f(x);
    // Diagonals
    std::vector<double> p = x, m = x;
    p[0] = x[0] + h0; m[0] = x[0] - h0;
    H[0][0] = (f(p) - 2*fx + f(m)) / (h0*h0);
    p = x; m = x;
    p[1] = x[1] + h1; m[1] = x[1] - h1;
    H[1][1] = (f(p) - 2*fx + f(m)) / (h1*h1);
    // Off-diagonal via 4-point stencil
    std::vector<double> xpp = x, xpm = x, xmp = x, xmm = x;
    xpp[0] = x[0] + h0; xpp[1] = x[1] + h1;
    xpm[0] = x[0] + h0; xpm[1] = x[1] - h1;
    xmp[0] = x[0] - h0; xmp[1] = x[1] + h1;
    xmm[0] = x[0] - h0; xmm[1] = x[1] - h1;
    double Hij = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4*h0*h1);
    H[0][1] = H[1][0] = Hij;
    return H;
}

std::string classify2D(const std::vector<std::vector<double>>& H) {
    // For symmetric 2x2, eigenvalues from trace and determinant
    double a = H[0][0], b = H[0][1], c = H[1][0], d = H[1][1];
    double tr = a + d;
    double det = a*d - b*c;
    if (det > 0 && tr > 0) return "Local minimum (H positive definite)";
    if (det > 0 && tr < 0) return "Local maximum (H negative definite)";
    if (det < 0) return "Saddle point (H indefinite)";
    return "Inconclusive (determinant ~ 0); check higher-order terms.";
}

// Example function: f(x,y) = x^4 + y^4 - 3x^2 - 3y^2
// Has a local maximum at (0,0) and minima away from origin.
double quartic(const std::vector<double>& x) {
    double X = x[0], Y = x[1];
    return std::pow(X,4) + std::pow(Y,4) - 3*X*X - 3*Y*Y;
}

int main() {
    std::vector<std::vector<double>> points = {
        {0.0, 0.0},  // expect local maximum
        {1.3, 0.0},  // off-origin point (likely moving toward a minimum)
        {1.0, 1.0}   // a symmetric point
    };
    for (auto& x : points) {
        auto H = hessian2D(quartic, x);
        std::cout << "Point (" << x[0] << ", " << x[1] << ") -> "
                  << classify2D(H) << "\n";
    }
    return 0;
}
This program computes a 2×2 numerical Hessian and classifies the local behavior using determinant and trace. In 2D, det(H) > 0 with positive trace indicates a local minimum, det(H) > 0 with negative trace indicates a local maximum, and det(H) < 0 indicates a saddle. When det ≈ 0, second-order information may be insufficient and higher-order terms or other diagnostics are needed.
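For the quartic in this example the classification can be verified by hand, since its Hessian is diagonal:

```latex
f(x, y) = x^4 + y^4 - 3x^2 - 3y^2, \qquad
H(x, y) = \begin{pmatrix} 12x^2 - 6 & 0 \\ 0 & 12y^2 - 6 \end{pmatrix}
(0, 0):\; H = \mathrm{diag}(-6, -6) \Rightarrow \text{local maximum}
(\sqrt{3/2}, \sqrt{3/2}):\; H = \mathrm{diag}(12, 12) \Rightarrow \text{local minimum}
(\sqrt{3/2}, 0):\; H = \mathrm{diag}(12, -6),\ \det H < 0 \Rightarrow \text{saddle}
```

Strictly speaking, the min/max/saddle labels apply at critical points (where the gradient vanishes); at a non-critical point like (1.3, 0) the Hessian still describes local curvature, but not extremum status.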
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>
#include <limits>
#include <algorithm>
#include <stdexcept>
#include <iomanip>

// Dot product and norms
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i]*b[i];
    return s;
}

double norm2(const std::vector<double>& a) { return std::sqrt(dot(a,a)); }

// Central-difference gradient
std::vector<double> numerical_gradient(const std::function<double(const std::vector<double>&)>& f,
                                       const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    int n = (int)x.size();
    std::vector<double> g(n), xp = x, xm = x;
    for (int i = 0; i < n; ++i) {
        double h = eps * std::max(1.0, std::abs(x[i]));
        xp[i] = x[i] + h; xm[i] = x[i] - h;
        g[i] = (f(xp) - f(xm)) / (2*h);
        xp[i] = xm[i] = x[i];
    }
    return g;
}

// Central-difference Hessian (symmetric)
std::vector<std::vector<double>> numerical_hessian(const std::function<double(const std::vector<double>&)>& f,
                                                   const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    int n = (int)x.size();
    std::vector<std::vector<double>> H(n, std::vector<double>(n, 0.0));
    std::vector<double> xp = x, xm = x, xpp = x, xpm = x, xmp = x, xmm = x;
    double fx = f(x);
    // Diagonals
    for (int i = 0; i < n; ++i) {
        double h = eps * std::max(1.0, std::abs(x[i]));
        xp[i] = x[i] + h; xm[i] = x[i] - h;
        H[i][i] = (f(xp) - 2*fx + f(xm)) / (h*h);
        xp[i] = xm[i] = x[i];
    }
    // Off-diagonals
    for (int i = 0; i < n; ++i) {
        double hi = eps * std::max(1.0, std::abs(x[i]));
        for (int j = i+1; j < n; ++j) {
            double hj = eps * std::max(1.0, std::abs(x[j]));
            xpp[i]=x[i]+hi; xpp[j]=x[j]+hj;
            xpm[i]=x[i]+hi; xpm[j]=x[j]-hj;
            xmp[i]=x[i]-hi; xmp[j]=x[j]+hj;
            xmm[i]=x[i]-hi; xmm[j]=x[j]-hj;
            double Hij = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4*hi*hj);
            H[i][j] = H[j][i] = Hij;
            xpp[i]=xpm[i]=xmp[i]=xmm[i]=x[i];
            xpp[j]=xpm[j]=xmp[j]=xmm[j]=x[j];
        }
    }
    return H;
}

// Solve A x = b with Gaussian elimination + partial pivoting (dense, O(n^3))
std::vector<double> solve_linear(std::vector<std::vector<double>> A, std::vector<double> b) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) {
        // Pivot
        int piv = i;
        double best = std::abs(A[i][i]);
        for (int r = i+1; r < n; ++r) {
            if (std::abs(A[r][i]) > best) { best = std::abs(A[r][i]); piv = r; }
        }
        if (best == 0.0) throw std::runtime_error("Singular matrix in solve_linear");
        if (piv != i) { std::swap(A[piv], A[i]); std::swap(b[piv], b[i]); }
        // Eliminate
        for (int r = i+1; r < n; ++r) {
            double m = A[r][i] / A[i][i];
            for (int c = i; c < n; ++c) A[r][c] -= m * A[i][c];
            b[r] -= m * b[i];
        }
    }
    // Back substitution
    std::vector<double> x(n);
    for (int i = n-1; i >= 0; --i) {
        double s = b[i];
        for (int c = i+1; c < n; ++c) s -= A[i][c] * x[c];
        x[i] = s / A[i][i];
    }
    return x;
}

// Backtracking line search with Armijo condition
double line_search(const std::function<double(const std::vector<double>&)>& f,
                   const std::vector<double>& x,
                   const std::vector<double>& p,
                   const std::vector<double>& g,
                   double alpha0 = 1.0, double c = 1e-4, double rho = 0.5) {
    double fx = f(x);
    double dg = dot(g, p);
    double alpha = alpha0;
    std::vector<double> xn = x;
    while (true) {
        for (size_t i = 0; i < x.size(); ++i) xn[i] = x[i] + alpha * p[i];
        double fnew = f(xn);
        if (fnew <= fx + c * alpha * dg) break; // Armijo condition
        alpha *= rho;
        if (alpha < 1e-12) break; // safeguard
    }
    return alpha;
}

// Rosenbrock function in 2D
double rosenbrock(const std::vector<double>& x) {
    double a = 1.0, b = 100.0;
    double x0 = x[0], x1 = x[1];
    return (a - x0) * (a - x0) + b * (x1 - x0 * x0) * (x1 - x0 * x0);
}

int main() {
    std::vector<double> x = {-1.2, 1.0}; // initial guess
    const int max_it = 50;
    const double tol = 1e-6;

    std::cout << std::fixed << std::setprecision(6);
    for (int it = 0; it < max_it; ++it) {
        auto g = numerical_gradient(rosenbrock, x);
        double gn = norm2(g);
        std::cout << "it=" << it << ": f=" << rosenbrock(x) << ", ||g||=" << gn << "\n";
        if (gn < tol) break;
        auto H = numerical_hessian(rosenbrock, x);
        // Regularize if necessary to avoid singular/indefinite H
        for (size_t i = 0; i < H.size(); ++i) H[i][i] += 1e-6; // small damping
        // Solve H p = -g
        std::vector<double> rhs = g;
        for (double &v : rhs) v = -v;
        std::vector<double> p = solve_linear(H, rhs);
        // Line search for globalization
        double a = line_search(rosenbrock, x, p, g);
        for (size_t i = 0; i < x.size(); ++i) x[i] += a * p[i];
    }
    std::cout << "Solution approx: (" << x[0] << ", " << x[1] << ")\n";
    return 0;
}
This code implements a basic Newton optimizer with backtracking line search using numerical gradient and Hessian. A tiny diagonal damping stabilizes the linear solve when the Hessian is nearly singular or indefinite. Starting from a poor guess on Rosenbrock, it converges near (1,1). For higher dimensions or expensive functions, consider Hessian-vector products and iterative solvers instead of forming H.