Hessian Matrix
Key Points
- The Hessian matrix collects all second-order partial derivatives of a scalar function and measures local curvature.
- For twice continuously differentiable functions, the Hessian is symmetric by Clairaut's theorem.
- The quadratic term p^T H p in the second-order Taylor expansion predicts how fast the function bends along direction p.
- Positive definite Hessians indicate local minima, negative definite indicate local maxima, and indefinite indicate saddle points.
- In optimization, Newton's method uses the Hessian to choose curvature-aware steps that often converge very fast near the optimum.
- Numerically approximating a full n×n Hessian with central differences typically requires O(n^2) function evaluations.
- Eigenvalues of the Hessian describe principal curvatures; signs tell you the type of critical point and magnitudes tell you how sharp it is.
- Hessian-vector products can be computed without forming H explicitly, enabling scalable second-order methods.
Prerequisites
- Multivariable calculus (limits, partial derivatives, chain rule): The Hessian is built from second-order partial derivatives; understanding derivatives and continuity is essential.
- Linear algebra (vectors, matrices, eigenvalues): The Hessian is a symmetric matrix; interpreting curvature requires eigenvalues, quadratic forms, and solving linear systems.
- Numerical analysis and floating-point arithmetic: Finite-difference approximation accuracy and stability depend on step sizes and round-off behavior.
- Unconstrained optimization basics: Using Hessians in Newton/trust-region methods requires understanding descent directions and line search/trust-region globalization.
Detailed Explanation
01 Overview
The Hessian matrix is a square matrix of all second-order partial derivatives of a real-valued function. If you have a function f that maps points from n-dimensional space to real numbers, the Hessian summarizes how the slope (gradient) changes as you move in different directions. While the gradient tells you the direction of steepest ascent, the Hessian tells you how that direction itself curves: how the function bends. This makes the Hessian central to understanding local shape: whether you are in a valley (minimum), on a hilltop (maximum), or on a saddle. In optimization and machine learning, the Hessian guides powerful algorithms like Newton's method and quasi-Newton methods that exploit curvature to make faster progress than methods that only use gradients. In numerical analysis, one often approximates Hessians with finite differences when analytic derivatives are not available. The Hessian also appears in physics (stability analysis), computer vision (blob/ridge detection), and statistics (Fisher information approximations). Importantly, when f is smooth enough (C^2), the Hessian is symmetric, enabling efficient linear algebra and meaningful eigenvalue interpretations that connect directly to convexity, stability, and step-size selection.
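To make the overview concrete, here is a small worked example (the function is chosen purely for illustration):

```latex
f(x, y) = x^3 y + y^2,
\qquad
\nabla f(x, y) = \begin{pmatrix} 3x^2 y \\ x^3 + 2y \end{pmatrix},
\qquad
H(x, y) = \begin{pmatrix} 6xy & 3x^2 \\ 3x^2 & 2 \end{pmatrix}
```

The two mixed partials \partial^2 f / \partial x \partial y and \partial^2 f / \partial y \partial x both equal 3x^2, so H is symmetric, exactly as expected for a C^2 function.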
02 Intuition & Analogies
Imagine hiking on a landscape described by a height function f(x, y). The gradient tells you which way is uphill right now. But the gradient alone cannot tell you whether the terrain is gently sloping or sharply curved. The Hessian fills that gap: it's like a curvature compass that says how the slope changes as you step in any direction. If you step along some direction p, the term p^T H p tells you how quickly the slope increases or decreases: positive values mean the ground curves upward like a bowl, negative means it curves downward like a dome, and values of different signs in different directions indicate a saddle. Another analogy: think of driving on a road where the steering wheel angle is the gradient. The Hessian measures how much you need to turn the steering wheel more or less as you move, how rapidly the steering demand changes. In finance, consider the price of an option as a function of multiple risks; the Hessian entries quantify how sensitive the sensitivity is: the curvature of price with respect to pairs of inputs. In machine learning, loss functions over parameters form a high-dimensional landscape; the Hessian's eigenvalues show which directions are flat (small eigenvalues) or stiff (large ones). This helps decide step sizes and whether you might oscillate (negative curvature) if you move too aggressively. In short, while the gradient answers "which way?", the Hessian answers "how does that way change as you go?", letting you predict and control motion more precisely.
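The saddle in this analogy has a two-line algebraic counterpart (the function is an illustrative choice):

```latex
f(x, y) = x^2 - y^2, \qquad H = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}
p = (1, 0):\; p^T H p = 2 \;(\text{curves up}), \qquad
p = (0, 1):\; p^T H p = -2 \;(\text{curves down})
```

Because different directions give curvature of different signs, the origin is a saddle point; the quadratic form p^T H p is exactly the "curvature compass" reading along p.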
03 Formal Definition
Let f: R^n -> R be twice differentiable at an interior point x of its domain. The Hessian of f at x is the n×n matrix H(x) whose (i, j) entry is the second-order partial derivative \partial^2 f / (\partial x_i \partial x_j), often written H(x) = \nabla^2 f(x). When f is C^2 near x, Clairaut's theorem gives H(x) = H(x)^T, so the Hessian is symmetric and admits a real eigen-decomposition.
04 When to Use
Use the Hessian when you need to understand or exploit curvature. In unconstrained optimization, Newton's method and its variants leverage H to achieve quadratic convergence near a solution, drastically reducing iterations compared to gradient descent. In machine learning, analyzing Hessian eigenvalues helps diagnose sharp vs. flat minima, choose trust-region sizes, or precondition gradient methods. In statistics and econometrics, the observed Hessian (or its expectation, Fisher information) provides standard errors via curvature of the log-likelihood. In numerical PDEs and physics, Hessians arise in stability analysis: positive definiteness of an energy functional's Hessian signals stable equilibria. In computer vision, the Hessian of an image intensity function is used to detect ridges and blobs (e.g., via determinant or eigenvalues). Practically, compute or approximate Hessians when: (1) you need fast local convergence near a minimizer, (2) you must classify a critical point, (3) you require uncertainty quantification from curvature, or (4) you want to approximate function behavior by a quadratic model. When n is large, prefer Hessian-vector products (avoiding explicit H) or quasi-Newton updates (e.g., BFGS) that approximate curvature with lower cost.
Common Mistakes
- Ignoring symmetry: For C^2 functions, H should be symmetric; numerical approximations can break symmetry due to round-off. Symmetrize by averaging with its transpose.
- Bad finite-difference step sizes: Steps that are too small amplify floating-point noise; too large induce truncation error. A common heuristic is h_i = \sqrt{\epsilon} \max(1, |x_i|).
- Misclassifying critical points: Checking only the determinant in 2D is not enough when det = 0; you must consider eigenvalues or higher-order terms.
- Forgetting domain issues: Hessians assume interior points where partials exist; at kinks or constraints, second derivatives may not exist or be irrelevant (use subgradients or KKT conditions).
- Solving Newton steps without safeguards: If H is not positive definite, pure Newton can diverge. Use line search, damping, or modify H (e.g., add \lambda I) to ensure descent.
- Mixing units/scales: Highly scaled variables can cause ill-conditioned Hessians and unstable linear solves; rescale features or precondition.
- Assuming convexity from a single Hessian: Positive definiteness at one point does not imply global convexity unless H(x) \succeq 0 for all x.
- Overlooking Hessian-vector tricks: Forming H explicitly is O(n^2) memory; use Hv products with iterative solvers when n is large.
Key Formulas
Hessian Definition
H(x)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x), \quad i, j = 1, \dots, n
Explanation: Each entry is a second-order partial derivative. Arranging these entries yields the Hessian matrix H(x) that captures curvature.
Symmetry (Clairaut-Schwarz)
\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}
Explanation: If mixed partials are continuous, their order does not matter. This makes the Hessian symmetric and enables eigen-decomposition.
Second-Order Taylor Expansion
f(x + p) \approx f(x) + \nabla f(x)^T p + \frac{1}{2} p^T H(x) p
Explanation: This quadratic approximation predicts how f changes for small steps p. The Hessian's quadratic form is the curvature term.
Quadratic Form
p^T H p = \sum_{i=1}^{n} \sum_{j=1}^{n} H_{ij} p_i p_j
Explanation: This expands curvature along direction p in coordinates. The sign and magnitude reveal bowl/dome/flatness along p.
Convexity via Hessian
f \text{ convex} \iff H(x) \succeq 0 \text{ for all } x \iff p^T H(x) p \ge 0 \text{ for all } p \text{ and } x
Explanation: A twice-differentiable function is convex exactly when its Hessian is positive semidefinite everywhere. Nonnegativity of p^T H p for all p encodes this.
Newton Step
p = -H(x)^{-1} \nabla f(x), \quad \text{i.e. solve } H(x)\, p = -\nabla f(x)
Explanation: The Newton direction solves H p = -g. When H is positive definite near a minimizer, this step often gives rapid (quadratic) convergence.
Second Derivative (Diagonal) Finite Difference
\frac{\partial^2 f}{\partial x_i^2} \approx \frac{f(x + h e_i) - 2 f(x) + f(x - h e_i)}{h^2}
Explanation: A central difference approximation for the second derivative along axis i. Accuracy is second order in h.
Mixed Partial Finite Difference
\frac{\partial^2 f}{\partial x_i \partial x_j} \approx \frac{f(x + h_i e_i + h_j e_j) - f(x + h_i e_i - h_j e_j) - f(x - h_i e_i + h_j e_j) + f(x - h_i e_i - h_j e_j)}{4 h_i h_j}
Explanation: A symmetric stencil to approximate cross derivatives. Requires four function evaluations per (i,j) pair, yielding second-order accuracy.
2×2 Hessian Eigenvalues
\lambda_{1,2} = \frac{\mathrm{tr}(H) \pm \sqrt{\mathrm{tr}(H)^2 - 4 \det(H)}}{2}
Explanation: For 2D, eigenvalues depend on the trace and determinant. Their signs classify the critical point type.
Hessian-Vector Product
H v = \lim_{\epsilon \to 0} \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon} = \nabla \left( \nabla f(x)^T v \right)
Explanation: The directional derivative of the gradient equals H times v. This identity enables computing Hv without forming H explicitly.
Complexity Analysis
A central-difference Hessian costs 2n function evaluations for the n diagonal entries, 4 \cdot n(n-1)/2 = 2n(n-1) for the off-diagonal pairs, plus one baseline evaluation: 2n^2 + 1 evaluations in total, i.e. O(n^2). Storing H takes O(n^2) memory, and solving the Newton system H p = -g with dense Gaussian elimination takes O(n^3) time. A single Hessian-vector product, by contrast, needs only two gradient evaluations and O(n) extra memory, which is why iterative solvers built on Hv products scale to much larger n.
Code Examples
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>
#include <limits>
#include <iomanip>
#include <algorithm>

// Compute central-difference gradient
std::vector<double> numerical_gradient(
    const std::function<double(const std::vector<double>&)>& f,
    const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    int n = (int)x.size();
    std::vector<double> g(n, 0.0);
    std::vector<double> xp = x, xm = x;
    for (int i = 0; i < n; ++i) {
        double hi = eps * std::max(1.0, std::abs(x[i]));
        xp[i] = x[i] + hi;
        xm[i] = x[i] - hi;
        double fp = f(xp);
        double fm = f(xm);
        g[i] = (fp - fm) / (2.0 * hi);
        xp[i] = xm[i] = x[i];
    }
    return g;
}

// Compute central-difference Hessian (symmetric) with O(n^2) evaluations
std::vector<std::vector<double>> numerical_hessian(
    const std::function<double(const std::vector<double>&)>& f,
    const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    int n = (int)x.size();
    std::vector<std::vector<double>> H(n, std::vector<double>(n, 0.0));
    std::vector<double> xp = x, xm = x, xpp = x, xpm = x, xmp = x, xmm = x;

    double fx = f(x); // baseline value

    // Diagonal second derivatives: f(x+h e_i) - 2 f(x) + f(x-h e_i)
    for (int i = 0; i < n; ++i) {
        double hi = eps * std::max(1.0, std::abs(x[i]));
        xp[i] = x[i] + hi;
        xm[i] = x[i] - hi;
        double fp = f(xp);
        double fm = f(xm);
        H[i][i] = (fp - 2.0 * fx + fm) / (hi * hi);
        xp[i] = xm[i] = x[i];
    }

    // Off-diagonal mixed partials using 4-point stencil (i < j)
    for (int i = 0; i < n; ++i) {
        double hi = eps * std::max(1.0, std::abs(x[i]));
        for (int j = i + 1; j < n; ++j) {
            double hj = eps * std::max(1.0, std::abs(x[j]));
            // x + hi e_i + hj e_j
            xpp[i] = x[i] + hi; xpp[j] = x[j] + hj;
            // x + hi e_i - hj e_j
            xpm[i] = x[i] + hi; xpm[j] = x[j] - hj;
            // x - hi e_i + hj e_j
            xmp[i] = x[i] - hi; xmp[j] = x[j] + hj;
            // x - hi e_i - hj e_j
            xmm[i] = x[i] - hi; xmm[j] = x[j] - hj;

            double fpp = f(xpp);
            double fpm = f(xpm);
            double fmp = f(xmp);
            double fmm = f(xmm);

            double Hij = (fpp - fpm - fmp + fmm) / (4.0 * hi * hj);
            H[i][j] = H[j][i] = Hij; // enforce symmetry

            // reset perturbed components
            xpp[i] = xpm[i] = xmp[i] = xmm[i] = x[i];
            xpp[j] = xpm[j] = xmp[j] = xmm[j] = x[j];
        }
    }

    return H;
}

// Example: 2D Rosenbrock function (a classic nonconvex test problem)
double rosenbrock(const std::vector<double>& x) {
    double a = 1.0, b = 100.0;
    double x0 = x[0], x1 = x[1];
    return (a - x0) * (a - x0) + b * (x1 - x0 * x0) * (x1 - x0 * x0);
}

int main() {
    std::vector<double> x = {-1.2, 1.0};

    auto g = numerical_gradient(rosenbrock, x);
    auto H = numerical_hessian(rosenbrock, x);

    std::cout << std::fixed << std::setprecision(6);
    std::cout << "x = [" << x[0] << ", " << x[1] << "]\n";
    std::cout << "Gradient: [" << g[0] << ", " << g[1] << "]\n";
    std::cout << "Hessian:\n";
    for (auto& row : H) {
        for (double v : row) std::cout << std::setw(12) << v << ' ';
        std::cout << '\n';
    }
    return 0;
}
This program implements central-difference formulas to approximate the gradient and Hessian of a general f: R^n -> R. It then computes these at a point for the 2D Rosenbrock function, which has a curved valley and a unique minimizer at (1,1). The Hessian reveals strong anisotropic curvature: one eigenvalue is large (stiff direction), the other much smaller (flat direction).
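As a sanity check, the same derivatives can be computed analytically; the finite-difference output at (-1.2, 1.0) should match these values to several digits:

```latex
f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2
\nabla f = \begin{pmatrix} -2(1 - x) - 400\,x\,(y - x^2) \\ 200\,(y - x^2) \end{pmatrix},
\qquad
H = \begin{pmatrix} 2 - 400y + 1200x^2 & -400x \\ -400x & 200 \end{pmatrix}
\text{At } (-1.2,\ 1.0):\quad
\nabla f = \begin{pmatrix} -215.6 \\ -88 \end{pmatrix},
\qquad
H = \begin{pmatrix} 1330 & 480 \\ 480 & 200 \end{pmatrix}
```

The eigenvalues of this H differ by roughly a factor of 25, which is the anisotropy the surrounding text describes.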
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>
#include <limits>
#include <string>
#include <algorithm> // for std::max

// Reuse numerical Hessian for 2D (simplified inline for clarity)
std::vector<std::vector<double>> hessian2D(
    const std::function<double(const std::vector<double>&)>& f,
    const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    double h0 = eps * std::max(1.0, std::abs(x[0]));
    double h1 = eps * std::max(1.0, std::abs(x[1]));

    std::vector<std::vector<double>> H(2, std::vector<double>(2, 0.0));
    double fx = f(x);
    // Diagonals
    std::vector<double> p = x, m = x;
    p[0] = x[0] + h0; m[0] = x[0] - h0;
    H[0][0] = (f(p) - 2*fx + f(m)) / (h0*h0);
    p = x; m = x;
    p[1] = x[1] + h1; m[1] = x[1] - h1;
    H[1][1] = (f(p) - 2*fx + f(m)) / (h1*h1);
    // Off-diagonal via 4-point stencil
    std::vector<double> xpp = x, xpm = x, xmp = x, xmm = x;
    xpp[0] = x[0] + h0; xpp[1] = x[1] + h1;
    xpm[0] = x[0] + h0; xpm[1] = x[1] - h1;
    xmp[0] = x[0] - h0; xmp[1] = x[1] + h1;
    xmm[0] = x[0] - h0; xmm[1] = x[1] - h1;
    double Hij = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4*h0*h1);
    H[0][1] = H[1][0] = Hij;
    return H;
}

std::string classify2D(const std::vector<std::vector<double>>& H) {
    // For symmetric 2x2, eigenvalues from trace and determinant
    double a = H[0][0], b = H[0][1], c = H[1][0], d = H[1][1];
    double tr = a + d;
    double det = a*d - b*c;
    if (det > 0 && tr > 0) return "Local minimum (H positive definite)";
    if (det > 0 && tr < 0) return "Local maximum (H negative definite)";
    if (det < 0) return "Saddle point (H indefinite)";
    return "Inconclusive (determinant ~ 0); check higher-order terms.";
}

// Example function: f(x,y) = x^4 + y^4 - 3x^2 - 3y^2
// Has a local maximum at (0,0) and minima away from origin.
double quartic(const std::vector<double>& x) {
    double X = x[0], Y = x[1];
    return std::pow(X,4) + std::pow(Y,4) - 3*X*X - 3*Y*Y;
}

int main() {
    std::vector<std::vector<double>> points = {
        {0.0, 0.0},  // expect local maximum
        {1.3, 0.0},  // off-origin point (likely moving toward a minimum)
        {1.0, 1.0}   // a symmetric point
    };
    for (auto& x : points) {
        auto H = hessian2D(quartic, x);
        std::cout << "Point (" << x[0] << ", " << x[1] << ") -> "
                  << classify2D(H) << "\n";
    }
    return 0;
}
This program computes a 2×2 numerical Hessian and classifies the local behavior using determinant and trace. In 2D, det(H) > 0 with positive trace indicates a local minimum, det(H) > 0 with negative trace indicates a local maximum, and det(H) < 0 indicates a saddle. When det ≈ 0, second-order information may be insufficient and higher-order terms or other diagnostics are needed.
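For the quartic in this example the classification can be verified by hand, since its Hessian is diagonal:

```latex
f(x, y) = x^4 + y^4 - 3x^2 - 3y^2, \qquad
H(x, y) = \begin{pmatrix} 12x^2 - 6 & 0 \\ 0 & 12y^2 - 6 \end{pmatrix}
(0, 0):\; H = \mathrm{diag}(-6, -6) \Rightarrow \text{local maximum}
(\sqrt{3/2}, \sqrt{3/2}):\; H = \mathrm{diag}(12, 12) \Rightarrow \text{local minimum}
(\sqrt{3/2}, 0):\; H = \mathrm{diag}(12, -6),\ \det H < 0 \Rightarrow \text{saddle}
```

Strictly speaking, the min/max/saddle labels apply at critical points (where the gradient vanishes); at a non-critical point like (1.3, 0) the Hessian still describes local curvature, but not extremum status.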
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>
#include <limits>
#include <algorithm>
#include <stdexcept>
#include <iomanip>

// Dot product and norms
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i]*b[i];
    return s;
}

double norm2(const std::vector<double>& a) { return std::sqrt(dot(a,a)); }

// Central-difference gradient
std::vector<double> numerical_gradient(const std::function<double(const std::vector<double>&)>& f,
                                       const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    int n = (int)x.size();
    std::vector<double> g(n), xp = x, xm = x;
    for (int i = 0; i < n; ++i) {
        double h = eps * std::max(1.0, std::abs(x[i]));
        xp[i] = x[i] + h; xm[i] = x[i] - h;
        g[i] = (f(xp) - f(xm)) / (2*h);
        xp[i] = xm[i] = x[i];
    }
    return g;
}

// Central-difference Hessian (symmetric)
std::vector<std::vector<double>> numerical_hessian(const std::function<double(const std::vector<double>&)>& f,
                                                   const std::vector<double>& x) {
    const double eps = std::sqrt(std::numeric_limits<double>::epsilon());
    int n = (int)x.size();
    std::vector<std::vector<double>> H(n, std::vector<double>(n, 0.0));
    std::vector<double> xp = x, xm = x, xpp = x, xpm = x, xmp = x, xmm = x;
    double fx = f(x);
    // Diagonals
    for (int i = 0; i < n; ++i) {
        double h = eps * std::max(1.0, std::abs(x[i]));
        xp[i] = x[i] + h; xm[i] = x[i] - h;
        H[i][i] = (f(xp) - 2*fx + f(xm)) / (h*h);
        xp[i] = xm[i] = x[i];
    }
    // Off-diagonals
    for (int i = 0; i < n; ++i) {
        double hi = eps * std::max(1.0, std::abs(x[i]));
        for (int j = i+1; j < n; ++j) {
            double hj = eps * std::max(1.0, std::abs(x[j]));
            xpp[i]=x[i]+hi; xpp[j]=x[j]+hj;
            xpm[i]=x[i]+hi; xpm[j]=x[j]-hj;
            xmp[i]=x[i]-hi; xmp[j]=x[j]+hj;
            xmm[i]=x[i]-hi; xmm[j]=x[j]-hj;
            double Hij = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4*hi*hj);
            H[i][j] = H[j][i] = Hij;
            xpp[i]=xpm[i]=xmp[i]=xmm[i]=x[i];
            xpp[j]=xpm[j]=xmp[j]=xmm[j]=x[j];
        }
    }
    return H;
}

// Solve A x = b with Gaussian elimination + partial pivoting (dense, O(n^3))
std::vector<double> solve_linear(std::vector<std::vector<double>> A, std::vector<double> b) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) {
        // Pivot
        int piv = i;
        double best = std::abs(A[i][i]);
        for (int r = i+1; r < n; ++r) {
            if (std::abs(A[r][i]) > best) { best = std::abs(A[r][i]); piv = r; }
        }
        if (best == 0.0) throw std::runtime_error("Singular matrix in solve_linear");
        if (piv != i) { std::swap(A[piv], A[i]); std::swap(b[piv], b[i]); }
        // Eliminate
        for (int r = i+1; r < n; ++r) {
            double m = A[r][i] / A[i][i];
            for (int c = i; c < n; ++c) A[r][c] -= m * A[i][c];
            b[r] -= m * b[i];
        }
    }
    // Back substitution
    std::vector<double> x(n);
    for (int i = n-1; i >= 0; --i) {
        double s = b[i];
        for (int c = i+1; c < n; ++c) s -= A[i][c] * x[c];
        x[i] = s / A[i][i];
    }
    return x;
}

// Backtracking line search with Armijo condition
double line_search(const std::function<double(const std::vector<double>&)>& f,
                   const std::vector<double>& x,
                   const std::vector<double>& p,
                   const std::vector<double>& g,
                   double alpha0 = 1.0, double c = 1e-4, double rho = 0.5) {
    double fx = f(x);
    double dg = dot(g, p);
    double alpha = alpha0;
    std::vector<double> xn = x;
    while (true) {
        for (size_t i = 0; i < x.size(); ++i) xn[i] = x[i] + alpha * p[i];
        double fnew = f(xn);
        if (fnew <= fx + c * alpha * dg) break; // Armijo condition
        alpha *= rho;
        if (alpha < 1e-12) break; // safeguard
    }
    return alpha;
}

// Rosenbrock function in 2D
double rosenbrock(const std::vector<double>& x) {
    double a = 1.0, b = 100.0;
    double x0 = x[0], x1 = x[1];
    return (a - x0) * (a - x0) + b * (x1 - x0 * x0) * (x1 - x0 * x0);
}

int main() {
    std::vector<double> x = {-1.2, 1.0}; // initial guess
    const int max_it = 50;
    const double tol = 1e-6;

    std::cout << std::fixed << std::setprecision(6);
    for (int it = 0; it < max_it; ++it) {
        auto g = numerical_gradient(rosenbrock, x);
        double gn = norm2(g);
        std::cout << "it=" << it << ": f=" << rosenbrock(x) << ", ||g||=" << gn << "\n";
        if (gn < tol) break;
        auto H = numerical_hessian(rosenbrock, x);
        // Regularize if necessary to avoid singular/indefinite H
        for (size_t i = 0; i < H.size(); ++i) H[i][i] += 1e-6; // small damping
        // Solve H p = -g
        std::vector<double> rhs = g;
        for (double &v : rhs) v = -v;
        std::vector<double> p = solve_linear(H, rhs);
        // Line search for globalization
        double a = line_search(rosenbrock, x, p, g);
        for (size_t i = 0; i < x.size(); ++i) x[i] += a * p[i];
    }
    std::cout << "Solution approx: (" << x[0] << ", " << x[1] << ")\n";
    return 0;
}
This code implements a basic Newton optimizer with backtracking line search using numerical gradient and Hessian. A tiny diagonal damping stabilizes the linear solve when the Hessian is nearly singular or indefinite. Starting from a poor guess on Rosenbrock, it converges near (1,1). For higher dimensions or expensive functions, consider Hessian-vector products and iterative solvers instead of forming H.