Newton's Method & Second-Order Optimization
Key Points
- Newton's method uses both the gradient and the Hessian to take steps that aim directly at the local optimum by fitting a quadratic model of the loss around the current point.
- The Newton update solves the linear system H(θ)Δ = ∇L(θ) and moves θ ← θ − Δ; this typically converges quadratically near the optimum.
- You should never compute H^{-1} explicitly; instead, solve the linear system using methods like Cholesky or conjugate gradients.
- Because Hessians can be indefinite or ill-conditioned, safeguards such as damping (adding λI), line search, or trust regions are essential for stability.
- Second-order methods can dramatically outperform first-order methods on well-scaled, moderately sized problems, but each iteration is more expensive.
- For large-scale problems, use approximations like Gauss–Newton, Newton–CG, or quasi-Newton (BFGS) to avoid forming dense Hessians.
- Careful stopping criteria (gradient norm, Newton decrement, or small steps) prevent over-iteration and numerical issues.
- Feature scaling and regularization often make the Hessian better conditioned, improving convergence and numerical stability.
Prerequisites
- Multivariable calculus – Gradients, Hessians, and Taylor expansions require understanding of partial derivatives and second-order terms.
- Linear algebra – Solving H\Delta = g, Cholesky factorization, eigenvalues, and positive definiteness are core to Newton's method.
- Optimization basics – Concepts like convexity, line search, and stopping criteria frame when and how Newton's method converges.
- Numerical linear algebra – Stability concerns such as conditioning, factorization choices, and avoiding explicit matrix inverses are essential.
- Probability and logistic regression – For the logistic regression example, understanding likelihoods and the sigmoid function helps interpret the gradient and Hessian.
- Gradient descent – Provides a baseline first-order method to compare against Newton's acceleration and per-iteration costs.
- Matrix calculus – Compactly derives gradients/Hessians for vector-valued models like generalized linear models.
- Backtracking line search – Implements step-size selection that guarantees sufficient decrease for damped Newton.
- Conditioning and feature scaling – Explains why scaling and regularization improve Hessian definiteness and solver stability.
Detailed Explanation
01 Overview
Newton's method is a second-order optimization technique that uses curvature information to accelerate convergence. Instead of moving only in the direction of steepest descent (the gradient), it builds a local quadratic approximation of the objective function around the current point using both the gradient and the Hessian (the matrix of second derivatives). The method then jumps to the minimizer of this quadratic model, which mathematically amounts to solving a linear system involving the Hessian. Near a well-behaved minimum, Newton's method typically achieves quadratic convergence, meaning the number of correct digits roughly doubles each iteration. This makes it especially attractive when high precision is required. However, the method's per-iteration cost can be high because computing and factorizing the Hessian scales poorly with dimension. Moreover, the Hessian may not be positive definite away from the minimum, leading to unstable steps unless safeguards such as damping, line search, or trust regions are used. In practice, Newton's method shines on small to medium-sized, smooth problems (e.g., logistic regression with a few thousand features), and its ideas motivate a family of second-order and quasi-Newton algorithms for large-scale settings.
02 Intuition & Analogies
Imagine hiking in mountainous terrain while trying to reach the lowest point. Gradient descent is like only looking at the local slope: if it's steep, you take a big step downhill; if it's gentle, you take a small one. This works but can be slow when the terrain has valleys shaped like narrow bowls: your steps zig-zag and progress stalls. Newton's method brings a curvature-aware map. Besides the slope (gradient), it measures how the slope itself changes (curvature via the Hessian). With both, you can fit a local bowl (a quadratic) to the terrain and jump near the bottom of that bowl in one go. If your map is accurate, meaning you're close to the true minimum, the jump lands you extremely close, so you converge very quickly. But if the bowl you fit is upside-down (negative curvature) or lopsided (ill-conditioned), a naive jump can shoot you uphill or sideways. That's why experienced hikers add brakes: they shorten the step (line search), slightly flatten the bowl by adding a cushion λI (damping), or only trust the model inside a safe radius (trust region). For very large maps, drawing the full bowl is too costly, so you approximate it or compute the jump direction without explicitly building the entire bowl, much like using landmarks instead of surveying every rock.
03 Formal Definition
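Formally, Newton's method minimizes a second-order Taylor model of the loss at each iterate; the statement below is the standard one, consistent with the Key Formulas later in this page:

```latex
% Quadratic model of L around the current iterate \theta_k:
L(\theta_k + \Delta) \;\approx\; L(\theta_k)
  + \nabla L(\theta_k)^{\top}\Delta
  + \tfrac{1}{2}\,\Delta^{\top} H(\theta_k)\,\Delta

% Minimizing the model over \Delta (with H(\theta_k) positive definite)
% yields the Newton system and the update:
H(\theta_k)\,\Delta_k = \nabla L(\theta_k),
\qquad
\theta_{k+1} = \theta_k - \Delta_k
```

When H is positive definite, Δ_k is the unique minimizer of the quadratic model; in practice Δ_k is obtained by solving the linear system, never by forming H^{-1}.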
04 When to Use
Use Newton's method when your objective is smooth (twice differentiable), the Hessian is available or can be computed efficiently, and the parameter dimension is moderate so that solving linear systems is affordable. Classic use cases include fitting generalized linear models (e.g., logistic regression), maximum likelihood with well-behaved curvature, unconstrained convex optimization where the Hessian is positive definite, and problems requiring high-accuracy solutions. It is especially powerful near the optimum, where quadratic convergence provides a rapid finish. Consider damped or trust-region Newton when far from the solution, the Hessian may be indefinite, or the model is poorly scaled. For large-scale or high-dimensional settings where forming or factorizing H is too expensive, switch to approximations: Gauss–Newton for least-squares, quasi-Newton methods like BFGS/L-BFGS, or Newton–CG to compute the Newton direction via matrix–vector products. Always prefer solving linear systems over explicitly inverting the Hessian, and incorporate line search or trust regions to ensure robust progress.
â ď¸Common Mistakes
- Explicitly computing H^{-1}: This is numerically unstable and wasteful. Always solve H\Delta = g using factorization (e.g., Cholesky for SPD) or iterative solvers (CG) instead of forming the inverse.
- Ignoring indefiniteness: Taking a full Newton step when H is indefinite can increase the objective. Use damping (H+\lambda I), line search (Armijo/Wolfe), or trust regions to handle negative curvature.
- Skipping scaling/regularization: Poorly scaled features lead to ill-conditioned Hessians and erratic steps. Standardize inputs and, where appropriate, add L2 regularization to improve conditioning.
- Weak stopping criteria: Stopping solely on small parameter changes can be misleading. Monitor gradient norm, Newton decrement, and objective decrease to avoid premature or stalled termination.
- Incorrect derivatives: Analytical gradients/Hessians must be carefully derived and implemented; otherwise, Newton can diverge quickly. Validate with finite differences on small problems.
- Overbuilding the Hessian: For large n, forming dense H is prohibitive. Use Hessian–vector products and iterative solvers (Newton–CG) or quasi-Newton updates.
- No safeguards on step length: Full steps may overshoot. Backtracking line search with Armijo or Wolfe conditions provides reliable decrease.
Key Formulas
Newton Update
\theta_{k+1} = \theta_k - H(\theta_k)^{-1} \nabla L(\theta_k)
Explanation: Update parameters by subtracting the Newton step, which minimizes a quadratic approximation of the loss. In practice, solve the linear system H\Delta = \nabla L instead of computing the inverse.
Second-Order Taylor Expansion
L(\theta + \Delta) \approx L(\theta) + \nabla L(\theta)^{\top}\Delta + \tfrac{1}{2}\Delta^{\top} H(\theta)\,\Delta
Explanation: Locally approximates the objective by a quadratic. Minimizing this quadratic gives the Newton step.
Normal Equations for Newton Step
H(\theta)\,\Delta = \nabla L(\theta)
Explanation: The Newton direction is the solution of this linear system. Solving it avoids explicitly forming the inverse of H.
Damped Newton Step
\theta_{k+1} = \theta_k - \alpha_k \Delta_k, \quad \alpha_k \in (0, 1]
Explanation: A line search chooses the step size \alpha_k so that it ensures sufficient decrease. This stabilizes Newton's method when far from the minimizer.
Armijo Condition
L(\theta + \alpha d) \le L(\theta) + c\,\alpha\,\nabla L(\theta)^{\top} d, \quad c \in (0, 1)
Explanation: Guarantees that the step length provides a conservative decrease relative to the predicted linear model. Used in backtracking line search.
Newton Decrement
\lambda(\theta)^2 = \nabla L(\theta)^{\top} H(\theta)^{-1} \nabla L(\theta)
Explanation: Measures predicted improvement from the Newton step. A small decrement indicates near-optimality and can be used as a stopping test.
Hessian Entries
H_{ij} = \dfrac{\partial^2 L}{\partial \theta_i \, \partial \theta_j}
Explanation: Defines the Hessian as the matrix of second partial derivatives. Positive definiteness implies local convexity.
Logistic Regression Gradient
\nabla L(\theta) = \tfrac{1}{m} X^{\top}(\sigma(X\theta) - y) + \lambda\theta
Explanation: For m examples with features X and labels y \in \{0, 1\}, the gradient combines the data term with L2 regularization.
Logistic Regression Hessian
H = \tfrac{1}{m} X^{\top} W X + \lambda I, \quad W = \mathrm{diag}\big(p_i (1 - p_i)\big), \; p_i = \sigma(x_i^{\top}\theta)
Explanation: The Hessian is SPD due to the positive weights W and L2 regularization. This makes Cholesky factorization appropriate.
Cholesky Solve Complexity
O(n^3)
Explanation: Factoring an n × n dense Hessian costs on the order of n^3 operations. This dominates the per-iteration cost for moderate n.
Complexity Analysis
Per iteration, the dominant costs are assembling the Hessian (roughly O(m n^2) for the logistic regression Hessian X^T W X over m examples) and factorizing it (O(n^3) for dense Cholesky), plus O(n^2) memory for the dense matrix. Gradient descent pays only O(mn) per iteration but typically needs far more iterations; because Newton converges quadratically near the optimum, a handful of O(n^3) solves is often the better trade for moderate n.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Example function: f(x) = (x - 2)^2 + 0.5 * sin(3x)
// f'(x)  = 2(x - 2) + 1.5 * cos(3x)
// f''(x) = 2 - 4.5 * sin(3x)

struct Function1D {
    static double f(double x) {
        return (x - 2.0) * (x - 2.0) + 0.5 * sin(3.0 * x);
    }
    static double df(double x) {
        return 2.0 * (x - 2.0) + 1.5 * cos(3.0 * x);
    }
    static double d2f(double x) {
        return 2.0 - 4.5 * sin(3.0 * x);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double x = -1.0;                // initial guess
    const int max_iter = 50;        // maximum iterations
    const double tol_g = 1e-8;      // gradient tolerance
    const double tol_step = 1e-12;  // step size tolerance

    // Backtracking line search parameters
    const double c = 1e-4;    // Armijo constant
    const double beta = 0.5;  // reduction factor

    // Damping to handle small/negative curvature
    double lambda = 0.0;  // start undamped; adapt if needed

    cout.setf(std::ios::fixed);
    cout << setprecision(8);

    for (int it = 0; it < max_iter; ++it) {
        double fx = Function1D::f(x);
        double g = Function1D::df(x);
        double h = Function1D::d2f(x);

        if (fabs(g) < tol_g) {
            cout << "Converged by gradient at iter " << it << ", x=" << x << ", f(x)=" << fx << "\n";
            break;
        }

        // Ensure a stable denominator using damping if curvature is too small or negative
        double denom = h + lambda;
        if (denom <= 1e-12) {
            // Increase damping until curvature is acceptable
            lambda = max(1e-6, 2.0 * (1.0 - h));
            denom = h + lambda;
        }

        // Newton direction (for minimization): p = -g / (h + lambda)
        double p = -g / denom;

        // Backtracking line search to satisfy the Armijo condition:
        // f(x + alpha*p) <= f(x) + c * alpha * g * p
        double alpha = 1.0;
        while (true) {
            double x_trial = x + alpha * p;
            double f_trial = Function1D::f(x_trial);
            if (f_trial <= fx + c * alpha * g * p) break;
            alpha *= beta;
            if (alpha < 1e-12) break;  // safeguard
        }

        double step = alpha * p;
        x += step;

        cout << "iter=" << it
             << " x=" << x
             << " f(x)=" << Function1D::f(x)
             << " |g|=" << fabs(Function1D::df(x))
             << " step=" << step
             << " lambda=" << lambda
             << "\n";

        if (fabs(step) < tol_step) {
            cout << "Converged by small step at iter " << it << "\n";
            break;
        }
    }

    return 0;
}
This program minimizes a nonconvex 1D function using Newton's method with two safeguards: damping (to handle small or negative curvature) and Armijo backtracking (to ensure sufficient decrease). The update direction is p = -g/(h+\lambda). A line search scales the step to avoid overshooting. The method converges rapidly once it reaches a region where the quadratic model is accurate.
#include <bits/stdc++.h>
using namespace std;

// Dense vector/matrix helpers for small to moderate n
using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Sigmoid function
static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

// Compute X * theta (m x n times n) -> m
Vec matVec(const Mat &X, const Vec &theta) {
    int m = (int)X.size(), n = (int)theta.size();
    Vec z(m, 0.0);
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += X[i][j] * theta[j];
        z[i] = s;
    }
    return z;
}

// Compute X^T * v (n x m times m) -> n
Vec matTVec(const Mat &X, const Vec &v) {
    int m = (int)X.size(), n = (int)X[0].size();
    Vec r(n, 0.0);
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int i = 0; i < m; ++i) s += X[i][j] * v[i];
        r[j] = s;
    }
    return r;
}

// A += B (both n x n)
void addInPlace(Mat &A, const Mat &B) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i][j] += B[i][j];
}

// A += lambda * I
void addLambdaI(Mat &A, double lambda) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) A[i][i] += lambda;
}

// Cholesky decomposition for SPD matrix A (n x n): A = L L^T
// Returns true on success; L is stored lower-triangular in-place in A
bool choleskyDecompose(Mat &A) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (int k = 0; k < j; ++k)
                sum -= A[i][k] * A[j][k];
            if (i == j) {
                if (sum <= 0.0) return false;  // not SPD
                A[i][i] = sqrt(max(sum, 0.0));
            } else {
                A[i][j] = sum / A[j][j];
            }
        }
        // Zero out upper triangle for clarity (not required)
        for (int j = i + 1; j < n; ++j) A[i][j] = 0.0;
    }
    return true;
}

// Solve A x = b using the Cholesky factor stored in L (lower-triangular)
Vec choleskySolve(const Mat &L, const Vec &b) {
    int n = (int)L.size();
    Vec y(n, 0.0), x(n, 0.0);
    // Forward solve: L y = b
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        for (int k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Backward solve: L^T x = y
    for (int i = n - 1; i >= 0; --i) {
        double sum = y[i];
        for (int k = i + 1; k < n; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// Loss, gradient, and Hessian for L2-regularized logistic regression:
// L(theta) = (1/m) * sum_i NLL_i + (lambda/2) * ||theta||^2,
// implemented with labels y in {0, 1} for convenience.
struct Logistic {
    const Mat &X;   // m x n
    const Vec &y;   // m
    double lambda;  // L2 regularization strength
    int m, n;

    Logistic(const Mat &X_, const Vec &y_, double lambda_) : X(X_), y(y_), lambda(lambda_) {
        m = (int)X.size();
        n = (int)X[0].size();
    }

    double loss(const Vec &theta) const {
        Vec z = matVec(X, theta);
        double s = 0.0;
        for (int i = 0; i < m; ++i) {
            double p = sigmoid(z[i]);
            // NLL for y in {0,1}: -[y log p + (1-y) log(1-p)]
            s += -(y[i] * log(max(p, 1e-15)) + (1.0 - y[i]) * log(max(1.0 - p, 1e-15)));
        }
        // Average + L2 penalty
        double reg = 0.0;
        for (int j = 0; j < n; ++j) reg += theta[j] * theta[j];
        return s / m + 0.5 * lambda * reg;
    }

    Vec gradient(const Vec &theta) const {
        Vec z = matVec(X, theta);
        Vec p(m, 0.0);
        for (int i = 0; i < m; ++i) p[i] = sigmoid(z[i]);
        // grad = X^T (p - y) / m + lambda * theta
        for (int i = 0; i < m; ++i) p[i] -= y[i];
        Vec grad = matTVec(X, p);
        for (double &gi : grad) gi /= m;
        for (int j = 0; j < n; ++j) grad[j] += lambda * theta[j];
        return grad;
    }

    Mat hessian(const Vec &theta) const {
        Vec z = matVec(X, theta);
        Vec w(m, 0.0);
        for (int i = 0; i < m; ++i) {
            double p = sigmoid(z[i]);
            w[i] = p * (1.0 - p);  // diagonal entries of W
        }
        // Compute H = X^T W X / m + lambda I
        Mat H(n, Vec(n, 0.0));
        // First compute X^T W (n x m)
        Mat XtW(n, Vec(m, 0.0));
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                XtW[j][i] = X[i][j] * w[i];
        // Then H = (X^T W) X / m
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                double s = 0.0;
                for (int i = 0; i < m; ++i) s += XtW[j][i] * X[i][k];
                H[j][k] = s / m;
            }
        }
        // Add lambda I
        for (int j = 0; j < n; ++j) H[j][j] += lambda;
        return H;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Tiny synthetic dataset for binary classification:
    // points near (2,2) -> class 1, near (-2,-2) -> class 0
    int m = 100;  // samples
    int n = 2;    // features
    Mat X(m, Vec(n));
    Vec y(m);
    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.5);

    for (int i = 0; i < m; ++i) {
        if (i < m / 2) {
            X[i][0] = -2.0 + noise(rng);
            X[i][1] = -2.0 + noise(rng);
            y[i] = 0.0;
        } else {
            X[i][0] = 2.0 + noise(rng);
            X[i][1] = 2.0 + noise(rng);
            y[i] = 1.0;
        }
    }

    double lambda = 1e-2;  // L2 regularization
    Logistic lg(X, y, lambda);

    Vec theta(n, 0.0);  // initialize at zero
    const int max_iter = 30;
    const double tol_g = 1e-6;  // gradient norm tolerance

    const double c = 1e-4;    // Armijo constant
    const double beta = 0.5;  // backtracking factor

    cout.setf(std::ios::fixed);
    cout << setprecision(6);

    for (int it = 0; it < max_iter; ++it) {
        double L = lg.loss(theta);
        Vec g = lg.gradient(theta);
        // Infinity norm of the gradient
        double gnorm = 0.0;
        for (double gi : g) gnorm = max(gnorm, fabs(gi));
        if (gnorm < tol_g) {
            cout << "Converged by gradient at iter " << it << ", loss=" << L << "\n";
            break;
        }

        // Assemble the Hessian; if Cholesky fails (rare here), add damping and retry
        Mat H = lg.hessian(theta);
        double extra_damp = 0.0;
        Mat Hcopy = H;
        if (!choleskyDecompose(Hcopy)) {
            // Add increasing damping until SPD
            extra_damp = 1e-6;
            while (extra_damp < 1e6) {
                Mat Hd = H;
                for (int j = 0; j < n; ++j) Hd[j][j] += extra_damp;
                if (choleskyDecompose(Hd)) { Hcopy = Hd; break; }
                extra_damp *= 10.0;
            }
        }

        // Solve H * p = g (Newton system) and set direction d = -p
        Vec p = choleskySolve(Hcopy, g);
        Vec d(n, 0.0);
        for (int j = 0; j < n; ++j) d[j] = -p[j];

        // Backtracking line search (Armijo)
        double alpha = 1.0;
        double gTd = 0.0;
        for (int j = 0; j < n; ++j) gTd += g[j] * d[j];
        while (true) {
            Vec trial = theta;
            for (int j = 0; j < n; ++j) trial[j] += alpha * d[j];
            double Ltrial = lg.loss(trial);
            if (Ltrial <= L + c * alpha * gTd) {
                theta = trial;
                L = Ltrial;
                break;
            }
            alpha *= beta;
            if (alpha < 1e-10) break;  // safeguard
        }

        cout << "iter=" << it
             << " loss=" << lg.loss(theta)
             << " |g|_inf=" << gnorm
             << " step_alpha=" << alpha
             << " extra_damp=" << extra_damp
             << " theta=[" << theta[0] << ", " << theta[1] << "]\n";
    }

    // Report final parameters
    cout << "Final theta: [" << theta[0] << ", " << theta[1] << "]\n";

    return 0;
}
This example implements Newton's method for L2-regularized logistic regression. It computes the gradient and Hessian exactly, solves the Newton system with a dense Cholesky factorization (exploiting SPD structure), and uses backtracking line search for robust decrease. If the Hessian is not SPD due to numerical issues, extra diagonal damping is added until Cholesky succeeds. This demonstrates how second-order curvature dramatically accelerates convergence on smooth, moderately sized problems.