Newton's Method & Second-Order Optimization
Key Points
- Newton's method uses both the gradient and the Hessian to take steps that aim directly at the local optimum by fitting a quadratic model of the loss around the current point.
- The Newton update solves the linear system H(θ)Δ = ∇L(θ) and moves θ ← θ − Δ; this typically converges quadratically near the optimum.
- You should never compute H^{-1} explicitly; instead, solve the linear system using methods like Cholesky or conjugate gradients.
- Because Hessians can be indefinite or ill-conditioned, safeguards such as damping (adding λI), line search, or trust regions are essential for stability.
- Second-order methods can dramatically outperform first-order methods on well-scaled, moderately sized problems, but each iteration is more expensive.
- For large-scale problems, use approximations like Gauss–Newton, Newton–CG, or quasi-Newton (BFGS) to avoid forming dense Hessians.
- Careful stopping criteria (gradient norm, Newton decrement, or small steps) prevent over-iteration and numerical issues.
- Feature scaling and regularization often make the Hessian better conditioned, improving convergence and numerical stability.
Prerequisites
- Multivariable calculus – Gradients, Hessians, and Taylor expansions require understanding of partial derivatives and second-order terms.
- Linear algebra – Solving H\Delta = g, Cholesky factorization, eigenvalues, and positive definiteness are core to Newton's method.
- Optimization basics – Concepts like convexity, line search, and stopping criteria frame when and how Newton's method converges.
- Numerical linear algebra – Stability concerns such as conditioning, factorization choices, and avoiding explicit matrix inverses are essential.
- Probability and logistic regression – For the logistic regression example, understanding likelihoods and the sigmoid function helps interpret the gradient and Hessian.
- Gradient descent – Provides a baseline first-order method to compare against Newton's acceleration and per-iteration costs.
- Matrix calculus – Compactly derives gradients/Hessians for vector-valued models like generalized linear models.
- Backtracking line search – Implements step-size selection that guarantees sufficient decrease for damped Newton.
- Conditioning and feature scaling – Explains why scaling and regularization improve Hessian definiteness and solver stability.
Detailed Explanation
01 Overview
Newton's method is a second-order optimization technique that uses curvature information to accelerate convergence. Instead of moving only in the direction of steepest descent (the gradient), it builds a local quadratic approximation of the objective function around the current point using both the gradient and the Hessian (the matrix of second derivatives). The method then jumps to the minimizer of this quadratic model, which mathematically amounts to solving a linear system involving the Hessian. Near a well-behaved minimum, Newton's method typically achieves quadratic convergence, meaning the number of correct digits roughly doubles each iteration. This makes it especially attractive when high precision is required. However, the method's per-iteration cost can be high because computing and factorizing the Hessian scales poorly with dimension. Moreover, the Hessian may not be positive definite away from the minimum, leading to unstable steps unless safeguards such as damping, line search, or trust regions are used. In practice, Newton's method shines on small to medium-sized, smooth problems (e.g., logistic regression with a few thousand features), and its ideas motivate a family of second-order and quasi-Newton algorithms for large-scale settings.
02 Intuition & Analogies
Imagine hiking in mountainous terrain while trying to reach the lowest point. Gradient descent is like only looking at the local slope: if it's steep, you take a big step downhill; if it's gentle, you take a small one. This works but can be slow when the terrain has valleys shaped like narrow bowls: your steps zig-zag and progress stalls. Newton's method brings a curvature-aware map. Besides the slope (gradient), it measures how the slope itself changes (curvature via the Hessian). With both, you can fit a local bowl (a quadratic) to the terrain and jump near the bottom of that bowl in one go. If your map is accurate, meaning you're close to the true minimum, the jump lands you extremely close, so you converge very quickly. But if the bowl you fit is upside-down (negative curvature) or lopsided (ill-conditioned), a naive jump can shoot you uphill or sideways. That's why experienced hikers add brakes: they shorten the step (line search), slightly flatten the bowl by adding a cushion λI (damping), or only trust the model inside a safe radius (trust region). For very large maps, drawing the full bowl is too costly, so you approximate it or compute the jump direction without explicitly building the entire bowl, much like using landmarks instead of surveying every rock.
03 Formal Definition
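Formally, Newton's method minimizes a second-order Taylor model of the loss at each iterate; the statement below is the standard one, consistent with the Key Formulas later in this page:

```latex
% Quadratic model of L around the current iterate \theta_k:
L(\theta_k + \Delta) \;\approx\; L(\theta_k)
  + \nabla L(\theta_k)^{\top}\Delta
  + \tfrac{1}{2}\,\Delta^{\top} H(\theta_k)\,\Delta

% Minimizing the model over \Delta (with H(\theta_k) positive definite)
% yields the Newton system and the update:
H(\theta_k)\,\Delta_k = \nabla L(\theta_k),
\qquad
\theta_{k+1} = \theta_k - \Delta_k
```

When H is positive definite, Δ_k is the unique minimizer of the quadratic model; in practice Δ_k is obtained by solving the linear system, never by forming H^{-1}.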
04 When to Use
Use Newton's method when your objective is smooth (twice differentiable), the Hessian is available or can be computed efficiently, and the parameter dimension is moderate so that solving linear systems is affordable. Classic use cases include fitting generalized linear models (e.g., logistic regression), maximum likelihood with well-behaved curvature, unconstrained convex optimization where the Hessian is positive definite, and problems requiring high-accuracy solutions. It is especially powerful near the optimum, where quadratic convergence provides a rapid finish. Consider damped or trust-region Newton when far from the solution, the Hessian may be indefinite, or the model is poorly scaled. For large-scale or high-dimensional settings where forming or factorizing H is too expensive, switch to approximations: Gauss–Newton for least-squares, quasi-Newton methods like BFGS/L-BFGS, or Newton–CG to compute the Newton direction via matrix–vector products. Always prefer solving linear systems over explicitly inverting the Hessian, and incorporate line search or trust regions to ensure robust progress.
â ď¸Common Mistakes
- Explicitly computing H^{-1}: This is numerically unstable and wasteful. Always solve H\Delta = g using factorization (e.g., Cholesky for SPD) or iterative solvers (CG) instead of forming the inverse.
- Ignoring indefiniteness: Taking a full Newton step when H is indefinite can increase the objective. Use damping (H+\lambda I), line search (Armijo/Wolfe), or trust regions to handle negative curvature.
- Skipping scaling/regularization: Poorly scaled features lead to ill-conditioned Hessians and erratic steps. Standardize inputs and, where appropriate, add L2 regularization to improve conditioning.
- Weak stopping criteria: Stopping solely on small parameter changes can be misleading. Monitor gradient norm, Newton decrement, and objective decrease to avoid premature or stalled termination.
- Incorrect derivatives: Analytical gradients/Hessians must be carefully derived and implemented; otherwise, Newton can diverge quickly. Validate with finite differences on small problems.
- Overbuilding the Hessian: For large n, forming dense H is prohibitive. Use Hessian–vector products and iterative solvers (Newton–CG) or quasi-Newton updates.
- No safeguards on step length: Full steps may overshoot. Backtracking line search with Armijo or Wolfe conditions provides reliable decrease.
Key Formulas
Newton Update
\theta_{k+1} = \theta_k - H(\theta_k)^{-1} \nabla L(\theta_k)
Explanation: Update parameters by subtracting the Newton step, which minimizes a quadratic approximation of the loss. In practice, solve the linear system H\Delta = \nabla L instead of computing the inverse.
Second-Order Taylor Expansion
L(\theta + \Delta) \approx L(\theta) + \nabla L(\theta)^{\top}\Delta + \tfrac{1}{2}\Delta^{\top} H(\theta)\,\Delta
Explanation: Locally approximates the objective by a quadratic. Minimizing this quadratic gives the Newton step.
Normal Equations for Newton Step
H(\theta)\,\Delta = \nabla L(\theta)
Explanation: The Newton direction is the solution of this linear system. Solving it avoids explicitly forming the inverse of H.
Damped Newton Step
\theta_{k+1} = \theta_k - \alpha_k \Delta_k, \quad \alpha_k \in (0, 1]
Explanation: A line search chooses the step size \alpha_k so that it ensures sufficient decrease. This stabilizes Newton's method when far from the minimizer.
Armijo Condition
L(\theta + \alpha d) \le L(\theta) + c\,\alpha\,\nabla L(\theta)^{\top} d, \quad c \in (0, 1)
Explanation: Guarantees that the step length provides a conservative decrease relative to the predicted linear model. Used in backtracking line search.
Newton Decrement
\lambda(\theta)^2 = \nabla L(\theta)^{\top} H(\theta)^{-1} \nabla L(\theta)
Explanation: Measures predicted improvement from the Newton step. A small decrement indicates near-optimality and can be used as a stopping test.
Hessian Entries
H_{ij} = \dfrac{\partial^2 L}{\partial \theta_i \, \partial \theta_j}
Explanation: Defines the Hessian as the matrix of second partial derivatives. Positive definiteness implies local convexity.
Logistic Regression Gradient
\nabla L(\theta) = \tfrac{1}{m} X^{\top}(\sigma(X\theta) - y) + \lambda\theta
Explanation: For m examples with features X and labels y \in \{0, 1\}, the gradient combines the data term with L2 regularization.
Logistic Regression Hessian
H = \tfrac{1}{m} X^{\top} W X + \lambda I, \quad W = \mathrm{diag}\big(p_i (1 - p_i)\big), \; p_i = \sigma(x_i^{\top}\theta)
Explanation: The Hessian is SPD due to the positive weights W and L2 regularization. This makes Cholesky factorization appropriate.
Cholesky Solve Complexity
O(n^3)
Explanation: Factoring an n × n dense Hessian costs on the order of n^3 operations. This dominates the per-iteration cost for moderate n.
Complexity Analysis
Per iteration, the dominant costs are assembling the Hessian (roughly O(m n^2) for the logistic regression Hessian X^T W X over m examples) and factorizing it (O(n^3) for dense Cholesky), plus O(n^2) memory for the dense matrix. Gradient descent pays only O(mn) per iteration but typically needs far more iterations; because Newton converges quadratically near the optimum, a handful of O(n^3) solves is often the better trade for moderate n.
Code Examples
#include <bits/stdc++.h>
using namespace std;

// Example function: f(x) = (x - 2)^2 + 0.5 * sin(3x)
// f'(x)  = 2(x - 2) + 1.5 * cos(3x)
// f''(x) = 2 - 4.5 * sin(3x)

struct Function1D {
    static double f(double x) {
        return (x - 2.0) * (x - 2.0) + 0.5 * sin(3.0 * x);
    }
    static double df(double x) {
        return 2.0 * (x - 2.0) + 1.5 * cos(3.0 * x);
    }
    static double d2f(double x) {
        return 2.0 - 4.5 * sin(3.0 * x);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double x = -1.0;                // initial guess
    const int max_iter = 50;        // maximum iterations
    const double tol_g = 1e-8;      // gradient tolerance
    const double tol_step = 1e-12;  // step size tolerance

    // Backtracking line search parameters
    const double c = 1e-4;    // Armijo constant
    const double beta = 0.5;  // reduction factor

    // Damping to handle small/negative curvature
    double lambda = 0.0;  // start undamped; adapt if needed

    cout.setf(std::ios::fixed);
    cout << setprecision(8);

    for (int it = 0; it < max_iter; ++it) {
        double fx = Function1D::f(x);
        double g = Function1D::df(x);
        double h = Function1D::d2f(x);

        if (fabs(g) < tol_g) {
            cout << "Converged by gradient at iter " << it << ", x=" << x << ", f(x)=" << fx << "\n";
            break;
        }

        // Ensure a stable denominator using damping if curvature is too small or negative
        double denom = h + lambda;
        if (denom <= 1e-12) {
            // Increase damping until curvature is acceptable
            lambda = max(1e-6, 2.0 * (1.0 - h));
            denom = h + lambda;
        }

        // Newton direction (for minimization): p = -g / (h + lambda)
        double p = -g / denom;

        // Backtracking line search to satisfy the Armijo condition:
        // f(x + alpha*p) <= f(x) + c * alpha * g * p
        double alpha = 1.0;
        while (true) {
            double x_trial = x + alpha * p;
            double f_trial = Function1D::f(x_trial);
            if (f_trial <= fx + c * alpha * g * p) break;
            alpha *= beta;
            if (alpha < 1e-12) break;  // safeguard
        }

        double step = alpha * p;
        x += step;

        cout << "iter=" << it
             << " x=" << x
             << " f(x)=" << Function1D::f(x)
             << " |g|=" << fabs(Function1D::df(x))
             << " step=" << step
             << " lambda=" << lambda
             << "\n";

        if (fabs(step) < tol_step) {
            cout << "Converged by small step at iter " << it << "\n";
            break;
        }
    }

    return 0;
}
This program minimizes a nonconvex 1D function using Newton's method with two safeguards: damping (to handle small or negative curvature) and Armijo backtracking (to ensure sufficient decrease). The update direction is p = -g/(h+\lambda). A line search scales the step to avoid overshooting. The method converges rapidly once it reaches a region where the quadratic model is accurate.
#include <bits/stdc++.h>
using namespace std;

// Dense vector/matrix helpers for small to moderate n
using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Sigmoid function
static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

// Compute X * theta (m x n times n) -> m
Vec matVec(const Mat &X, const Vec &theta) {
    int m = (int)X.size(), n = (int)theta.size();
    Vec z(m, 0.0);
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += X[i][j] * theta[j];
        z[i] = s;
    }
    return z;
}

// Compute X^T * v (n x m times m) -> n
Vec matTVec(const Mat &X, const Vec &v) {
    int m = (int)X.size(), n = (int)X[0].size();
    Vec r(n, 0.0);
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int i = 0; i < m; ++i) s += X[i][j] * v[i];
        r[j] = s;
    }
    return r;
}

// A += B (both n x n)
void addInPlace(Mat &A, const Mat &B) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i][j] += B[i][j];
}

// A += lambda * I
void addLambdaI(Mat &A, double lambda) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) A[i][i] += lambda;
}

// Cholesky decomposition for SPD matrix A (n x n): A = L L^T
// Returns true on success; L is stored lower-triangular in-place in A
bool choleskyDecompose(Mat &A) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (int k = 0; k < j; ++k)
                sum -= A[i][k] * A[j][k];
            if (i == j) {
                if (sum <= 0.0) return false;  // not SPD
                A[i][i] = sqrt(max(sum, 0.0));
            } else {
                A[i][j] = sum / A[j][j];
            }
        }
        // Zero out upper triangle for clarity (not required)
        for (int j = i + 1; j < n; ++j) A[i][j] = 0.0;
    }
    return true;
}

// Solve A x = b using the Cholesky factor stored in L (lower-triangular)
Vec choleskySolve(const Mat &L, const Vec &b) {
    int n = (int)L.size();
    Vec y(n, 0.0), x(n, 0.0);
    // Forward solve: L y = b
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        for (int k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Backward solve: L^T x = y
    for (int i = n - 1; i >= 0; --i) {
        double sum = y[i];
        for (int k = i + 1; k < n; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// Loss, gradient, and Hessian for L2-regularized logistic regression:
// L(theta) = (1/m) * sum_i NLL_i + (lambda/2) * ||theta||^2,
// implemented with labels y in {0, 1} for convenience.
struct Logistic {
    const Mat &X;   // m x n
    const Vec &y;   // m
    double lambda;  // L2 regularization strength
    int m, n;

    Logistic(const Mat &X_, const Vec &y_, double lambda_) : X(X_), y(y_), lambda(lambda_) {
        m = (int)X.size();
        n = (int)X[0].size();
    }

    double loss(const Vec &theta) const {
        Vec z = matVec(X, theta);
        double s = 0.0;
        for (int i = 0; i < m; ++i) {
            double p = sigmoid(z[i]);
            // NLL for y in {0,1}: -[y log p + (1-y) log(1-p)]
            s += -(y[i] * log(max(p, 1e-15)) + (1.0 - y[i]) * log(max(1.0 - p, 1e-15)));
        }
        // Average + L2 penalty
        double reg = 0.0;
        for (int j = 0; j < n; ++j) reg += theta[j] * theta[j];
        return s / m + 0.5 * lambda * reg;
    }

    Vec gradient(const Vec &theta) const {
        Vec z = matVec(X, theta);
        Vec p(m, 0.0);
        for (int i = 0; i < m; ++i) p[i] = sigmoid(z[i]);
        // grad = X^T (p - y) / m + lambda * theta
        for (int i = 0; i < m; ++i) p[i] -= y[i];
        Vec grad = matTVec(X, p);
        for (double &gi : grad) gi /= m;
        for (int j = 0; j < n; ++j) grad[j] += lambda * theta[j];
        return grad;
    }

    Mat hessian(const Vec &theta) const {
        Vec z = matVec(X, theta);
        Vec w(m, 0.0);
        for (int i = 0; i < m; ++i) {
            double p = sigmoid(z[i]);
            w[i] = p * (1.0 - p);  // diagonal entries of W
        }
        // Compute H = X^T W X / m + lambda I
        Mat H(n, Vec(n, 0.0));
        // First compute X^T W (n x m)
        Mat XtW(n, Vec(m, 0.0));
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                XtW[j][i] = X[i][j] * w[i];
        // Then H = (X^T W) X / m
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                double s = 0.0;
                for (int i = 0; i < m; ++i) s += XtW[j][i] * X[i][k];
                H[j][k] = s / m;
            }
        }
        // Add lambda I
        for (int j = 0; j < n; ++j) H[j][j] += lambda;
        return H;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Tiny synthetic dataset for binary classification:
    // points near (2,2) -> class 1, near (-2,-2) -> class 0
    int m = 100;  // samples
    int n = 2;    // features
    Mat X(m, Vec(n));
    Vec y(m);
    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.5);

    for (int i = 0; i < m; ++i) {
        if (i < m / 2) {
            X[i][0] = -2.0 + noise(rng);
            X[i][1] = -2.0 + noise(rng);
            y[i] = 0.0;
        } else {
            X[i][0] = 2.0 + noise(rng);
            X[i][1] = 2.0 + noise(rng);
            y[i] = 1.0;
        }
    }

    double lambda = 1e-2;  // L2 regularization
    Logistic lg(X, y, lambda);

    Vec theta(n, 0.0);  // initialize at zero
    const int max_iter = 30;
    const double tol_g = 1e-6;  // gradient norm tolerance

    const double c = 1e-4;    // Armijo constant
    const double beta = 0.5;  // backtracking factor

    cout.setf(std::ios::fixed);
    cout << setprecision(6);

    for (int it = 0; it < max_iter; ++it) {
        double L = lg.loss(theta);
        Vec g = lg.gradient(theta);
        // Infinity norm of the gradient
        double gnorm = 0.0;
        for (double gi : g) gnorm = max(gnorm, fabs(gi));
        if (gnorm < tol_g) {
            cout << "Converged by gradient at iter " << it << ", loss=" << L << "\n";
            break;
        }

        // Assemble the Hessian; if Cholesky fails (rare here), add damping and retry
        Mat H = lg.hessian(theta);
        double extra_damp = 0.0;
        Mat Hcopy = H;
        if (!choleskyDecompose(Hcopy)) {
            // Add increasing damping until SPD
            extra_damp = 1e-6;
            while (extra_damp < 1e6) {
                Mat Hd = H;
                for (int j = 0; j < n; ++j) Hd[j][j] += extra_damp;
                if (choleskyDecompose(Hd)) { Hcopy = Hd; break; }
                extra_damp *= 10.0;
            }
        }

        // Solve H * p = g (Newton system) and set direction d = -p
        Vec p = choleskySolve(Hcopy, g);
        Vec d(n, 0.0);
        for (int j = 0; j < n; ++j) d[j] = -p[j];

        // Backtracking line search (Armijo)
        double alpha = 1.0;
        double gTd = 0.0;
        for (int j = 0; j < n; ++j) gTd += g[j] * d[j];
        while (true) {
            Vec trial = theta;
            for (int j = 0; j < n; ++j) trial[j] += alpha * d[j];
            double Ltrial = lg.loss(trial);
            if (Ltrial <= L + c * alpha * gTd) {
                theta = trial;
                L = Ltrial;
                break;
            }
            alpha *= beta;
            if (alpha < 1e-10) break;  // safeguard
        }

        cout << "iter=" << it
             << " loss=" << lg.loss(theta)
             << " |g|_inf=" << gnorm
             << " step_alpha=" << alpha
             << " extra_damp=" << extra_damp
             << " theta=[" << theta[0] << ", " << theta[1] << "]\n";
    }

    // Report final parameters
    cout << "Final theta: [" << theta[0] << ", " << theta[1] << "]\n";

    return 0;
}
This example implements Newton's method for L2-regularized logistic regression. It computes the gradient and Hessian exactly, solves the Newton system with a dense Cholesky factorization (exploiting SPD structure), and uses backtracking line search for robust decrease. If the Hessian is not SPD due to numerical issues, extra diagonal damping is added until Cholesky succeeds. This demonstrates how second-order curvature dramatically accelerates convergence on smooth, moderately sized problems.