
Newton's Method & Second-Order Optimization

Key Points

  • Newton's method uses both the gradient and the Hessian to take steps that aim directly at the local optimum by fitting a quadratic model of the loss around the current point.
  • The Newton update solves a linear system H(θ)Δ = ∇L(θ) and moves θ ← θ − Δ; this typically converges quadratically near the optimum.
  • You should never compute H⁻¹ explicitly; instead, solve the linear system using methods like Cholesky or conjugate gradients.
  • Because Hessians can be indefinite or ill-conditioned, damping (adding λI), line search, or trust regions are essential for stability.
  • Second-order methods can dramatically outperform first-order methods on well-scaled, moderately sized problems, but each iteration is more expensive.
  • For large-scale problems, use approximations like Gauss–Newton, Newton–CG, or quasi-Newton (BFGS) to avoid forming dense Hessians.
  • Careful stopping criteria (gradient norm, Newton decrement, or small steps) prevent over-iteration and numerical issues.
  • Feature scaling and regularization often make the Hessian better conditioned, improving convergence and numerical stability.

Prerequisites

  • Multivariable calculus — Gradients, Hessians, and Taylor expansions require understanding of partial derivatives and second-order terms.
  • Linear algebra — Solving HΔ = g, Cholesky factorization, eigenvalues, and positive definiteness are core to Newton’s method.
  • Optimization basics — Concepts like convexity, line search, and stopping criteria frame when and how Newton’s method converges.
  • Numerical linear algebra — Stability concerns such as conditioning, factorization choices, and avoiding explicit matrix inverses are essential.
  • Probability and logistic regression — For the logistic regression example, understanding likelihoods and the sigmoid function helps interpret the gradient and Hessian.
  • Gradient descent — Provides a baseline first-order method to compare against Newton’s acceleration and per-iteration costs.
  • Matrix calculus — Compactly derives gradients and Hessians for vector-valued models like generalized linear models.
  • Backtracking line search — Implements step-size selection that guarantees sufficient decrease for damped Newton.
  • Conditioning and feature scaling — Explains why scaling and regularization improve Hessian definiteness and solver stability.

Detailed Explanation


01 Overview

Newton's method is a second-order optimization technique that uses curvature information to accelerate convergence. Instead of moving only in the direction of steepest descent (the gradient), it builds a local quadratic approximation of the objective function around the current point using both the gradient and the Hessian (the matrix of second derivatives). The method then jumps to the minimizer of this quadratic model, which mathematically amounts to solving a linear system involving the Hessian. Near a well-behaved minimum, Newton’s method typically achieves quadratic convergence, meaning the number of correct digits roughly doubles each iteration. This makes it especially attractive when high precision is required. However, the method’s per-iteration cost can be high because computing and factorizing the Hessian scales poorly with dimension. Moreover, the Hessian may not be positive definite away from the minimum, leading to unstable steps unless safeguards such as damping, line search, or trust regions are used. In practice, Newton’s method shines on small to medium-sized, smooth problems (e.g., logistic regression with a few thousand features), and its ideas motivate a family of second-order and quasi-Newton algorithms for large-scale settings.

02 Intuition & Analogies

Imagine hiking in mountainous terrain while trying to reach the lowest point. Gradient descent is like only looking at the local slope: if it’s steep, you take a big step downhill; if it’s gentle, you take a small one. This works but can be slow when the terrain has valleys shaped like narrow bowls—your steps zig-zag and progress stalls. Newton’s method brings a curvature-aware map. Besides the slope (gradient), it measures how the slope itself changes (curvature via the Hessian). With both, you can fit a local bowl (a quadratic) to the terrain and jump near the bottom of that bowl in one go. If your map is accurate—meaning you’re close to the true minimum—the jump lands you extremely close, so you converge very quickly. But if the bowl you fit is upside-down (negative curvature) or lopsided (ill-conditioned), a naive jump can shoot you uphill or sideways. That’s why experienced hikers add brakes: they shorten the step (line search), slightly flatten the bowl by adding a cushion λI (damping), or only trust the model inside a safe radius (trust region). For very large maps, drawing the full bowl is too costly, so you approximate it or compute the jump direction without explicitly building the entire bowl, much like using landmarks instead of surveying every rock.

03 Formal Definition

Let L(θ) be a twice continuously differentiable objective. The second-order Taylor expansion at θ_t is L(θ_t + Δ) ≈ L(θ_t) + ∇L(θ_t)⊤Δ + (1/2) Δ⊤H(θ_t)Δ, where H(θ_t) is the Hessian. Minimizing this quadratic model yields the Newton step Δ_t by solving H(θ_t) Δ_t = ∇L(θ_t). The (full) Newton update is then θ_{t+1} = θ_t − Δ_t. If H(θ_t) is positive definite and L is sufficiently smooth with a Lipschitz-continuous Hessian, then in a neighborhood of a non-degenerate minimizer θ⋆, damped Newton with an appropriate line search attains quadratic convergence: ∥θ_{t+1} − θ⋆∥ ≤ C∥θ_t − θ⋆∥² for some constant C. Practical variants include damped Newton (solve (H + λI)Δ = ∇L), trust-region Newton (minimize the quadratic model subject to ∥Δ∥ ≤ Δ_max), Gauss–Newton (approximate H by J⊤J for least-squares), and Newton–CG (solve the Newton system using conjugate gradients without forming H explicitly).

04 When to Use

Use Newton’s method when your objective is smooth (twice differentiable), the Hessian is available or can be computed efficiently, and the parameter dimension is moderate so that solving linear systems is affordable. Classic use cases include fitting generalized linear models (e.g., logistic regression), maximum likelihood with well-behaved curvature, unconstrained convex optimization where the Hessian is positive definite, and problems requiring high-accuracy solutions. It is especially powerful near the optimum, where quadratic convergence provides a rapid finish. Consider damped or trust-region Newton when far from the solution, the Hessian may be indefinite, or the model is poorly scaled. For large-scale or high-dimensional settings where forming or factorizing H is too expensive, switch to approximations: Gauss–Newton for least-squares, quasi-Newton methods like BFGS/L-BFGS, or Newton–CG to compute the Newton direction via matrix–vector products. Always prefer solving linear systems over explicitly inverting the Hessian, and incorporate line search or trust regions to ensure robust progress.

⚠️Common Mistakes

  • Explicitly computing H⁻¹: This is numerically unstable and wasteful. Always solve HΔ = g using a factorization (e.g., Cholesky for SPD matrices) or an iterative solver (CG) instead of forming the inverse.
  • Ignoring indefiniteness: Taking a full Newton step when H is indefinite can increase the objective. Use damping (H + λI), line search (Armijo/Wolfe), or trust regions to handle negative curvature.
  • Skipping scaling/regularization: Poorly scaled features lead to ill-conditioned Hessians and erratic steps. Standardize inputs and, where appropriate, add L2 regularization to improve conditioning.
  • Weak stopping criteria: Stopping solely on small parameter changes can be misleading. Monitor gradient norm, Newton decrement, and objective decrease to avoid premature or stalled termination.
  • Incorrect derivatives: Analytical gradients/Hessians must be carefully derived and implemented; otherwise, Newton can diverge quickly. Validate with finite differences on small problems.
  • Overbuilding the Hessian: For large n, forming dense H is prohibitive. Use Hessian–vector products and iterative solvers (Newton–CG) or quasi-Newton updates.
  • No safeguards on step length: Full steps may overshoot. Backtracking line search with Armijo or Wolfe conditions provides reliable decrease.

Key Formulas

Newton Update

θ_{t+1} = θ_t − H(θ_t)⁻¹ ∇L(θ_t)

Explanation: Update parameters by subtracting the Newton step, which solves a quadratic approximation of the loss. In practice, solve the linear system HΔ = ∇ L instead of computing the inverse.

Second-Order Taylor Expansion

L(θ + Δ) ≈ L(θ) + ∇L(θ)⊤Δ + (1/2) Δ⊤H(θ)Δ

Explanation: Locally approximates the objective by a quadratic. Minimizing this quadratic gives the Newton step.

Linear System for the Newton Step

H(θ)Δ=∇L(θ)

Explanation: The Newton direction is the solution of this linear system. Solving it avoids explicitly forming the inverse of H.

Damped Newton Step

θ_{t+1} = θ_t − α_t Δ_t,  α_t ∈ (0, 1]

Explanation: A line search chooses a step size that ensures sufficient decrease. This stabilizes Newton’s method when far from the minimizer.

Armijo Condition

L(θ − αΔ) ≤ L(θ) − c α ∇L(θ)⊤Δ,  c ∈ (0, 1)

Explanation: Guarantees that the step length provides a conservative decrease relative to the predicted linear model. Used in backtracking line search.

Newton Decrement

λ²(θ) = ∇L(θ)⊤ H(θ)⁻¹ ∇L(θ)

Explanation: Measures predicted improvement from the Newton step. A small decrement indicates near-optimality and can be used as a stopping test.

Hessian Entries

H_{ij}(θ) = ∂²L(θ) / (∂θ_i ∂θ_j)

Explanation: Defines the Hessian as second partial derivatives. Positive definiteness implies local convexity.

Logistic Regression Gradient

∇L(θ) = X⊤(σ(Xθ) − y)/m + λθ

Explanation: For m examples with features X and labels y, the gradient combines the data term with L2 regularization.

Logistic Regression Hessian

H(θ) = X⊤WX/m + λI,  W = diag(σ(z) ⊙ (1 − σ(z))),  z = Xθ

Explanation: The Hessian is SPD due to the positive weights W and L2 regularization. This makes Cholesky factorization appropriate.

Cholesky Solve Complexity

O(n³)

Explanation: Factoring an n × n dense Hessian costs on the order of n³ operations. This dominates the per-iteration cost for moderate n.

Complexity Analysis

Per iteration, Newton’s method requires: (1) computing the gradient g = ∇L(θ), (2) forming (or approximating) the Hessian H(θ), and (3) solving the linear system HΔ = g. If the problem has n parameters and we explicitly build a dense Hessian, forming H typically costs O(n²) time plus any data-dependent cost (e.g., O(mn²) for logistic regression with m samples), and O(n²) space. Solving the linear system with a dense Cholesky factorization takes O(n³) time and O(n²) space. Thus, for moderate n, the factorization usually dominates. If H is structured (sparse, banded), specialized factorizations can reduce cost substantially (e.g., nearly linear in the number of nonzeros).

When n is large, it is often better to avoid forming H and instead use Hessian–vector products in an iterative solver (Newton–CG): each CG iteration costs the time of one Hv product, and only O(n) additional space is needed; the total cost becomes O(k_CG · T_Hv) per outer iteration, where k_CG is the number of CG steps and T_Hv is the cost of one Hessian–vector product. Line search or trust-region logic adds a small overhead proportional to the number of trial evaluations per iteration. Overall, for dense problems, the total time over T iterations is roughly O(T(n³ + mn²)) with O(n²) space, whereas for matrix-free Newton–CG it is about O(T · k_CG · T_Hv) time with O(n) space. In practice, scaling and regularization improve conditioning, reducing both iteration counts and backtracking steps.

Code Examples

Univariate damped Newton method for minimizing a smooth function
#include <bits/stdc++.h>
using namespace std;

// Example function: f(x) = (x - 2)^2 + 0.5 * sin(3x)
// f'(x) = 2(x - 2) + 1.5 * cos(3x)
// f''(x) = 2 - 4.5 * sin(3x)

struct Function1D {
    static double f(double x) {
        return (x - 2.0) * (x - 2.0) + 0.5 * sin(3.0 * x);
    }
    static double df(double x) {
        return 2.0 * (x - 2.0) + 1.5 * cos(3.0 * x);
    }
    static double d2f(double x) {
        return 2.0 - 4.5 * sin(3.0 * x);
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    double x = -1.0;                // initial guess
    const int max_iter = 50;        // maximum iterations
    const double tol_g = 1e-8;      // gradient tolerance
    const double tol_step = 1e-12;  // step size tolerance

    // Backtracking line search parameters
    const double c = 1e-4;    // Armijo constant
    const double beta = 0.5;  // reduction factor

    // Damping to handle small/negative curvature
    double lambda = 0.0; // start undamped; adapt if needed

    cout.setf(std::ios::fixed); cout << setprecision(8);

    for (int it = 0; it < max_iter; ++it) {
        double fx = Function1D::f(x);
        double g = Function1D::df(x);
        double h = Function1D::d2f(x);

        if (fabs(g) < tol_g) {
            cout << "Converged by gradient at iter " << it << ", x=" << x << ", f(x)=" << fx << "\n";
            break;
        }

        // Ensure a stable denominator using damping if curvature is too small or negative
        double denom = h + lambda;
        if (denom <= 1e-12) {
            // Increase damping until curvature is acceptable
            lambda = max(1e-6, 2.0 * (1.0 - h));
            denom = h + lambda;
        }

        // Newton direction (for minimization): p = -g / (h + lambda)
        double p = -g / denom;

        // Backtracking line search for the Armijo condition: f(x + alpha*p) <= f(x) + c*alpha*g*p
        double alpha = 1.0;
        while (true) {
            double x_trial = x + alpha * p;
            double f_trial = Function1D::f(x_trial);
            if (f_trial <= fx + c * alpha * g * p) break;
            alpha *= beta;
            if (alpha < 1e-12) break; // safeguard
        }

        double step = alpha * p;
        x += step;

        cout << "iter=" << it
             << " x=" << x
             << " f(x)=" << Function1D::f(x)
             << " |g|=" << fabs(Function1D::df(x))
             << " step=" << step
             << " lambda=" << lambda
             << "\n";

        if (fabs(step) < tol_step) {
            cout << "Converged by small step at iter " << it << "\n";
            break;
        }
    }

    return 0;
}
This program minimizes a nonconvex 1D function using Newton’s method with two safeguards: damping (to handle small or negative curvature) and Armijo backtracking (to ensure sufficient decrease). The update direction is p = −g/(h + λ). A line search scales the step to avoid overshooting. The method converges rapidly once it reaches a region where the quadratic model is accurate.

Time: O(T) function/derivative evaluations (constant work per iteration). Space: O(1).
Multivariate Newton method for L2-regularized logistic regression using Cholesky
#include <bits/stdc++.h>
using namespace std;

// Dense vector/matrix helpers for small to moderate n
using Vec = vector<double>;
using Mat = vector<vector<double>>;

// Sigmoid function
static inline double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

// Compute X * theta (m x n times n) -> m
Vec matVec(const Mat &X, const Vec &theta) {
    int m = (int)X.size(), n = (int)theta.size();
    Vec z(m, 0.0);
    for (int i = 0; i < m; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += X[i][j] * theta[j];
        z[i] = s;
    }
    return z;
}

// Compute X^T * v (n x m times m) -> n
Vec matTVec(const Mat &X, const Vec &v) {
    int m = (int)X.size(), n = (int)X[0].size();
    Vec r(n, 0.0);
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int i = 0; i < m; ++i) s += X[i][j] * v[i];
        r[j] = s;
    }
    return r;
}

// A += B (both n x n)
void addInPlace(Mat &A, const Mat &B) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i][j] += B[i][j];
}

// A += lambda * I
void addLambdaI(Mat &A, double lambda) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) A[i][i] += lambda;
}

// Cholesky decomposition for SPD matrix A (n x n): A = L L^T
// Returns true on success; L is stored lower-triangular in-place in A
bool choleskyDecompose(Mat &A) {
    int n = (int)A.size();
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = A[i][j];
            for (int k = 0; k < j; ++k)
                sum -= A[i][k] * A[j][k];
            if (i == j) {
                if (sum <= 0.0) return false; // not SPD
                A[i][i] = sqrt(sum);
            } else {
                A[i][j] = sum / A[j][j];
            }
        }
        // Zero out upper triangle for clarity (not required)
        for (int j = i + 1; j < n; ++j) A[i][j] = 0.0;
    }
    return true;
}

// Solve A x = b using the Cholesky factor L (lower-triangular) of A
Vec choleskySolve(const Mat &L, const Vec &b) {
    int n = (int)L.size();
    Vec y(n, 0.0), x(n, 0.0);
    // Forward solve: L y = b
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        for (int k = 0; k < i; ++k) sum -= L[i][k] * y[k];
        y[i] = sum / L[i][i];
    }
    // Backward solve: L^T x = y
    for (int i = n - 1; i >= 0; --i) {
        double sum = y[i];
        for (int k = i + 1; k < n; ++k) sum -= L[k][i] * x[k];
        x[i] = sum / L[i][i];
    }
    return x;
}

// Loss, gradient, and Hessian for L2-regularized logistic regression with labels y in {0,1}:
// L(θ) = -(1/m) * sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ] + (lambda/2) * ||θ||^2,
// where p_i = sigmoid(x_i^T θ).
struct Logistic {
    const Mat &X;   // m x n
    const Vec &y;   // m
    double lambda;  // L2 regularization strength
    int m, n;

    Logistic(const Mat &X_, const Vec &y_, double lambda_) : X(X_), y(y_), lambda(lambda_) {
        m = (int)X.size();
        n = (int)X[0].size();
    }

    double loss(const Vec &theta) const {
        Vec z = matVec(X, theta);
        double s = 0.0;
        for (int i = 0; i < m; ++i) {
            double p = sigmoid(z[i]);
            // NLL for y in {0,1}: -[y log p + (1-y) log(1-p)]
            s += -(y[i] * log(max(p, 1e-15)) + (1.0 - y[i]) * log(max(1.0 - p, 1e-15)));
        }
        // Average + L2 penalty
        double reg = 0.0;
        for (int j = 0; j < n; ++j) reg += theta[j] * theta[j];
        return s / m + 0.5 * lambda * reg;
    }

    Vec gradient(const Vec &theta) const {
        Vec z = matVec(X, theta);
        Vec p(m, 0.0);
        for (int i = 0; i < m; ++i) p[i] = sigmoid(z[i]);
        // grad = X^T (p - y)/m + lambda * theta
        for (int i = 0; i < m; ++i) p[i] -= y[i];
        Vec grad = matTVec(X, p);
        for (int j = 0; j < n; ++j) grad[j] = grad[j] / m + lambda * theta[j];
        return grad;
    }

    Mat hessian(const Vec &theta) const {
        Vec z = matVec(X, theta);
        Vec w(m, 0.0);
        for (int i = 0; i < m; ++i) {
            double p = sigmoid(z[i]);
            w[i] = p * (1.0 - p); // diagonal entries of W
        }
        // Compute H = X^T W X / m + lambda I
        Mat H(n, Vec(n, 0.0));
        // First compute X^T W (n x m)
        Mat XtW(n, Vec(m, 0.0));
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                XtW[j][i] = X[i][j] * w[i];
        // Then H = (X^T W) X / m
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                double s = 0.0;
                for (int i = 0; i < m; ++i) s += XtW[j][i] * X[i][k];
                H[j][k] = s / m;
            }
        }
        // Add lambda I
        for (int j = 0; j < n; ++j) H[j][j] += lambda;
        return H;
    }
};

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    // Create a tiny synthetic dataset for binary classification:
    // points near (2,2) -> class 1, near (-2,-2) -> class 0
    int m = 100; // samples
    int n = 2;   // features
    Mat X(m, Vec(n));
    Vec y(m);
    std::mt19937 rng(42);
    std::normal_distribution<double> noise(0.0, 0.5);

    for (int i = 0; i < m; ++i) {
        if (i < m / 2) {
            X[i][0] = -2.0 + noise(rng);
            X[i][1] = -2.0 + noise(rng);
            y[i] = 0.0;
        } else {
            X[i][0] = 2.0 + noise(rng);
            X[i][1] = 2.0 + noise(rng);
            y[i] = 1.0;
        }
    }

    double lambda = 1e-2; // L2 regularization
    Logistic lg(X, y, lambda);

    Vec theta(n, 0.0); // initialize at zero
    const int max_iter = 30;
    const double tol_g = 1e-6; // gradient norm tolerance

    const double c = 1e-4;   // Armijo constant
    const double beta = 0.5; // backtracking factor

    cout.setf(std::ios::fixed); cout << setprecision(6);

    for (int it = 0; it < max_iter; ++it) {
        double L = lg.loss(theta);
        Vec g = lg.gradient(theta);
        // Gradient infinity norm
        double gnorm = 0.0; for (double gi : g) gnorm = max(gnorm, fabs(gi));
        if (gnorm < tol_g) {
            cout << "Converged by gradient at iter " << it << ", loss=" << L << "\n";
            break;
        }

        // Assemble the Hessian; apply extra diagonal damping only if needed
        Mat H = lg.hessian(theta);
        // Try Cholesky; if it fails (rare here), add damping and retry
        double extra_damp = 0.0;
        Mat Hcopy = H;
        if (!choleskyDecompose(Hcopy)) {
            // Add increasing damping until the factorization succeeds (SPD)
            extra_damp = 1e-6;
            while (extra_damp < 1e6) {
                Mat Hd = H;
                for (int j = 0; j < n; ++j) Hd[j][j] += extra_damp;
                if (choleskyDecompose(Hd)) { Hcopy = Hd; break; }
                extra_damp *= 10.0;
            }
        }

        // Solve H p = g (Newton system) and take the descent direction d = -p
        Vec p = choleskySolve(Hcopy, g);
        Vec d(n, 0.0);
        for (int j = 0; j < n; ++j) d[j] = -p[j];

        // Backtracking line search (Armijo)
        double alpha = 1.0;
        double gTd = 0.0; for (int j = 0; j < n; ++j) gTd += g[j] * d[j];
        while (true) {
            Vec trial = theta;
            for (int j = 0; j < n; ++j) trial[j] += alpha * d[j];
            double Ltrial = lg.loss(trial);
            if (Ltrial <= L + c * alpha * gTd) {
                theta = trial;
                L = Ltrial;
                break;
            }
            alpha *= beta;
            if (alpha < 1e-10) break; // safeguard
        }

        cout << "iter=" << it
             << " loss=" << L
             << " |g|_inf=" << gnorm
             << " step_alpha=" << alpha
             << " extra_damp=" << extra_damp
             << " theta=[" << theta[0] << ", " << theta[1] << "]\n";
    }

    // Report final parameters
    cout << "Final theta: [" << theta[0] << ", " << theta[1] << "]\n";

    return 0;
}

This example implements Newton’s method for L2-regularized logistic regression. It computes the gradient and Hessian exactly, solves the Newton system with a dense Cholesky factorization (exploiting SPD structure), and uses backtracking line search for robust decrease. If the Hessian is not SPD due to numerical issues, we add extra diagonal damping until Cholesky succeeds. This demonstrates how second-order curvature dramatically accelerates convergence on smooth, moderately sized problems.

Time: Per iteration, O(mn) for the gradient, O(mn²) to form H, and O(n³) for the Cholesky factorization and solve. Space: O(n²) for the dense Hessian plus O(mn) to hold the data.
Tags: Newton's method, second-order optimization, Hessian, gradient, Cholesky, line search, damping, trust region, logistic regression, Gauss–Newton, Newton–CG, quasi-Newton, Armijo, regularization, curvature