MIT 6.7220/15.084 - Nonlinear Optimization (Spring '25) Tue, Apr 22nd 2025
Lecture 17
Hessians, preconditioning, and Newton's method
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)*
So far, we have been concerned with first-order methods, that is, optimization methods that
use gradient information. Starting today, we will discuss second-order methods, which
use not only the gradient but also the Hessian (second-order derivative) of the function
to be optimized. For that, we will now restrict our attention to optimization problems of
the form
$$\min_{x} \; f(x) \qquad \text{where } x \in \mathbb{R}^n,$$
where ๐(๐ฅ) is a twice-differentiable function.
L17.1 From first-order to second-order Taylor approximations
As we mentioned in Lecture 12 when introducing the gradient descent algorithm, a funda-
mental idea for constructing optimization algorithms is to approximate the function to be
optimized by a simpler function that is easier to minimize. In the case of gradient descent, we
picked a direction of movement based on the minimum of the first-order Taylor expansion of
the objective function around the current point. In the case of twice-differentiable functions,
we instead pick the direction of movement by looking at the second-order Taylor expansion
of the objective, which is a more faithful approximation of the function.
The second-order Taylor expansion of a function ๐(๐ฅ) around a point ๐ฅ๐ก is given by
$$f(x) \approx f(x_t) + \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2} \langle x - x_t, \nabla^2 f(x_t)(x - x_t) \rangle,$$
where $\nabla^2 f(x_t)$ is the Hessian matrix of $f$ at $x_t$. The minimizer of this quadratic approximation can be found in closed form by setting its gradient (with respect to $x$) to zero, which gives
$$\nabla f(x_t) + \nabla^2 f(x_t)(x - x_t) = 0 \implies x = x_t - [\nabla^2 f(x_t)]^{-1} \nabla f(x_t).$$
So, we find that by moving from first-order to second-order Taylor approximation, and
assuming that โ2๐(๐ฅ๐ก) is invertible, the natural direction of descent changes from
$$d = \underbrace{-\nabla f(x_t)}_{\substack{\text{using first-order} \\ \text{Taylor approximation}}} \qquad\text{to}\qquad d = \underbrace{-[\nabla^2 f(x_t)]^{-1} \nabla f(x_t)}_{\text{using second-order Taylor approximation}}.$$
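To make the comparison concrete, the following NumPy sketch (an illustration of mine, not part of the original notes) computes both directions for a simple quadratic objective; the test matrix $Q$ and point $x$ are arbitrary choices.

```python
import numpy as np

def first_order_direction(grad):
    """Steepest-descent direction: -grad f(x_t)."""
    return -grad

def second_order_direction(grad, hess):
    """Newton direction: -[hess f(x_t)]^{-1} grad f(x_t).
    Solving the linear system is preferred over forming the inverse explicitly."""
    return -np.linalg.solve(hess, grad)

# Illustrative example: f(x) = 1/2 x^T Q x with an ill-conditioned Q.
Q = np.diag([1.0, 100.0])
x = np.array([1.0, 1.0])
grad, hess = Q @ x, Q

print(first_order_direction(grad))         # dominated by the stiff coordinate
print(second_order_direction(grad, hess))  # points straight at the minimizer x* = 0
```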
L17.1.1 Second-order direction as an example of preconditioning
The second-order direction of descent is obtained by multiplying the negative gradient (that
is, the first-order direction of descent) by the inverse of the Hessian matrix. This operation
is known as preconditioning the gradient by the Hessian. The Hessian preconditioning of
the gradient direction has the effect of making the optimization problem affinely invariant.
In other words, if we apply an affine transformation to the objective function and the initial
point, the second-order direction of descent will also be automatically transformed.
To demonstrate this property, consider the optimization problem
$$\min_{x} \quad f(x) \qquad \text{s.t.} \quad x \in \mathbb{R}^n$$
and suppose that the coordinates of ๐ฅ have been reparametrized via a new coordinate
system ๐ฆ where the correspondence to ๐ฅ is given by ๐ฅ โ ๐ด๐ฆ + ๐.
In the coordinate system of ๐ฆ, the function being minimized is ๐(๐ฆ) โ ๐(๐ด๐ฆ + ๐); so, the
gradient and Hessian of ๐ can be computed as:
$$\nabla_y g(y) = A^\top \nabla_x f(x), \qquad \nabla^2_y g(y) = A^\top \nabla^2_x f(x)\, A.$$
So,
$$\underbrace{-[\nabla^2_y g(y)]^{-1} \nabla_y g(y)}_{d_y} = -[A^\top \nabla^2_x f(x) A]^{-1} (A^\top \nabla_x f(x)) = -A^{-1} [\nabla^2_x f(x)]^{-1} A^{-\top} A^\top \nabla_x f(x) = A^{-1} \cdot \underbrace{\bigl(-[\nabla^2_x f(x)]^{-1} \nabla_x f(x)\bigr)}_{d_x}.$$
Hence, the second-order descent directions, measured with respect to ๐ฆ and ๐ฅ, satisfy
$$d_x = A \, d_y,$$
which mimics the correspondence ๐ฅ โ ๐ด๐ฆ + ๐. So, for example, the update ๐ฅโฒ = ๐ฅ โ ๐๐๐ฅ in
the ๐ฅ coordinate system corresponds to the update
๐ฅโฒ = (๐ด๐ฆ + ๐) โ ๐๐ด๐๐ฆ = ๐ด(๐ฆ โ ๐๐๐ฆ) + ๐ = ๐ด๐ฆโฒ + ๐
in the ๐ฆ coordinate system, preserving the correspondence at all steps. The same property
does not hold for gradient descent. We will take a more in-depth look at preconditioning in the next lecture.
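The affine-invariance property derived above can be checked numerically. The sketch below is an illustration of mine (the test function $f$, matrix $A$, and offset $b$ are arbitrary choices): it verifies that the Newton directions computed in the $x$ and $y$ coordinates satisfy $d_x = A d_y$, while the plain gradient directions do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# A smooth test function f(x) = 1/2 x^T Q x + exp(c^T x), with exact gradient/Hessian.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([0.5, -1.0])
grad_f = lambda x: Q @ x + np.exp(c @ x) * c
hess_f = lambda x: Q + np.exp(c @ x) * np.outer(c, c)

# Affine reparametrization x = A y + b (arbitrary invertible A).
A = np.array([[2.0, 1.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

y = rng.standard_normal(2)
x = A @ y + b

grad_g = A.T @ grad_f(x)          # gradient of g(y) = f(Ay + b)
hess_g = A.T @ hess_f(x) @ A      # Hessian of g

d_x = -np.linalg.solve(hess_f(x), grad_f(x))   # Newton direction in x coordinates
d_y = -np.linalg.solve(hess_g, grad_g)         # Newton direction in y coordinates

print(np.allclose(d_x, A @ d_y))               # True: d_x = A d_y
print(np.allclose(-grad_f(x), A @ (-grad_g)))  # False in general for gradient descent
```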
L17.2 From gradient descent to Newtonโs method
Having established the second-order direction of descent, we can define the natural gener-
alization of the gradient descent algorithm to the second-order setting. This algorithm,
which takes the name damped Newtonโs method, is given by the update rule
$$x_{t+1} = x_t - \eta \, [\nabla^2 f(x_t)]^{-1} \nabla f(x_t). \qquad (1)$$
For the choice ๐ = 1, which corresponds to setting ๐ฅ๐ก+1 to be the minimum of the second-
order approximation centered at ๐ฅ๐ก at each iteration, this algorithm is known simply as
Newtonโs method.
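For concreteness, here is a minimal sketch of update (1) in NumPy (my own illustration; the function handles, stepsize, and iteration count are assumptions supplied by the caller). With $\eta = 1$ it reduces to Newton's method, which minimizes a strictly convex quadratic in a single step, as the usage example checks.

```python
import numpy as np

def damped_newton(grad, hess, x0, eta=1.0, num_iters=20):
    """Damped Newton's method: x_{t+1} = x_t - eta [hess(x_t)]^{-1} grad(x_t).
    With eta = 1 this is (undamped) Newton's method."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_iters):
        step = np.linalg.solve(hess(x), grad(x))
        x = x - eta * step
        iterates.append(x.copy())
    return iterates

# Usage on f(x) = 1/2 x^T Q x - q^T x, whose minimizer Q^{-1} q is reached in one Newton step.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
q = np.array([1.0, 2.0])
iters = damped_newton(lambda x: Q @ x - q, lambda x: Q, np.zeros(2), eta=1.0, num_iters=3)
print(iters[1], np.linalg.solve(Q, q))  # identical after a single step
```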
L17.2.1 Failure mode: lack of curvature
Given that the update rule of Newtonโs method involves the inverse of the Hessian, a
natural concern is what happens when the Hessian is singular or near-singular (for example,
when one eigenvalue is extremely close to zero). Such a situation corresponds to the case
where the function has very little curvature, and the second-order approximation is not a
good approximation of the function. We now show that these concerns are well-founded by
considering a function whose curvature away from the minimum decays extremely fast.
Example L17.1. Consider the function
$$f(x) = \log(e^{2x} + e^{-2x}),$$
[Figure: plot of $f$ over $x \in [-2, 2]$]
whose gradient and Hessian are respectively computed as
$$\nabla f(x) = 2 \cdot \frac{e^{4x} - 1}{e^{4x} + 1}, \qquad \nabla^2 f(x) = 16 \cdot \frac{e^{4x}}{(e^{4x} + 1)^2}.$$
The two tables below show the first iterates ($t = 0, \dots, 8$) of Newton's method and gradient descent when applied to $f(x)$ starting at two nearby initial points: $x_0 = 0.5$ on the left, and $x_0 = 0.7$ on the right. As you can see, the behavior of Newton's method is very different in the two cases: while it converges extremely quickly to the minimum when starting at $x_0 = 0.5$, it diverges when starting at $x_0 = 0.7$.
Starting point x_0 = 0.5:
      t     Newton's method      GD (η = 0.1)
      0          0.5000             0.5000
      1         −0.4067             0.3477
      2          0.2047             0.2274
      3         −0.0237             0.1422
      4      3.53 × 10^-5           0.0868
      5     −1.17 × 10^-13          0.0524
      6     −1.14 × 10^-17          0.0315
      7          0.0000             0.0189
      8          0.0000             0.0114

Starting point x_0 = 0.7:
      t     Newton's method      GD (η = 0.1)
      0          0.7000             0.7000
      1         −1.3480             0.5229
      2         26.1045             0.3669
      3     −2.79 × 10^44           0.2418
      4        diverged             0.1520
      5        diverged             0.0930
      6        diverged             0.0562
      7        diverged             0.0338
      8        diverged             0.0203
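The iterates in the tables can be reproduced with the short script below (a sketch of mine, not the code used for the notes). It uses the equivalent expressions $\nabla f(x) = 2\tanh(2x)$ and $\nabla^2 f(x) = 4/\cosh^2(2x)$ to limit overflow; the diverging run still produces overflow warnings and non-finite values after a few steps, matching the "diverged" entries.

```python
import numpy as np

# Gradient and Hessian of f(x) = log(e^{2x} + e^{-2x}), rewritten with tanh/cosh.
grad = lambda x: 2.0 * np.tanh(2.0 * x)
hess = lambda x: 4.0 / np.cosh(2.0 * x) ** 2

for x0 in (0.5, 0.7):
    x_nt, x_gd = x0, x0
    print(f"--- x0 = {x0} ---")
    for t in range(9):
        print(f"t = {t}:  Newton {x_nt: .4e}   GD {x_gd: .4f}")
        x_nt -= grad(x_nt) / hess(x_nt)   # Newton's method (eta = 1)
        x_gd -= 0.1 * grad(x_gd)          # gradient descent, eta = 0.1
```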
In the next section, we will analyze the convergence properties of Newtonโs method formally.
L17.3 Analysis of Newtonโs method
Example L17.1 shows that Newtonโs method breaks down in the absence of sufficient
curvature. However, it also showed that the method can be extremely efficient when the
starting point is close to the minimum and the minimum has enough curvature. In this
section, we formalize these positive observations quantitatively.
L17.3.1 A key lemma
The analysis of Newtonโs method and damped Newtonโs method hinges on the following
lemma, which relates the geometric shrinking in the distance to optimality of the iterates
to the spectral norm of a specific matrix that depends on the Hessian of the function.
Theorem L17.1 (Newton's method lemma). Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice differentiable with invertible Hessian, and let $x^\star$ be a local minimum of $f$. The distance to optimality of the iterates $x_t$ generated by damped Newton's method with stepsize $\eta > 0$ satisfies
$$x_{t+1} - x^\star = (I - \eta H_t)(x_t - x^\star), \qquad\text{where}\qquad H_t := [\nabla^2 f(x_t)]^{-1} \int_0^1 \nabla^2 f(x^\star + \tau (x_t - x^\star)) \,\mathrm{d}\tau.$$
Proof. The fundamental idea of the proof is to write the second-order direction as a
function of the Hessian matrices on the segment connecting ๐ฅโ to ๐ฅ๐ก.
In particular, we have
$$\begin{aligned}
[\nabla^2 f(x_t)]^{-1} \nabla f(x_t) &= [\nabla^2 f(x_t)]^{-1} \bigl(\nabla f(x_t) - \nabla f(x^\star)\bigr) &&\text{(since } \nabla f(x^\star) = 0\text{)} \\
&= [\nabla^2 f(x_t)]^{-1} \int_0^1 \nabla^2 f(x^\star + \tau (x_t - x^\star)) \,(x_t - x^\star) \,\mathrm{d}\tau &&\text{(fundamental theorem of calculus)} \\
&= \Bigl([\nabla^2 f(x_t)]^{-1} \int_0^1 \nabla^2 f(x^\star + \tau (x_t - x^\star)) \,\mathrm{d}\tau\Bigr)(x_t - x^\star) \\
&= H_t (x_t - x^\star).
\end{aligned}$$
Hence, substituting this expression in the update rule of damped Newtonโs method,
we find
$$x_{t+1} - x^\star = (x_t - x^\star) - \eta [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) = (I - \eta H_t)(x_t - x^\star),$$
as we wanted to show. โก
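The identity in Theorem L17.1 is easy to sanity-check numerically. The sketch below (an illustration of mine) approximates the integral defining $H_t$ with scipy quadrature for the one-dimensional function of Example L17.1, whose minimizer is $x^\star = 0$; the chosen point $x_t$ and stepsize $\eta$ are arbitrary.

```python
import numpy as np
from scipy.integrate import quad

# f(x) = log(e^{2x} + e^{-2x}), with minimizer x* = 0.
grad = lambda x: 2.0 * np.tanh(2.0 * x)
hess = lambda x: 4.0 / np.cosh(2.0 * x) ** 2

x_star, x_t, eta = 0.0, 0.8, 0.5

# H_t = [hess(x_t)]^{-1} * integral_0^1 hess(x* + tau (x_t - x*)) dtau  (a scalar here).
integral, _ = quad(lambda tau: hess(x_star + tau * (x_t - x_star)), 0.0, 1.0)
H_t = integral / hess(x_t)

x_next = x_t - eta * grad(x_t) / hess(x_t)   # damped Newton step
lhs = x_next - x_star
rhs = (1.0 - eta * H_t) * (x_t - x_star)     # (I - eta H_t)(x_t - x*)
print(lhs, rhs)   # the two numbers agree up to quadrature error
```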
The previous lemma shows that as long as the spectral norm of ๐ผ โ ๐๐ป๐ก is less than 1 (that
is, the maximum absolute value of any eigenvalue of ๐ผ โ ๐๐ป๐ก is less than 1), the distance to
optimality of the iterates ๐ฅ๐ก generated by Newtonโs method will decay exponentially fast.
We will leverage this result in the next two subsections to give local and global convergence guarantees under different hypotheses.
L17.3.2 First corollary: Local convergence guarantees
As a first corollary of the previous lemma, we can derive a convergence guarantee for
Newtonโs method when starting from a point that is โclose enoughโ to a minimum with
sufficient curvature. This type of guarantee is known as a local convergence guarantee, as
it only applies to points that are within a certain distance from the solution. An empirical
illustration of this guarantee is shown in Example L17.1 when starting from the initial point
๐ฅ0 = 0.5 (left table of the example).
In particular, let $x^\star$ be a local minimum of $f$ with strong curvature, that is, a point such that
$$\nabla f(x^\star) = 0, \qquad\text{and}\qquad \nabla^2 f(x^\star) \succeq m I \qquad (2)$$
for some $m > 0$. Furthermore, assume that $f$ is smooth, in the sense that its Hessian is $M$-Lipschitz continuous, that is,
$$\|\nabla^2 f(x) - \nabla^2 f(y)\|_\sigma \le M \cdot \|x - y\|_2. \qquad (3)$$
Our analysis will rest on the following high-level idea.
Since the Hessian at $x^\star$ is $\succeq m I$ and the Hessian is $M$-Lipschitz continuous (and therefore cannot change too fast), we can determine a neighborhood of points "sufficiently close" to $x^\star$ with strong curvature. This will allow us to upper bound the spectral norm of $I - H_t$ and invoke the general result of Theorem L17.1.
We make the above idea formal in the following.
Theorem L17.2. Under the assumptions (2) and (3) above, the spectral norm of the matrix $I - H_t$ induced at time $t$ by the iterate $x_t$ produced by Newton's method satisfies
$$\|I - H_t\|_\sigma \le \frac{M}{m} \|x_t - x^\star\|_2 \qquad\text{whenever}\qquad \|x_t - x^\star\|_2 \le \frac{m}{2M}.$$
Proof. The key technique is to leverage the Lipschitz continuity of the Hessian to bound
the spectral norm of ๐ผ โ ๐ป๐ก. To do that, we can make a difference of Hessians appear in
the expression of ๐ผ โ ๐ป๐ก by rewriting the identity matrix as follows:
$$\begin{aligned}
I - H_t &= \Bigl([\nabla^2 f(x_t)]^{-1} \int_0^1 \nabla^2 f(x_t) \,\mathrm{d}\tau\Bigr) - H_t \\
&= [\nabla^2 f(x_t)]^{-1} \int_0^1 \bigl(\nabla^2 f(x_t) - \nabla^2 f(x^\star + \tau (x_t - x^\star))\bigr) \,\mathrm{d}\tau.
\end{aligned}$$
So, taking spectral norms on both sides, we find
$$\begin{aligned}
\|I - H_t\|_\sigma &\le \|[\nabla^2 f(x_t)]^{-1}\|_\sigma \cdot \int_0^1 \|\nabla^2 f(x_t) - \nabla^2 f(x^\star + \tau (x_t - x^\star))\|_\sigma \,\mathrm{d}\tau \\
&\le \|[\nabla^2 f(x_t)]^{-1}\|_\sigma \cdot \int_0^1 M (1 - \tau) \|x_t - x^\star\|_2 \,\mathrm{d}\tau \\
&= \frac{M}{2} \|[\nabla^2 f(x_t)]^{-1}\|_\sigma \cdot \|x_t - x^\star\|_2. \qquad (4)
\end{aligned}$$
To complete the proof, we only need to bound the spectral norm of [โ2๐(๐ฅ๐ก)]โ1. As
mentioned above, we will do so by using the fact that โ2๐(๐ฅโ) โชฐ ๐๐ผ and the Hessian
changes slowly. In particular, we have
$$\|\nabla^2 f(x_t) - \nabla^2 f(x^\star)\|_\sigma \le M \|x_t - x^\star\|_2,$$
which implies that
$$-M \|x_t - x^\star\|_2 \, I \preceq \nabla^2 f(x_t) - \nabla^2 f(x^\star) \preceq M \|x_t - x^\star\|_2 \, I.$$
In particular,
$$\nabla^2 f(x_t) \succeq \nabla^2 f(x^\star) - M \|x_t - x^\star\|_2 \, I \succeq (m - M \|x_t - x^\star\|_2) I \succeq \frac{m}{2} I,$$
where the last inequality used the hypothesis that $\|x_t - x^\star\|_2 \le \frac{m}{2M}$. Hence, $\|[\nabla^2 f(x_t)]^{-1}\|_\sigma \le \frac{2}{m}$, and plugging this bound into (4) yields the statement. □
In particular, whenever $\|x_t - x^\star\|_2 \le \frac{m}{2M}$, Theorem L17.2 gives $\|I - H_t\|_\sigma \le 1/2$, and so Theorem L17.1 (with $\eta = 1$) guarantees that the distance to optimality decreases exponentially. In fact, since Theorem L17.2 bounds the spectral norm of $I - H_t$ by a linear function of $\|x_t - x^\star\|_2$, the decrease is doubly exponential (the number of correct digits roughly doubles at every iteration). We have just shown the following.
Theorem L17.3. Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice differentiable with $M$-Lipschitz continuous Hessian, and let $x^\star$ be a local minimum of $f$ with strong curvature, that is, a point such that
$$\nabla f(x^\star) = 0, \qquad\text{and}\qquad \nabla^2 f(x^\star) \succeq m I$$
for some $m > 0$. Then, as long as we start Newton's method from a point $x_0$ with distance
$$\|x_0 - x^\star\|_2 \le \frac{m}{2M}$$
from the local minimum, the distance to optimality of the iterates $x_t$ generated by Newton's method decays as
$$\frac{\|x_{t+1} - x^\star\|_2}{m/M} \le \left(\frac{\|x_t - x^\star\|_2}{m/M}\right)^2.$$
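The doubly exponential decay of Theorem L17.3 is visible numerically as a roughly constant ratio $\|x_{t+1} - x^\star\|_2 / \|x_t - x^\star\|_2^2$. The sketch below (my own illustration; the test function $f(x) = x - \log(1 + x)$ with minimizer $x^\star = 0$ and the starting point are arbitrary choices) prints this ratio along the Newton iterates.

```python
# Test function f(x) = x - log(1 + x) on x > -1, with minimizer x* = 0,
# gradient x / (1 + x) and Hessian 1 / (1 + x)^2.
grad = lambda x: x / (1.0 + x)
hess = lambda x: 1.0 / (1.0 + x) ** 2

x, x_star = 0.5, 0.0
for t in range(6):
    err = abs(x - x_star)
    x_next = x - grad(x) / hess(x)   # Newton step (eta = 1)
    print(f"t = {t}: error {err:.3e}, ratio err_next / err^2 = {abs(x_next - x_star) / err**2:.3f}")
    x = x_next
```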
L17.3.3 Second corollary: Global convergence for strong curvature
In general, obtaining global convergence guarantees for Newtonโs method is a much harder
task than obtaining local convergence guarantees. However, we can still obtain global
convergence guarantees for Newtonโs method when the function is both ๐-strongly convex
and ๐ฟ-smooth, that is,
$$\mu I \preceq \nabla^2 f(x) \preceq L I \qquad \forall x \in \mathbb{R}^n.$$
In this case, we will seek to show that the distance to optimality of the iterates ๐ฅ๐ก generated
by Newtonโs method decays at a linear rate (i.e., exponentially fast).
Remark L17.1. The result is made somewhat less appealing by the fact that we already know that gradient descent can reach a similar convergence rate for the same class of smooth and strongly convex functions, without even needing to invert the Hessian (see Lecture 12, section on convergence rate for smooth PŁ functions).
To see why there is hope for this, consider that under these assumptions,¹
$$\frac{\mu}{L} I \preceq [\nabla^2 f(x)]^{-1} \int_0^1 \nabla^2 f(x^\star + \tau (x - x^\star)) \,\mathrm{d}\tau \preceq \frac{L}{\mu} I,$$
so that
$$\Bigl(1 - \eta \frac{L}{\mu}\Bigr) I \preceq I - \eta H_t \preceq \Bigl(1 - \eta \frac{\mu}{L}\Bigr) I.$$
Hence, if $\eta \le \frac{\mu}{L}$, we have
$$0 \preceq I - \eta H_t \preceq \Bigl(1 - \eta \frac{\mu}{L}\Bigr) I.$$
If $H_t$ were symmetric, this would immediately imply $\|I - \eta H_t\|_\sigma \le 1 - \eta \frac{\mu}{L}$, implying contraction to the optimum at the rate of $(1 - \eta \mu / L)^t$ under Theorem L17.1. Unfortunately, it is not immediately obvious how to extend the previous approach directly beyond the case of symmetric $H_t$.²
However, we can still obtain a similar result in the general case with a little additional work, by introducing a preconditioned version of the gradient descent lemma. Indeed, by $L$-smoothness, we can write
$$\begin{aligned}
f(x_{t+1}) &\le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2} \langle x_{t+1} - x_t, x_{t+1} - x_t \rangle \\
&\le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2\mu} \langle x_{t+1} - x_t, \nabla^2 f(x_t)(x_{t+1} - x_t) \rangle \\
&= f(x_t) - \eta \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle + \frac{L}{2\mu} \eta^2 \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle.
\end{aligned}$$
So, when $\eta \le \mu / L$ (so that $-\eta + \frac{L}{2\mu}\eta^2 \le -\frac{\eta}{2}$), we obtain
$$f(x_{t+1}) \le f(x_t) - \frac{\eta}{2} \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle. \qquad (5)$$
On the other hand, by the strong convexity of $f$, we have
$$\begin{aligned}
f(y) &\ge f(x_t) + \langle \nabla f(x_t), y - x_t \rangle + \frac{\mu}{2} \langle y - x_t, y - x_t \rangle \\
&\ge f(x_t) + \langle \nabla f(x_t), y - x_t \rangle + \frac{\mu}{2L} \langle y - x_t, \nabla^2 f(x_t)(y - x_t) \rangle \qquad \forall y \in \mathbb{R}^n.
\end{aligned}$$
Minimizing over $y$ on both sides (the right-hand side is minimized at $y = x_t - \frac{L}{\mu} [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$), we obtain
$$f(x^\star) \ge f(x_t) - \frac{L}{2\mu} \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle \iff \frac{1}{2} \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle \ge \frac{\mu}{L} (f(x_t) - f(x^\star)). \qquad (6)$$
So, by combining the two inequalities (5) and (6), we find that
$$\begin{aligned}
f(x_{t+1}) &\le f(x_t) - \eta \frac{\mu}{L} (f(x_t) - f(x^\star)) \\
\implies\quad f(x_{t+1}) - f(x^\star) &\le \Bigl(1 - \eta \frac{\mu}{L}\Bigr) (f(x_t) - f(x^\star)) \\
\implies\quad f(x_{t+1}) - f(x^\star) &\le \Bigl(1 - \eta \frac{\mu}{L}\Bigr)^t (f(x_1) - f(x^\star)).
\end{aligned}$$
Since by strong convexity and $L$-smoothness
$$f(x_{t+1}) - f(x^\star) \ge \frac{\mu}{2} \|x_{t+1} - x^\star\|_2^2, \qquad\text{and}\qquad f(x_1) - f(x^\star) \le \frac{L}{2} \|x_1 - x^\star\|_2^2,$$
we finally conclude the following.
Corollary L17.1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice differentiable, $\mu$-strongly convex, and $L$-smooth. Then, the distance to optimality³ of the iterates $x_t$ generated by damped Newton's method with stepsize $\eta \le \frac{\mu}{L}$ decays exponentially fast at the rates
$$\|x_{t+1} - x^\star\|_2^2 \le \frac{L}{\mu} \Bigl(1 - \eta \frac{\mu}{L}\Bigr)^t \|x_1 - x^\star\|_2^2 \qquad\text{and}\qquad f(x_{t+1}) - f(x^\star) \le \Bigl(1 - \eta \frac{\mu}{L}\Bigr)^t (f(x_1) - f(x^\star)).$$
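As a sanity check of Corollary L17.1 (an illustrative sketch; the strongly convex test function and the constants $\mu$, $L$ below are my own choices), one can run damped Newton with $\eta = \mu/L$ and verify that the function-value gap stays below $(1 - \eta\mu/L)^t (f(x_1) - f(x^\star))$.

```python
import numpy as np

# Strongly convex, smooth test function:
#   f(x) = 1/2 x^T Q x + sum_i log(cosh(x_i)),
# whose Hessian satisfies Q <= hess f(x) <= Q + I, so mu = 1 and L = 5 below.
Q = np.diag([1.0, 4.0])
mu, L = 1.0, 5.0

f = lambda x: 0.5 * x @ Q @ x + np.sum(np.log(np.cosh(x)))
grad = lambda x: Q @ x + np.tanh(x)
hess = lambda x: Q + np.diag(1.0 / np.cosh(x) ** 2)

x_star = np.zeros(2)              # minimizer (grad f(0) = 0)
eta = mu / L
x = np.array([2.0, -1.5])         # this plays the role of x_1
gap_1 = f(x) - f(x_star)

for t in range(1, 16):
    x = x - eta * np.linalg.solve(hess(x), grad(x))   # damped Newton step
    gap = f(x) - f(x_star)
    bound = (1 - eta * mu / L) ** t * gap_1
    assert gap <= bound + 1e-12
print("function-value gaps respect the (1 - eta*mu/L)^t bound")
```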
L17.4 Further readings
Further material on (damped) Newton's method can be found in Chapter 1.2.4 of Nesterov's book [Nes18]. The use of preconditioning in optimization is discussed in Chapter 1.3 of the same book, under the name "variable metric method".
โ Appendix: Cubic-regularized Newton method. As we have seen above, when the
Hessian matrix is ill conditioned, Newtonโs method might take huge steps and diverge. For
that reason, we were able to show global convergence only in the limited setting of strong
curvature (albeit at an extremely fast rate). We will see in a couple of lectures another
extremely important function class for which Newtonโs method can be shown to converge
globally: self-concordant functions.
Beyond these isolated classes of benign functions, one might wonder whether second-order methods can be made more robust to ill-conditioned Hessian matrices. To resolve this question in the positive, Nesterov and Polyak [NP06] proposed a modification of Newton's method that combines the idea of a proximal step to control the step size with the second-order information of the Hessian matrix. In particular, in our pursuit of a good second-order method we would like to be able to trade off the following two objectives:
• moving as much as possible in the direction of minimizing the second-order approximation $f(x) \approx f(x_t) + \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2} \langle x - x_t, \nabla^2 f(x_t)(x - x_t) \rangle$;
โข staying in a sufficiently small neighborhood of ๐ฅ๐ก, where the approximation is accurate.
The update rule $x_{t+1} = x_t - \eta [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$ seen in (1) does not guarantee this, as it might take large steps when the Hessian is ill conditioned.
As we saw already in Lecture 14, one way to capture this tension quantitatively is via the
notion of proximal step. For first-order methods, when using Euclidean distances to measure
the step size, this resulted in the choice of next iterate given by
$$x_{t+1} := \arg\min_{x} \; \eta \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2} \|x - x_t\|_2^2,$$
which we saw was exactly equivalent to the gradient descent update rule $x_{t+1} = x_t - \eta \nabla f(x_t)$. For second-order methods, a natural generalization of the approach results in the update
$$x_{t+1} := \arg\min_{x} \; \eta \Bigl( \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2} \langle x - x_t, \nabla^2 f(x_t)(x - x_t) \rangle \Bigr) + \frac{1}{6} \|x - x_t\|_2^3. \qquad (7)$$
The previous update rule is precisely the one proposed by Nesterov and Polyak [NP06]. We will call the resulting method cubic-regularized Newton's method. As shown in [NP06], this method guarantees global convergence for a much broader class of functions than the ones we have seen so far, while still enjoying quadratic convergence rates once the iterates are close enough to the solution. In addition to the original paper [NP06], you can find a treatment of cubic-regularized Newton's method in Chapter 4.1 of Nesterov's book [Nes18].
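As a rough illustration of update (7) (a one-dimensional sketch of mine, not the algorithm of [NP06] in full generality: the cubic-regularized model is simply minimized numerically at each step over a bounded window), applying it to the function of Example L17.1 from $x_0 = 0.7$ keeps the steps bounded instead of diverging.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# f(x) = log(e^{2x} + e^{-2x}) from Example L17.1, written stably.
f = lambda x: np.logaddexp(2 * x, -2 * x)
grad = lambda x: 2.0 * np.tanh(2.0 * x)
hess = lambda x: 4.0 / np.cosh(2.0 * x) ** 2

def cubic_newton_step(x_t, eta=1.0):
    # Minimize the cubic-regularized model (7) around x_t (1D, solved numerically).
    model = lambda x: (eta * (grad(x_t) * (x - x_t) + 0.5 * hess(x_t) * (x - x_t) ** 2)
                       + abs(x - x_t) ** 3 / 6.0)
    return minimize_scalar(model, bounds=(x_t - 10.0, x_t + 10.0), method="bounded").x

x = 0.7   # the starting point for which plain Newton's method diverged
for t in range(8):
    print(f"t = {t}: x = {x: .4f}, f(x) = {f(x):.4f}")
    x = cubic_newton_step(x)
```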
Bibliography for this lecture
[Nes18] Nesterov, Y. (2018). Lectures on Convex Optimization. Springer International
Publishing. https://link.springer.com/book/10.1007/978-3-319-91578-4
[NP06] Nesterov, Y., & Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Math. Program., 108(1), 177–205. https://doi.org/10.1007/s10107-006-0706-8
L17.A Appendix: Spectral norms
The spectral norm of a matrix ๐ด measures the maximum increase in Euclidean norm that
๐ด can produce, that is,
$$\|A\|_\sigma := \max_{x \ne 0} \frac{\|A x\|_2}{\|x\|_2}.$$
(This type of norm, defined as the maximum effect that a function can have on the norm
of an input, is known as an operator norm.) In other words, if $\|A\|_\sigma \le c$, then we know that $\|Ax\|_2 \le c \, \|x\|_2$ for all vectors $x$.
The spectral norm of a matrix is closely related to its eigenvalues. In particular, we have
the following characterization.
Theorem L17.4. For a symmetric matrix ๐ด, the spectral norm is equal to the maximum
absolute value of any eigenvalue of ๐ด.
Proof. The theorem is straightforward when $A$ is a diagonal matrix: the maximum is attained by any vector supported only on a coordinate carrying an eigenvalue of maximum absolute value.
To show the result in general, we can reduce to the diagonal case by considering the
eigendecomposition
$$A = Q^\top \Lambda Q,$$
where ๐ is an orthogonal matrix (that is, โ๐๐ฃโ2 = โ๐โค๐ฃโ2 = โ๐ฃโ2 for all ๐ฃ) and ฮ is
a diagonal matrix with the eigenvalues of ๐ด on the diagonal (this decomposition exists
from the spectral theorem for symmetric real matrices). Then,
$$\max_{x \ne 0} \frac{\|A x\|_2}{\|x\|_2} = \max_{x \ne 0} \frac{\|Q^\top \Lambda Q x\|_2}{\|x\|_2} = \max_{x \ne 0} \frac{\|\Lambda Q x\|_2}{\|Q x\|_2},$$
where the last equality used the orthogonality of $Q$ twice (once in the numerator to remove a $Q^\top$, and once in the denominator to add a $Q$). Since $Q$ is invertible, we can therefore perform the change of variables $y = Qx$ and write
$$\max_{x \ne 0} \frac{\|\Lambda Q x\|_2}{\|Q x\|_2} = \max_{y \ne 0} \frac{\|\Lambda y\|_2}{\|y\|_2}.$$
The result then follows from the diagonal case. โก
An immediate consequence of the previous result is the following:
Corollary L17.2. Let ๐ด โ โ๐ร๐ be symmetric. Then, โ๐ดโ๐ โค ๐ if and only if
โ๐๐ผ โชฏ ๐ด โชฏ ๐๐ผ.
Proof. The condition โ๐๐ผ โชฏ ๐ด โชฏ ๐๐ผ is equivalent to asking that every eigenvalue of ๐ด
be in the range [โ๐, ๐]. The result then follows from the previous theorem. โก
Finally, we show that the spectral norm of a product of matrices is submultiplicative. This
follows pretty much directly from the definition of the spectral norm.
Theorem L17.5. For any matrices ๐ด, ๐ต โ โ๐ร๐, we have โ๐ด๐ตโ๐ โค โ๐ดโ๐ โ๐ตโ๐ .
Proof. Let ๐ฅ โ 0 be a vector. Then,
โ๐ด๐ต๐ฅโ2 โค โ๐ดโ๐ โ๐ต๐ฅโ2 โค โ๐ดโ๐ โ๐ตโ๐ โ๐ฅโ2.
Dividing both sides by โ๐ฅโ2 and taking the maximum over all ๐ฅ โ 0 yields the result. โก
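Both statements above are easy to check numerically; the small sketch below (illustrative only) compares NumPy's operator 2-norm of a random symmetric matrix with its largest-magnitude eigenvalue, and verifies submultiplicativity on a random pair of matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Theorem L17.4: for symmetric A, the spectral norm equals max |eigenvalue|.
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                                   # symmetrize
print(np.isclose(np.linalg.norm(A, 2),              # spectral (operator 2-) norm
                 np.max(np.abs(np.linalg.eigvalsh(A)))))   # True

# Theorem L17.5: submultiplicativity ||AB|| <= ||A|| ||B|| (tiny tolerance for rounding).
C = rng.standard_normal((5, 5))
print(np.linalg.norm(A @ C, 2) <= np.linalg.norm(A, 2) * np.linalg.norm(C, 2) * (1 + 1e-12))  # True
```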
Changelog
โข May 11, 2025: Added discussion of non-symmetric case for L17.3.3.
โข May 15, 2025: Fixed a typo in L17.3.3 (thanks Alina Yang!)
โข May 15, 2025: Fixed a typo (thanks George Cao!)
* These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any reports of typos.
¹ In what follows, I will use $\preceq cI$ and $\succeq cI$ to simply refer to an upper/lower bound of $c$ on all eigenvalues.
² Please let me know if you know how to do this directly!
³ By strong convexity, the function must attain one minimum, which is also unique.