MIT 6.7220/15.084 - Nonlinear Optimization (Spring '25) Tue, Apr 22nd 2025
Lecture 17
Hessians, preconditioning, and Newton's method
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)*
So far, we have been concerned with first-order methods, that is, optimization methods that
use gradient information. Starting today, we will discuss second-order methods, which
use not only the gradient but also the Hessian (second-order derivative) of the function
to be optimized. For that, we will now restrict our attention to optimization problems of
the form
$$\min_{x} \; f(x) \qquad \text{where } x \in \mathbb{R}^n,$$
where ๐(๐ฅ) is a twice-differentiable function.
L17.1 From first-order to second-order Taylor approximations
As we mentioned in Lecture 12 when introducing the gradient descent algorithm, a funda-
mental idea for constructing optimization algorithms is to approximate the function to be
optimized by a simpler function that is easier to minimize. In the case of gradient descent, we
picked a direction of movement based on the minimum of the first-order Taylor expansion of
the objective function around the current point. In the case of twice-differentiable functions,
we instead pick the direction of movement by looking at the second-order Taylor expansion
of the objective, which is a more faithful approximation of the function.
The second-order Taylor expansion of a function ๐(๐ฅ) around a point ๐ฅ๐ก is given by
$$f(x) \approx f(x_t) + \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2} \langle x - x_t, \nabla^2 f(x_t)(x - x_t) \rangle,$$
where $\nabla^2 f(x_t)$ is the Hessian matrix of $f$ at $x_t$. The minimizer of this quadratic approximation can be found in closed form by setting its gradient (with respect to $x$) to zero, which gives
$$\nabla f(x_t) + \nabla^2 f(x_t)(x - x_t) = 0 \implies x = x_t - [\nabla^2 f(x_t)]^{-1} \nabla f(x_t).$$
So, we find that by moving from first-order to second-order Taylor approximation, and
assuming that โ2๐(๐ฅ๐ก) is invertible, the natural direction of descent changes from
$$d = \underbrace{-\nabla f(x_t)}_{\substack{\text{using first-order} \\ \text{Taylor approximation}}} \qquad\text{to}\qquad d = \underbrace{-[\nabla^2 f(x_t)]^{-1} \nabla f(x_t)}_{\text{using second-order Taylor approximation}}.$$
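To make the comparison concrete, the following NumPy sketch (an illustration of mine, not part of the original notes) computes both directions for a simple quadratic objective; the test matrix $Q$ and point $x$ are arbitrary choices.

```python
import numpy as np

def first_order_direction(grad):
    """Steepest-descent direction: -grad f(x_t)."""
    return -grad

def second_order_direction(grad, hess):
    """Newton direction: -[hess f(x_t)]^{-1} grad f(x_t).
    Solving the linear system is preferred over forming the inverse explicitly."""
    return -np.linalg.solve(hess, grad)

# Illustrative example: f(x) = 1/2 x^T Q x with an ill-conditioned Q.
Q = np.diag([1.0, 100.0])
x = np.array([1.0, 1.0])
grad, hess = Q @ x, Q

print(first_order_direction(grad))         # dominated by the stiff coordinate
print(second_order_direction(grad, hess))  # points straight at the minimizer x* = 0
```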
L17.1.1 Second-order direction as an example of preconditioning
The second-order direction of descent is obtained by multiplying the negative gradient (that
is, the first-order direction of descent) by the inverse of the Hessian matrix. This operation
is known as preconditioning the gradient by the Hessian. The Hessian preconditioning of
the gradient direction has the effect of making the optimization problem affinely invariant.
In other words, if we apply an affine transformation to the objective function and the initial
point, the second-order direction of descent will also be automatically transformed.
To demonstrate this property, consider the optimization problem
$$\min_{x} \quad f(x) \qquad \text{s.t.} \quad x \in \mathbb{R}^n$$
and suppose that the coordinates of ๐ฅ have been reparametrized via a new coordinate
system ๐ฆ where the correspondence to ๐ฅ is given by ๐ฅ โ ๐ด๐ฆ + ๐.
In the coordinate system of ๐ฆ, the function being minimized is ๐(๐ฆ) โ ๐(๐ด๐ฆ + ๐); so, the
gradient and Hessian of ๐ can be computed as:
$$\nabla_y g(y) = A^\top \nabla_x f(x), \qquad \nabla^2_y g(y) = A^\top \nabla^2_x f(x)\, A.$$
So,
$$\underbrace{-[\nabla^2_y g(y)]^{-1} \nabla_y g(y)}_{d_y} = -[A^\top \nabla^2_x f(x) A]^{-1} (A^\top \nabla_x f(x)) = -A^{-1} [\nabla^2_x f(x)]^{-1} A^{-\top} A^\top \nabla_x f(x) = A^{-1} \cdot \underbrace{\bigl(-[\nabla^2_x f(x)]^{-1} \nabla_x f(x)\bigr)}_{d_x}.$$
Hence, the second-order descent directions, measured with respect to ๐ฆ and ๐ฅ, satisfy
$$d_x = A \, d_y,$$
which mimics the correspondence ๐ฅ โ ๐ด๐ฆ + ๐. So, for example, the update ๐ฅโฒ = ๐ฅ โ ๐๐๐ฅ in
the ๐ฅ coordinate system corresponds to the update
๐ฅโฒ = (๐ด๐ฆ + ๐) โ ๐๐ด๐๐ฆ = ๐ด(๐ฆ โ ๐๐๐ฆ) + ๐ = ๐ด๐ฆโฒ + ๐
in the ๐ฆ coordinate system, preserving the correspondence at all steps. The same property
does not hold for gradient descent. We will take a more in-depth look at preconditioning in the next lecture.
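The affine-invariance property derived above can be checked numerically. The sketch below is an illustration of mine (the test function $f$, matrix $A$, and offset $b$ are arbitrary choices): it verifies that the Newton directions computed in the $x$ and $y$ coordinates satisfy $d_x = A d_y$, while the plain gradient directions do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# A smooth test function f(x) = 1/2 x^T Q x + exp(c^T x), with exact gradient/Hessian.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([0.5, -1.0])
grad_f = lambda x: Q @ x + np.exp(c @ x) * c
hess_f = lambda x: Q + np.exp(c @ x) * np.outer(c, c)

# Affine reparametrization x = A y + b (arbitrary invertible A).
A = np.array([[2.0, 1.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

y = rng.standard_normal(2)
x = A @ y + b

grad_g = A.T @ grad_f(x)          # gradient of g(y) = f(Ay + b)
hess_g = A.T @ hess_f(x) @ A      # Hessian of g

d_x = -np.linalg.solve(hess_f(x), grad_f(x))   # Newton direction in x coordinates
d_y = -np.linalg.solve(hess_g, grad_g)         # Newton direction in y coordinates

print(np.allclose(d_x, A @ d_y))               # True: d_x = A d_y
print(np.allclose(-grad_f(x), A @ (-grad_g)))  # False in general for gradient descent
```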
L17.2 From gradient descent to Newtonโs method
Having established the second-order direction of descent, we can define the natural gener-
alization of the gradient descent algorithm to the second-order setting. This algorithm,
which takes the name damped Newtonโs method, is given by the update rule
$$x_{t+1} = x_t - \eta \, [\nabla^2 f(x_t)]^{-1} \nabla f(x_t). \qquad (1)$$
For the choice ๐ = 1, which corresponds to setting ๐ฅ๐ก+1 to be the minimum of the second-
order approximation centered at ๐ฅ๐ก at each iteration, this algorithm is known simply as
Newtonโs method.
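For concreteness, here is a minimal sketch of update (1) in NumPy (my own illustration; the function handles, stepsize, and iteration count are assumptions supplied by the caller). With $\eta = 1$ it reduces to Newton's method, which minimizes a strictly convex quadratic in a single step, as the usage example checks.

```python
import numpy as np

def damped_newton(grad, hess, x0, eta=1.0, num_iters=20):
    """Damped Newton's method: x_{t+1} = x_t - eta [hess(x_t)]^{-1} grad(x_t).
    With eta = 1 this is (undamped) Newton's method."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_iters):
        step = np.linalg.solve(hess(x), grad(x))
        x = x - eta * step
        iterates.append(x.copy())
    return iterates

# Usage on f(x) = 1/2 x^T Q x - q^T x, whose minimizer Q^{-1} q is reached in one Newton step.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
q = np.array([1.0, 2.0])
iters = damped_newton(lambda x: Q @ x - q, lambda x: Q, np.zeros(2), eta=1.0, num_iters=3)
print(iters[1], np.linalg.solve(Q, q))  # identical after a single step
```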
L17.2.1 Failure mode: lack of curvature
Given that the update rule of Newtonโs method involves the inverse of the Hessian, a
natural concern is what happens when the Hessian is singular or near-singular (for example,
when one eigenvalue is extremely close to zero). Such a situation corresponds to the case
where the function has very little curvature, and the second-order approximation is not a
good approximation of the function. We now show that these concerns are well-founded by
considering a function whose curvature away from the minimum decays extremely fast.
Example L17.1. Consider the function
$$f(x) = \log(e^{2x} + e^{-2x}),$$
[Figure: plot of $f$ over $x \in [-2, 2]$]
whose gradient and Hessian are respectively computed as
$$\nabla f(x) = 2 \cdot \frac{e^{4x} - 1}{e^{4x} + 1}, \qquad \nabla^2 f(x) = 16 \cdot \frac{e^{4x}}{(e^{4x} + 1)^2}.$$
The two tables below show the first iterates ($t = 0, \dots, 8$) of Newton's method and gradient descent when applied to $f(x)$ starting at two nearby initial points: $x_0 = 0.5$ on the left, and $x_0 = 0.7$ on the right. As you can see, the behavior of Newton's method is very different in the two cases: while it converges extremely quickly to the minimum when starting at $x_0 = 0.5$, it diverges when starting at $x_0 = 0.7$.
Starting point x_0 = 0.5:
      t     Newton's method      GD (η = 0.1)
      0          0.5000             0.5000
      1         −0.4067             0.3477
      2          0.2047             0.2274
      3         −0.0237             0.1422
      4      3.53 × 10^-5           0.0868
      5     −1.17 × 10^-13          0.0524
      6     −1.14 × 10^-17          0.0315
      7          0.0000             0.0189
      8          0.0000             0.0114

Starting point x_0 = 0.7:
      t     Newton's method      GD (η = 0.1)
      0          0.7000             0.7000
      1         −1.3480             0.5229
      2         26.1045             0.3669
      3     −2.79 × 10^44           0.2418
      4        diverged             0.1520
      5        diverged             0.0930
      6        diverged             0.0562
      7        diverged             0.0338
      8        diverged             0.0203
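The iterates in the tables can be reproduced with the short script below (a sketch of mine, not the code used for the notes). It uses the equivalent expressions $\nabla f(x) = 2\tanh(2x)$ and $\nabla^2 f(x) = 4/\cosh^2(2x)$ to limit overflow; the diverging run still produces overflow warnings and non-finite values after a few steps, matching the "diverged" entries.

```python
import numpy as np

# Gradient and Hessian of f(x) = log(e^{2x} + e^{-2x}), rewritten with tanh/cosh.
grad = lambda x: 2.0 * np.tanh(2.0 * x)
hess = lambda x: 4.0 / np.cosh(2.0 * x) ** 2

for x0 in (0.5, 0.7):
    x_nt, x_gd = x0, x0
    print(f"--- x0 = {x0} ---")
    for t in range(9):
        print(f"t = {t}:  Newton {x_nt: .4e}   GD {x_gd: .4f}")
        x_nt -= grad(x_nt) / hess(x_nt)   # Newton's method (eta = 1)
        x_gd -= 0.1 * grad(x_gd)          # gradient descent, eta = 0.1
```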
In the next section, we will analyze the convergence properties of Newtonโs method formally.
L17.3 Analysis of Newtonโs method
Example L17.1 shows that Newtonโs method breaks down in the absence of sufficient
curvature. However, it also showed that the method can be extremely efficient when the
starting point is close to the minimum and the minimum has enough curvature. In this
section, we formalize these positive observations quantitatively.
L17.3.1 A key lemma
The analysis of Newtonโs method and damped Newtonโs method hinges on the following
lemma, which relates the geometric shrinking in the distance to optimality of the iterates
to the spectral norm of a specific matrix that depends on the Hessian of the function.
Theorem L17.1 (Newton's method lemma). Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice differentiable with invertible Hessian, and let $x^\star$ be a local minimum of $f$. The distance to optimality of the iterates $x_t$ generated by damped Newton's method with stepsize $\eta > 0$ satisfies
$$x_{t+1} - x^\star = (I - \eta H_t)(x_t - x^\star), \qquad\text{where}\qquad H_t := [\nabla^2 f(x_t)]^{-1} \int_0^1 \nabla^2 f(x^\star + \tau (x_t - x^\star)) \,\mathrm{d}\tau.$$
Proof. The fundamental idea of the proof is to write the second-order direction as a
function of the Hessian matrices on the segment connecting ๐ฅโ to ๐ฅ๐ก.
In particular, we have
$$\begin{aligned}
[\nabla^2 f(x_t)]^{-1} \nabla f(x_t) &= [\nabla^2 f(x_t)]^{-1} \bigl(\nabla f(x_t) - \nabla f(x^\star)\bigr) &&\text{(since } \nabla f(x^\star) = 0\text{)} \\
&= [\nabla^2 f(x_t)]^{-1} \int_0^1 \nabla^2 f(x^\star + \tau (x_t - x^\star)) \,(x_t - x^\star) \,\mathrm{d}\tau &&\text{(fundamental theorem of calculus)} \\
&= \Bigl([\nabla^2 f(x_t)]^{-1} \int_0^1 \nabla^2 f(x^\star + \tau (x_t - x^\star)) \,\mathrm{d}\tau\Bigr)(x_t - x^\star) \\
&= H_t (x_t - x^\star).
\end{aligned}$$
Hence, substituting this expression in the update rule of damped Newtonโs method,
we find
$$x_{t+1} - x^\star = (x_t - x^\star) - \eta [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) = (I - \eta H_t)(x_t - x^\star),$$
as we wanted to show. โก
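The identity in Theorem L17.1 is easy to sanity-check numerically. The sketch below (an illustration of mine) approximates the integral defining $H_t$ with scipy quadrature for the one-dimensional function of Example L17.1, whose minimizer is $x^\star = 0$; the chosen point $x_t$ and stepsize $\eta$ are arbitrary.

```python
import numpy as np
from scipy.integrate import quad

# f(x) = log(e^{2x} + e^{-2x}), with minimizer x* = 0.
grad = lambda x: 2.0 * np.tanh(2.0 * x)
hess = lambda x: 4.0 / np.cosh(2.0 * x) ** 2

x_star, x_t, eta = 0.0, 0.8, 0.5

# H_t = [hess(x_t)]^{-1} * integral_0^1 hess(x* + tau (x_t - x*)) dtau  (a scalar here).
integral, _ = quad(lambda tau: hess(x_star + tau * (x_t - x_star)), 0.0, 1.0)
H_t = integral / hess(x_t)

x_next = x_t - eta * grad(x_t) / hess(x_t)   # damped Newton step
lhs = x_next - x_star
rhs = (1.0 - eta * H_t) * (x_t - x_star)     # (I - eta H_t)(x_t - x*)
print(lhs, rhs)   # the two numbers agree up to quadrature error
```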
The previous lemma shows that as long as the spectral norm of ๐ผ โ ๐๐ป๐ก is less than 1 (that
is, the maximum absolute value of any eigenvalue of ๐ผ โ ๐๐ป๐ก is less than 1), the distance to
optimality of the iterates ๐ฅ๐ก generated by Newtonโs method will decay exponentially fast.
We will leverage this result in the next two subsections to give local and global convergence guarantees under different hypotheses.
L17.3.2 First corollary: Local convergence guarantees
As a first corollary of the previous lemma, we can derive a convergence guarantee for
Newtonโs method when starting from a point that is โclose enoughโ to a minimum with
sufficient curvature. This type of guarantee is known as a local convergence guarantee, as
it only applies to points that are within a certain distance from the solution. An empirical
illustration of this guarantee is shown in Example L17.1 when starting from the initial point
๐ฅ0 = 0.5 (left table of the example).
In particular, let $x^\star$ be a local minimum of $f$ with strong curvature, that is, a point such that
$$\nabla f(x^\star) = 0, \qquad\text{and}\qquad \nabla^2 f(x^\star) \succeq m I \qquad (2)$$
for some $m > 0$. Furthermore, assume that $f$ is smooth, in the sense that its Hessian is $M$-Lipschitz continuous, that is,
$$\|\nabla^2 f(x) - \nabla^2 f(y)\|_\sigma \le M \cdot \|x - y\|_2. \qquad (3)$$
Our analysis will rest on the following high-level idea.
Since the Hessian at $x^\star$ is $\succeq m I$ and the Hessian is $M$-Lipschitz continuous (and therefore cannot change too fast), we can determine a neighborhood of points "sufficiently close" to $x^\star$ with strong curvature. This will allow us to upper bound the spectral norm of $I - H_t$ and invoke the general result of Theorem L17.1.
We make the above idea formal in the following.
Theorem L17.2. Under the assumptions (2) and (3) above, the spectral norm of the matrix $I - H_t$ induced at time $t$ by the iterate $x_t$ produced by Newton's method satisfies
$$\|I - H_t\|_\sigma \le \frac{M}{m} \|x_t - x^\star\|_2 \qquad\text{whenever}\qquad \|x_t - x^\star\|_2 \le \frac{m}{2M}.$$
Proof. The key technique is to leverage the Lipschitz continuity of the Hessian to bound
the spectral norm of ๐ผ โ ๐ป๐ก. To do that, we can make a difference of Hessians appear in
the expression of ๐ผ โ ๐ป๐ก by rewriting the identity matrix as follows:
$$\begin{aligned}
I - H_t &= \Bigl([\nabla^2 f(x_t)]^{-1} \int_0^1 \nabla^2 f(x_t) \,\mathrm{d}\tau\Bigr) - H_t \\
&= [\nabla^2 f(x_t)]^{-1} \int_0^1 \bigl(\nabla^2 f(x_t) - \nabla^2 f(x^\star + \tau (x_t - x^\star))\bigr) \,\mathrm{d}\tau.
\end{aligned}$$
So, taking spectral norms on both sides, we find
$$\begin{aligned}
\|I - H_t\|_\sigma &\le \|[\nabla^2 f(x_t)]^{-1}\|_\sigma \cdot \int_0^1 \|\nabla^2 f(x_t) - \nabla^2 f(x^\star + \tau (x_t - x^\star))\|_\sigma \,\mathrm{d}\tau \\
&\le \|[\nabla^2 f(x_t)]^{-1}\|_\sigma \cdot \int_0^1 M (1 - \tau) \|x_t - x^\star\|_2 \,\mathrm{d}\tau \\
&= \frac{M}{2} \|[\nabla^2 f(x_t)]^{-1}\|_\sigma \cdot \|x_t - x^\star\|_2. \qquad (4)
\end{aligned}$$
To complete the proof, we only need to bound the spectral norm of [โ2๐(๐ฅ๐ก)]โ1. As
mentioned above, we will do so by using the fact that โ2๐(๐ฅโ) โชฐ ๐๐ผ and the Hessian
changes slowly. In particular, we have
$$\|\nabla^2 f(x_t) - \nabla^2 f(x^\star)\|_\sigma \le M \|x_t - x^\star\|_2,$$
which implies that
$$-M \|x_t - x^\star\|_2 \, I \preceq \nabla^2 f(x_t) - \nabla^2 f(x^\star) \preceq M \|x_t - x^\star\|_2 \, I.$$
In particular,
$$\nabla^2 f(x_t) \succeq \nabla^2 f(x^\star) - M \|x_t - x^\star\|_2 \, I \succeq (m - M \|x_t - x^\star\|_2) I \succeq \frac{m}{2} I,$$
where the last inequality used the hypothesis that $\|x_t - x^\star\|_2 \le \frac{m}{2M}$. Hence, $\|[\nabla^2 f(x_t)]^{-1}\|_\sigma \le \frac{2}{m}$, and plugging this bound into (4) yields the statement. □
In particular, whenever $\|x_t - x^\star\|_2 \le \frac{m}{2M}$, Theorem L17.2 gives $\|I - H_t\|_\sigma \le 1/2$, and so Theorem L17.1 (with $\eta = 1$) guarantees that the distance to optimality decreases exponentially. In fact, since Theorem L17.2 bounds the spectral norm of $I - H_t$ by a linear function of $\|x_t - x^\star\|_2$, the decrease is doubly exponential (the number of correct digits roughly doubles at every iteration). We have just shown the following.
Theorem L17.3. Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice differentiable with $M$-Lipschitz continuous Hessian, and let $x^\star$ be a local minimum of $f$ with strong curvature, that is, a point such that
$$\nabla f(x^\star) = 0, \qquad\text{and}\qquad \nabla^2 f(x^\star) \succeq m I$$
for some $m > 0$. Then, as long as we start Newton's method from a point $x_0$ with distance
$$\|x_0 - x^\star\|_2 \le \frac{m}{2M}$$
from the local minimum, the distance to optimality of the iterates $x_t$ generated by Newton's method decays as
$$\frac{\|x_{t+1} - x^\star\|_2}{m/M} \le \left(\frac{\|x_t - x^\star\|_2}{m/M}\right)^2.$$
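The doubly exponential decay of Theorem L17.3 is visible numerically as a roughly constant ratio $\|x_{t+1} - x^\star\|_2 / \|x_t - x^\star\|_2^2$. The sketch below (my own illustration; the test function $f(x) = x - \log(1 + x)$ with minimizer $x^\star = 0$ and the starting point are arbitrary choices) prints this ratio along the Newton iterates.

```python
# Test function f(x) = x - log(1 + x) on x > -1, with minimizer x* = 0,
# gradient x / (1 + x) and Hessian 1 / (1 + x)^2.
grad = lambda x: x / (1.0 + x)
hess = lambda x: 1.0 / (1.0 + x) ** 2

x, x_star = 0.5, 0.0
for t in range(6):
    err = abs(x - x_star)
    x_next = x - grad(x) / hess(x)   # Newton step (eta = 1)
    print(f"t = {t}: error {err:.3e}, ratio err_next / err^2 = {abs(x_next - x_star) / err**2:.3f}")
    x = x_next
```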
L17.3.3 Second corollary: Global convergence for strong curvature
In general, obtaining global convergence guarantees for Newtonโs method is a much harder
task than obtaining local convergence guarantees. However, we can still obtain global
convergence guarantees for Newtonโs method when the function is both ๐-strongly convex
and ๐ฟ-smooth, that is,
$$\mu I \preceq \nabla^2 f(x) \preceq L I \qquad \forall x \in \mathbb{R}^n.$$
In this case, we will seek to show that the distance to optimality of the iterates ๐ฅ๐ก generated
by Newtonโs method decays at a linear rate (i.e., exponentially fast).
Remark L17.1. The result is made somewhat less appealing by the fact that we already know that gradient descent can reach a similar convergence rate for the same class of smooth and strongly convex functions, without even needing to invert the Hessian (see Lecture 12, section on convergence rate for smooth PŁ functions).
To see why there is hope for this, consider that under these assumptions,¹
$$\frac{\mu}{L} I \preceq [\nabla^2 f(x)]^{-1} \int_0^1 \nabla^2 f(x^\star + \tau (x - x^\star)) \,\mathrm{d}\tau \preceq \frac{L}{\mu} I,$$
so that
$$\Bigl(1 - \eta \frac{L}{\mu}\Bigr) I \preceq I - \eta H_t \preceq \Bigl(1 - \eta \frac{\mu}{L}\Bigr) I.$$
Hence, if $\eta \le \frac{\mu}{L}$, we have
$$0 \preceq I - \eta H_t \preceq \Bigl(1 - \eta \frac{\mu}{L}\Bigr) I.$$
If $H_t$ were symmetric, this would immediately imply $\|I - \eta H_t\|_\sigma \le 1 - \eta \frac{\mu}{L}$, implying contraction to the optimum at the rate of $(1 - \eta \mu / L)^t$ under Theorem L17.1. Unfortunately, it is not immediately obvious how to extend the previous approach directly beyond the case of symmetric $H_t$.²
However, we can still obtain a similar result in the general case with a little additional work, by introducing a preconditioned version of the gradient descent lemma. Indeed, by $L$-smoothness, we can write
$$\begin{aligned}
f(x_{t+1}) &\le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2} \langle x_{t+1} - x_t, x_{t+1} - x_t \rangle \\
&\le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2\mu} \langle x_{t+1} - x_t, \nabla^2 f(x_t)(x_{t+1} - x_t) \rangle \\
&= f(x_t) - \eta \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle + \frac{L}{2\mu} \eta^2 \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle.
\end{aligned}$$
So, when $\eta \le \mu / L$ (so that $-\eta + \frac{L}{2\mu}\eta^2 \le -\frac{\eta}{2}$), we obtain
$$f(x_{t+1}) \le f(x_t) - \frac{\eta}{2} \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle. \qquad (5)$$
On the other hand, by the strong convexity of $f$, we have
$$\begin{aligned}
f(y) &\ge f(x_t) + \langle \nabla f(x_t), y - x_t \rangle + \frac{\mu}{2} \langle y - x_t, y - x_t \rangle \\
&\ge f(x_t) + \langle \nabla f(x_t), y - x_t \rangle + \frac{\mu}{2L} \langle y - x_t, \nabla^2 f(x_t)(y - x_t) \rangle \qquad \forall y \in \mathbb{R}^n.
\end{aligned}$$
Minimizing over $y$ on both sides (the right-hand side is minimized at $y = x_t - \frac{L}{\mu} [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$), we obtain
$$f(x^\star) \ge f(x_t) - \frac{L}{2\mu} \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle \iff \frac{1}{2} \langle \nabla f(x_t), [\nabla^2 f(x_t)]^{-1} \nabla f(x_t) \rangle \ge \frac{\mu}{L} (f(x_t) - f(x^\star)). \qquad (6)$$
So, by combining the two inequalities (5) and (6), we find that
$$\begin{aligned}
f(x_{t+1}) &\le f(x_t) - \eta \frac{\mu}{L} (f(x_t) - f(x^\star)) \\
\implies\quad f(x_{t+1}) - f(x^\star) &\le \Bigl(1 - \eta \frac{\mu}{L}\Bigr) (f(x_t) - f(x^\star)) \\
\implies\quad f(x_{t+1}) - f(x^\star) &\le \Bigl(1 - \eta \frac{\mu}{L}\Bigr)^t (f(x_1) - f(x^\star)).
\end{aligned}$$
Since by strong convexity and $L$-smoothness
$$f(x_{t+1}) - f(x^\star) \ge \frac{\mu}{2} \|x_{t+1} - x^\star\|_2^2, \qquad\text{and}\qquad f(x_1) - f(x^\star) \le \frac{L}{2} \|x_1 - x^\star\|_2^2,$$
we finally conclude the following.
Corollary L17.1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice differentiable, $\mu$-strongly convex, and $L$-smooth. Then, the distance to optimality³ of the iterates $x_t$ generated by damped Newton's method with stepsize $\eta \le \frac{\mu}{L}$ decays exponentially fast at the rates
$$\|x_{t+1} - x^\star\|_2^2 \le \frac{L}{\mu} \Bigl(1 - \eta \frac{\mu}{L}\Bigr)^t \|x_1 - x^\star\|_2^2 \qquad\text{and}\qquad f(x_{t+1}) - f(x^\star) \le \Bigl(1 - \eta \frac{\mu}{L}\Bigr)^t (f(x_1) - f(x^\star)).$$
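As a sanity check of Corollary L17.1 (an illustrative sketch; the strongly convex test function and the constants $\mu$, $L$ below are my own choices), one can run damped Newton with $\eta = \mu/L$ and verify that the function-value gap stays below $(1 - \eta\mu/L)^t (f(x_1) - f(x^\star))$.

```python
import numpy as np

# Strongly convex, smooth test function:
#   f(x) = 1/2 x^T Q x + sum_i log(cosh(x_i)),
# whose Hessian satisfies Q <= hess f(x) <= Q + I, so mu = 1 and L = 5 below.
Q = np.diag([1.0, 4.0])
mu, L = 1.0, 5.0

f = lambda x: 0.5 * x @ Q @ x + np.sum(np.log(np.cosh(x)))
grad = lambda x: Q @ x + np.tanh(x)
hess = lambda x: Q + np.diag(1.0 / np.cosh(x) ** 2)

x_star = np.zeros(2)              # minimizer (grad f(0) = 0)
eta = mu / L
x = np.array([2.0, -1.5])         # this plays the role of x_1
gap_1 = f(x) - f(x_star)

for t in range(1, 16):
    x = x - eta * np.linalg.solve(hess(x), grad(x))   # damped Newton step
    gap = f(x) - f(x_star)
    bound = (1 - eta * mu / L) ** t * gap_1
    assert gap <= bound + 1e-12
print("function-value gaps respect the (1 - eta*mu/L)^t bound")
```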
L17.4 Further readings
Further material on (damped) Newton's method can be found in Chapter 1.2.4 of Nesterov's book [Nes18]. The use of preconditioning in optimization is discussed in Chapter 1.3 of the same book, under the name "variable metric method".
โ Appendix: Cubic-regularized Newton method. As we have seen above, when the
Hessian matrix is ill conditioned, Newtonโs method might take huge steps and diverge. For
that reason, we were able to show global convergence only in the limited setting of strong
curvature (albeit at an extremely fast rate). We will see in a couple of lectures another
extremely important function class for which Newtonโs method can be shown to converge
globally: self-concordant functions.
Beyond these isolated classes of benign functions, one might wonder whether second-order methods can be made more robust to ill-conditioned Hessian matrices. To resolve this question in the positive, Nesterov and Polyak [NP06] proposed a modification of Newton's method that combines the idea of a proximal step to control the step size with the second-order information of the Hessian matrix. In particular, in our pursuit of a good second-order method we would like to be able to trade off the following two objectives:
• moving as much as possible in the direction of minimizing the second-order approximation $f(x) \approx f(x_t) + \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2} \langle x - x_t, \nabla^2 f(x_t)(x - x_t) \rangle$;
โข staying in a sufficiently small neighborhood of ๐ฅ๐ก, where the approximation is accurate.
The update rule $x_{t+1} = x_t - \eta [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$ seen in (1) does not guarantee this, as it might take large steps when the Hessian is ill conditioned.
As we saw already in Lecture 14, one way to capture this tension quantitatively is via the
notion of proximal step. For first-order methods, when using Euclidean distances to measure
the step size, this resulted in the choice of next iterate given by
$$x_{t+1} := \arg\min_{x} \; \eta \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2} \|x - x_t\|_2^2,$$
which we saw was exactly equivalent to the gradient descent update rule $x_{t+1} = x_t - \eta \nabla f(x_t)$. For second-order methods, a natural generalization of the approach results in the update
$$x_{t+1} := \arg\min_{x} \; \eta \Bigl( \langle \nabla f(x_t), x - x_t \rangle + \frac{1}{2} \langle x - x_t, \nabla^2 f(x_t)(x - x_t) \rangle \Bigr) + \frac{1}{6} \|x - x_t\|_2^3. \qquad (7)$$
The previous update rule is precisely the one proposed by Nesterov and Polyak [NP06]. We will call the resulting method cubic-regularized Newton's method. As shown in [NP06], this method guarantees global convergence for a much broader class of functions than the ones we have seen so far, while still enjoying quadratic convergence rates once the iterates are close enough to the solution. In addition to the original paper [NP06], you can find a treatment of cubic-regularized Newton's method in Chapter 4.1 of Nesterov's book [Nes18].
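As a rough illustration of update (7) (a one-dimensional sketch of mine, not the algorithm of [NP06] in full generality: the cubic-regularized model is simply minimized numerically at each step over a bounded window), applying it to the function of Example L17.1 from $x_0 = 0.7$ keeps the steps bounded instead of diverging.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# f(x) = log(e^{2x} + e^{-2x}) from Example L17.1, written stably.
f = lambda x: np.logaddexp(2 * x, -2 * x)
grad = lambda x: 2.0 * np.tanh(2.0 * x)
hess = lambda x: 4.0 / np.cosh(2.0 * x) ** 2

def cubic_newton_step(x_t, eta=1.0):
    # Minimize the cubic-regularized model (7) around x_t (1D, solved numerically).
    model = lambda x: (eta * (grad(x_t) * (x - x_t) + 0.5 * hess(x_t) * (x - x_t) ** 2)
                       + abs(x - x_t) ** 3 / 6.0)
    return minimize_scalar(model, bounds=(x_t - 10.0, x_t + 10.0), method="bounded").x

x = 0.7   # the starting point for which plain Newton's method diverged
for t in range(8):
    print(f"t = {t}: x = {x: .4f}, f(x) = {f(x):.4f}")
    x = cubic_newton_step(x)
```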
Bibliography for this lecture
[Nes18] Nesterov, Y. (2018). Lectures on Convex Optimization. Springer International
Publishing. https://link.springer.com/book/10.1007/978-3-319-91578-4
[NP06] Nesterov, Y., & Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Math. Program., 108(1), 177–205. https://doi.org/10.1007/s10107-006-0706-8
L17.A Appendix: Spectral norms
The spectral norm of a matrix ๐ด measures the maximum increase in Euclidean norm that
๐ด can produce, that is,
$$\|A\|_\sigma := \max_{x \ne 0} \frac{\|A x\|_2}{\|x\|_2}.$$
(This type of norm, defined as the maximum effect that a function can have on the norm
of an input, is known as an operator norm.) In other words, if $\|A\|_\sigma \le c$, then we know that $\|Ax\|_2 \le c \, \|x\|_2$ for all vectors $x$.
The spectral norm of a matrix is closely related to its eigenvalues. In particular, we have
the following characterization.
Theorem L17.4. For a symmetric matrix ๐ด, the spectral norm is equal to the maximum
absolute value of any eigenvalue of ๐ด.
Proof. The theorem is straightforward when $A$ is a diagonal matrix: the maximum is attained by any vector supported only on a coordinate carrying an eigenvalue of maximum absolute value.
To show the result in general, we can reduce to the diagonal case by considering the
eigendecomposition
$$A = Q^\top \Lambda Q,$$
where ๐ is an orthogonal matrix (that is, โ๐๐ฃโ2 = โ๐โค๐ฃโ2 = โ๐ฃโ2 for all ๐ฃ) and ฮ is
a diagonal matrix with the eigenvalues of ๐ด on the diagonal (this decomposition exists
from the spectral theorem for symmetric real matrices). Then,
$$\max_{x \ne 0} \frac{\|A x\|_2}{\|x\|_2} = \max_{x \ne 0} \frac{\|Q^\top \Lambda Q x\|_2}{\|x\|_2} = \max_{x \ne 0} \frac{\|\Lambda Q x\|_2}{\|Q x\|_2},$$
where the last equality used the orthogonality of $Q$ twice (once in the numerator to remove a $Q^\top$, and once in the denominator to add a $Q$). Since $Q$ is invertible, we can therefore perform the change of variables $y = Qx$ and write
$$\max_{x \ne 0} \frac{\|\Lambda Q x\|_2}{\|Q x\|_2} = \max_{y \ne 0} \frac{\|\Lambda y\|_2}{\|y\|_2}.$$
The result then follows from the diagonal case. โก
An immediate consequence of the previous result is the following:
Corollary L17.2. Let ๐ด โ โ๐ร๐ be symmetric. Then, โ๐ดโ๐ โค ๐ if and only if
โ๐๐ผ โชฏ ๐ด โชฏ ๐๐ผ.
Proof. The condition โ๐๐ผ โชฏ ๐ด โชฏ ๐๐ผ is equivalent to asking that every eigenvalue of ๐ด
be in the range [โ๐, ๐]. The result then follows from the previous theorem. โก
Finally, we show that the spectral norm of a product of matrices is submultiplicative. This
follows pretty much directly from the definition of the spectral norm.
Theorem L17.5. For any matrices ๐ด, ๐ต โ โ๐ร๐, we have โ๐ด๐ตโ๐ โค โ๐ดโ๐ โ๐ตโ๐ .
Proof. Let ๐ฅ โ 0 be a vector. Then,
โ๐ด๐ต๐ฅโ2 โค โ๐ดโ๐ โ๐ต๐ฅโ2 โค โ๐ดโ๐ โ๐ตโ๐ โ๐ฅโ2.
Dividing both sides by โ๐ฅโ2 and taking the maximum over all ๐ฅ โ 0 yields the result. โก
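Both statements above are easy to check numerically; the small sketch below (illustrative only) compares NumPy's operator 2-norm of a random symmetric matrix with its largest-magnitude eigenvalue, and verifies submultiplicativity on a random pair of matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Theorem L17.4: for symmetric A, the spectral norm equals max |eigenvalue|.
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                                   # symmetrize
print(np.isclose(np.linalg.norm(A, 2),              # spectral (operator 2-) norm
                 np.max(np.abs(np.linalg.eigvalsh(A)))))   # True

# Theorem L17.5: submultiplicativity ||AB|| <= ||A|| ||B|| (tiny tolerance for rounding).
C = rng.standard_normal((5, 5))
print(np.linalg.norm(A @ C, 2) <= np.linalg.norm(A, 2) * np.linalg.norm(C, 2) * (1 + 1e-12))  # True
```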
Changelog
โข May 11, 2025: Added discussion of non-symmetric case for L17.3.3.
โข May 15, 2025: Fixed a typo in L17.3.3 (thanks Alina Yang!)
โข May 15, 2025: Fixed a typo (thanks George Cao!)
* These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any reports of typos.
¹ In what follows, I will use $\preceq cI$ and $\succeq cI$ to simply refer to an upper/lower bound of $c$ on all eigenvalues.
² Please let me know if you know how to do this directly!
³ By strong convexity, the function must attain one minimum, which is also unique.