



MIT 6.7220/15.084 — Nonlinear Optimization (Spring ‘25) Tue, Apr 1st 2025
Lecture 12
Gradient descent and descent lemmas
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)
With this lecture, we start exploring first-order optimization methods for nonlinear
optimization. We begin our exploration in the case of minimization of unconstrained differ-
entiable functions, that is, optimization problems of the form
$$\min_{x} \quad f(x) \qquad \text{s.t.} \quad x \in \mathbb{R}^n.$$
In this case, as we mentioned several times (see, for example, Lecture 2), a point $x \in \mathbb{R}^n$
solves the optimization problem only if $\nabla f(x) = 0$. Furthermore, when $f$ is convex, the
necessary condition also becomes sufficient (Lecture 4).
L12.1 The fundamental idea of gradient descent
The function 𝑓 might be complicated. A fundamental idea for constructing an optimization
algorithm is to approximate 𝑓 with tractable simpler models and take steps assuming the
model is locally accurate at the current point 𝑥𝑡. For gradient descent and, more generally,
first-order methods, the local model of the function is obtained from truncating the Taylor
approximation of 𝑓 around the current point 𝑥𝑡 to its first order:
$$f(x) \approx f(x_t) + \langle \nabla f(x_t), x - x_t \rangle \qquad \text{for all } x \text{ ``close to'' } x_t.$$
The idea of gradient descent is then to move in the direction that minimizes the approxi-
mation of the objective above, that is, move a certain amount $\eta > 0$ in the direction $-\nabla f(x_t)$
of steepest descent of the function:
$$x_{t+1} := x_t - \eta\, \nabla f(x_t). \tag{1}$$
We will call 𝑥0 the initial point, and the parameter 𝜂 > 0 the “stepsize” or “learning rate”.
Intuitively, the safe magnitude of 𝜂 must be determined as a function of how quickly the
first-order expansion becomes inaccurate.
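For concreteness, here is a minimal sketch of the update rule (1) in Python; the helper name, the example objective, and the stepsize below are illustrative choices, not prescribed by the lecture.

import numpy as np

def gradient_descent(grad, x0, eta, num_steps):
    # Iterate x_{t+1} = x_t - eta * grad(x_t), returning the full trajectory.
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(num_steps):
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return xs

# Example: f(x) = (1/2)||x||_2^2 has gradient x and smoothness constant L = 1,
# so eta = 1/L = 1 is a valid stepsize (here it reaches the minimizer 0 in one step).
iterates = gradient_descent(lambda x: x, x0=[3.0, -4.0], eta=1.0, num_steps=10)
print(iterates[-1])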
L12.2 Analysis for 𝐿-smooth functions
Since gradient descent operates by considering the first-order approximation of 𝑓, it is
natural that a key quantity in the analysis of the algorithm has to do with how quickly
the first-order approximation becomes inaccurate. In particular, when replacing 𝑓 with its
first-order approximation, we are imagining that the gradient of 𝑓 is constant at all points
—that is, the graph of 𝑓 has no curvature. How far around a given point 𝑥𝑡 is this
approximation sensible? A very natural answer to this question is given by the smoothness
of 𝑓.
In other words, a standard avenue for studying gradient descent passes through requiring a
bound on how fast the function’s gradient can change by moving slightly in any direction.
This implies that as we start moving in the direction $-\nabla f(x_t)$ from $x_t$, the
new gradients we encounter along the way will still be mostly aligned with $\nabla f(x_t)$, and
so $-\nabla f(x_t)$ will continue to be a good descent direction for a while.
L12.2.1 𝐿-smoothness
Specifically, we will require that the gradient $\nabla f(x)$ be $L$-Lipschitz continuous for some
constant $L \geq 0$. This condition is often called $L$-smoothness in the literature. We present it
now for general functions with arbitrary convex domains $\Omega$; today, we will only care about
the case $\Omega = \mathbb{R}^n$.
Definition L12.1 ($L$-smoothness). A differentiable function $f : \Omega \to \mathbb{R}$ is $L$-smooth if its
gradient is $L$-Lipschitz continuous, that is,
$$\|\nabla f(x) - \nabla f(y)\|_2 \leq L \|x - y\|_2 \qquad \forall x, y \in \Omega.$$
An immediate consequence of 𝐿-smoothness is that the function admits a quadratic upper
bound. This property will come in extremely handy in the analysis.
Theorem L12.1 (Quadratic upper bound). Let $f : \Omega \to \mathbb{R}$ be $L$-smooth on a convex
domain $\Omega$. Then, we can upper bound the function $f$ as
$$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_2^2 \qquad \forall x, y \in \Omega. \tag{2}$$
Proof. The idea is simple: we express the growth $f(y) - f(x)$ as the integral of the gradient
along the segment connecting $x$ to $y$, and then use the Lipschitz bound on the change of
the gradient $\nabla f$:
$$\begin{aligned}
f(y) - f(x) &= \int_0^1 \langle \nabla f(x + t(y - x)), y - x \rangle \,\mathrm{d}t \\
&= \left( \int_0^1 \langle \nabla f(x + t(y - x)) - \nabla f(x), y - x \rangle \,\mathrm{d}t \right) + \langle \nabla f(x), y - x \rangle \\
&\leq \left( \int_0^1 \|\nabla f(x + t(y - x)) - \nabla f(x)\|_2 \, \|y - x\|_2 \,\mathrm{d}t \right) + \langle \nabla f(x), y - x \rangle \\
&\leq \left( \int_0^1 t L \|y - x\|_2^2 \,\mathrm{d}t \right) + \langle \nabla f(x), y - x \rangle \\
&= \frac{L}{2} \|y - x\|_2^2 + \langle \nabla f(x), y - x \rangle,
\end{aligned}$$
where the first inequality uses the Cauchy-Schwarz inequality and the second uses the
$L$-Lipschitz continuity of $\nabla f$. Rearranging, we obtain the statement.
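As a quick numerical illustration of the bound (2), the following sketch checks it at random pairs of points for $f(x) = \log(1 + e^x)$, which is $L$-smooth with $L = \frac{1}{4}$ since $f''(x) = \sigma(x)(1 - \sigma(x)) \leq \frac{1}{4}$; the sampling range below is an arbitrary choice.

import numpy as np

# Check f(y) <= f(x) + f'(x)(y - x) + (L/2)(y - x)^2 for f(x) = log(1 + e^x), L = 1/4.
f = lambda x: np.log1p(np.exp(x))
grad = lambda x: 1.0 / (1.0 + np.exp(-x))   # derivative of log(1 + e^x) is the sigmoid
L = 0.25
rng = np.random.default_rng(1)
x, y = rng.uniform(-5, 5, size=(2, 1000))
upper = f(x) + grad(x) * (y - x) + (L / 2) * (y - x) ** 2
print(np.all(f(y) <= upper + 1e-12))  # expected: True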
We also mention the following characterization.
Theorem L12.2. For twice-differentiable functions $f : \Omega \to \mathbb{R}$ defined on an open set $\Omega \subseteq \mathbb{R}^n$, an equivalent condition to $L$-smoothness is
$$-L I \preceq \nabla^2 f(x) \preceq L I \qquad \forall x \in \Omega,$$
or equivalently,
$$|v^\top \nabla^2 f(x)\, v| \leq L \qquad \forall x \in \Omega,\ v \in \mathbb{R}^n : \|v\|_2 = 1.$$
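As a quick illustration of this characterization, for a quadratic $f(x) = \frac{1}{2} x^\top Q x + b^\top x$ with symmetric $Q$, the Hessian is $Q$ at every point, so the smallest valid smoothness constant is the largest absolute eigenvalue of $Q$. A small numerical sketch (the matrix below is an arbitrary example):

import numpy as np

# By Theorem L12.2, the smallest L for f(x) = 1/2 x^T Q x + b^T x is max_i |lambda_i(Q)|.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
L = np.max(np.abs(np.linalg.eigvalsh(Q)))
print(L)  # largest absolute eigenvalue of Q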
L12.2.2 Convergence in gradient norm: The gradient descent lemma
We can use the quadratic upper bound (Theorem L12.1) to quantify the improvement of
one step of the gradient descent algorithm. This bound on the improvement is often called
the gradient descent lemma.
Theorem L12.3 (Gradient descent lemma). Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth. Then, for any
$0 < \eta \leq \frac{1}{L}$, each step of gradient descent (1) guarantees
$$f(x_{t+1}) \leq f(x_t) - \frac{\eta}{2} \|\nabla f(x_t)\|_2^2.$$
Proof. We start by writing (2) for the choice $x = x_t$, $y = x_{t+1}$:
$$f(x_{t+1}) \leq f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2} \|x_{t+1} - x_t\|_2^2.$$
Plugging in the gradient descent step $x_{t+1} = x_t - \eta \nabla f(x_t)$, we therefore obtain
$$\begin{aligned}
f(x_{t+1}) &\leq f(x_t) - \eta \|\nabla f(x_t)\|_2^2 + \frac{L}{2} \eta^2 \|\nabla f(x_t)\|_2^2 \\
&= f(x_t) - \eta \left(1 - \frac{L}{2}\eta\right) \|\nabla f(x_t)\|_2^2.
\end{aligned}$$
If $\eta \leq \frac{1}{L}$, the parenthesis satisfies $1 - \frac{L}{2}\eta \geq \frac{1}{2}$, and the result follows.
The previous result shows that for $L$-smooth functions, there exists a good choice of learning
rate (namely, $\eta = \frac{1}{L}$) such that each step of gradient descent is guaranteed to improve the
function value whenever the current point does not have a zero gradient.
Assuming that $f$ is lower bounded, the gradient descent lemma also implies that the
gradients of the points produced by gradient descent must eventually become small. Indeed,
imagine running the algorithm for $T$ iterations. Then, by repeatedly using the gradient
descent lemma, we would have
$$f(x_T) \leq f(x_0) - \frac{\eta}{2} \sum_{t=0}^{T-1} \|\nabla f(x_t)\|_2^2.$$
Assume now that $f$ is lower bounded by $f^\star := \inf f$. Since $f(x_T) \geq f^\star$, the above implies that
$$\sum_{t=0}^{T-1} \|\nabla f(x_t)\|_2^2 \leq \frac{2}{\eta} \left( f(x_0) - f^\star \right).$$
So, at least one among the $\{\|\nabla f(x_t)\|_2^2\}_{t=0}^{T-1}$ must be upper bounded by $\frac{2}{\eta T} (f(x_0) - f^\star)$.
More formally, we have the following.
Theorem L12.4. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth, and let $0 < \eta \leq \frac{1}{L}$. Then, by running
gradient descent for $T$ iterations, at least one of the points $\{x_t\}$ encountered must
satisfy
$$\|\nabla f(x_t)\|_2^2 \leq \frac{2 \left( f(x_0) - f^\star \right)}{\eta T}.$$
We remark that the above result does not require convexity of 𝑓.
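As a sanity check of Theorem L12.4 on a nonconvex example, the following sketch runs gradient descent on $f(x) = \sin(x)$, which is $1$-smooth and lower bounded by $f^\star = -1$, and verifies that the smallest squared gradient encountered is below the stated bound; the starting point and horizon are arbitrary choices.

import numpy as np

# f(x) = sin(x): nonconvex, 1-smooth (|f''| <= 1), lower bounded by -1.
f, grad = np.sin, np.cos
L, eta, T = 1.0, 1.0, 1000
x = 2.0                      # arbitrary starting point
best_sq_grad = np.inf
f0 = f(x)
for _ in range(T):
    best_sq_grad = min(best_sq_grad, grad(x) ** 2)
    x = x - eta * grad(x)
bound = 2 * (f0 - (-1.0)) / (eta * T)
print(best_sq_grad <= bound)  # expected: True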
L12.2.3 Convergence in convex function value: The Euclidean mirror descent lemma
The result in Theorem L12.4 only guarantees a convergence rate (to 0) for the objective's
gradient norm, not for the suboptimality gap $f(x_t) - f^\star$ as measured by the function value.
In other words, while a small gradient indicates a condition of approximate local optimality, it
does not necessarily mean that the function value is close to optimal. To see this, consider
the function $f(x) = \varepsilon \log(1 + e^x)$. The gradient satisfies $|\nabla f(x)| \leq \varepsilon$ for all $x \in \mathbb{R}$, and so $x$
might be arbitrarily far from optimal while the gradient is small.
However, for convex functions, a convergence rate in function value can be established, as
we now show.
Step I: The Euclidean mirror descent lemma. To give guarantees on the function value,
we will now further assume that 𝑓 is convex, and we will leverage what we will call the
Euclidean mirror descent lemma. In a future lecture, we will see a more general version of
this lemma, which we will call mirror descent lemma without further qualifications.
Theorem L12.5 (Euclidean mirror descent lemma). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex and
differentiable function. Then, for any choice of stepsize $\eta$, any two consecutive points
$(x_t, x_{t+1})$ produced by the gradient descent algorithm (1) satisfy
$$f(x_t) \leq f(y) + \frac{1}{2\eta} \left( \|y - x_t\|_2^2 - \|y - x_{t+1}\|_2^2 + \|x_{t+1} - x_t\|_2^2 \right) \qquad \forall y \in \mathbb{R}^n.$$
Proof. The proof rests on the following crucial observation, often called the three-point
equality, which can be checked by expanding the squared norms: for any $y \in \mathbb{R}^n$,
$$\langle x_t - x_{t+1}, y - x_t \rangle = -\frac{1}{2} \left( \|y - x_t\|_2^2 - \|y - x_{t+1}\|_2^2 + \|x_{t+1} - x_t\|_2^2 \right).$$
Since $f$ is convex by hypothesis, it satisfies the linear lower bound we saw in Lecture 4. In
particular, we can lower bound the value of $f(y)$, for any $y \in \mathbb{R}^n$, with the linearization
at the point $x_t$, obtaining
$$f(y) \geq f(x_t) + \langle \nabla f(x_t), y - x_t \rangle.$$
Plugging in the relationship (1), that is, $x_{t+1} = x_t - \eta \nabla f(x_t)$, we then obtain
$$f(y) \geq f(x_t) + \frac{1}{\eta} \langle x_t - x_{t+1}, y - x_t \rangle.$$
Substituting the critical observation above and rearranging, we obtain
$$f(x_t) \leq f(y) + \frac{1}{2\eta} \left( \|y - x_t\|_2^2 - \|y - x_{t+1}\|_2^2 + \|x_{t+1} - x_t\|_2^2 \right) \qquad \forall y \in \mathbb{R}^n,$$
which is the statement.
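The three-point equality is easy to verify numerically; the following sketch checks it at randomly drawn vectors (purely illustrative, since the identity holds for any $x_t$, $x_{t+1}$, $y$).

import numpy as np

# Check <x_t - x_{t+1}, y - x_t> = -1/2 (||y - x_t||^2 - ||y - x_{t+1}||^2 + ||x_{t+1} - x_t||^2).
rng = np.random.default_rng(0)
xt, xt1, y = rng.standard_normal((3, 5))
lhs = np.dot(xt - xt1, y - xt)
rhs = -0.5 * (np.dot(y - xt, y - xt) - np.dot(y - xt1, y - xt1) + np.dot(xt1 - xt, xt1 - xt))
print(np.isclose(lhs, rhs))  # expected: True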
Step II: Telescoping. We can use the Euclidean mirror descent lemma above to get a
rate of decrease in terms of the objective function value 𝑓(𝑥𝑡).
Two steps are necessary: first, since $x_{t+1} - x_t = -\eta \nabla f(x_t)$ by definition of the gradient
descent algorithm (1), we can replace the term $\|x_{t+1} - x_t\|_2^2$ in the statement of Theorem
L12.5 with $\eta^2 \|\nabla f(x_t)\|_2^2$, obtaining
$$f(x_t) \leq f(y) + \frac{1}{2\eta} \left( \|y - x_t\|_2^2 - \|y - x_{t+1}\|_2^2 \right) + \frac{\eta}{2} \|\nabla f(x_t)\|_2^2 \qquad \forall y \in \mathbb{R}^n.$$
Assuming $f$ is $L$-smooth and $\eta \leq \frac{1}{L}$, we can then use the gradient descent lemma (Theorem
L12.3) to bound
$$\frac{\eta}{2} \|\nabla f(x_t)\|_2^2 \leq f(x_t) - f(x_{t+1}),$$
obtaining the following corollary.
Corollary L12.1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and $L$-smooth, and $0 < \eta \leq \frac{1}{L}$. Then, any
two consecutive points $(x_t, x_{t+1})$ produced by gradient descent satisfy
$$f(x_{t+1}) \leq f(y) + \frac{1}{2\eta} \left( \|y - x_t\|_2^2 - \|y - x_{t+1}\|_2^2 \right) \qquad \forall y \in \mathbb{R}^n.$$
In particular, if $f$ has a minimizer $x^\star$, the above corollary implies that, whenever $\eta \leq \frac{1}{L}$,
$$f(x_{t+1}) \leq f(x^\star) + \frac{1}{2\eta} \left( \|x^\star - x_t\|_2^2 - \|x^\star - x_{t+1}\|_2^2 \right).$$
Summing over $t = 0, \dots, T-1$, and noticing that the right-hand side is telescopic, we have
$$\sum_{t=0}^{T-1} f(x_{t+1}) \leq T f(x^\star) + \frac{1}{2\eta} \|x^\star - x_0\|_2^2.$$
Since $f(x_t)$ is nonincreasing in $t$ by the gradient descent lemma, then $\sum_{t=0}^{T-1} f(x_{t+1}) \geq T f(x_T)$, and so we can write
$$T f(x_T) \leq T f(x^\star) + \frac{1}{2\eta} \|x^\star - x_0\|_2^2 \quad\Longrightarrow\quad f(x_T) \leq f(x^\star) + \frac{1}{2T\eta} \|x^\star - x_0\|_2^2.$$
So, we have proved the following.
Theorem L12.6. Let $f : \mathbb{R}^n \to \mathbb{R}$ be convex and $L$-smooth with minimizer $x^\star \in \mathbb{R}^n$, and
$0 < \eta \leq \frac{1}{L}$. The $t$-th point $x_t \in \mathbb{R}^n$ produced by gradient descent satisfies
$$f(x_t) - f(x^\star) \leq \frac{\|x^\star - x_0\|_2^2}{2 t \eta} \qquad t = 1, 2, 3, \dots$$
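The following sketch numerically verifies the rate of Theorem L12.6 on a convex quadratic; the matrix, starting point, and horizon are arbitrary choices.

import numpy as np

# f(x) = 1/2 x^T Q x is convex and L-smooth with L = ||Q||_2; its minimizer is x* = 0.
Q = np.diag([1.0, 4.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = 10.0
eta = 1.0 / L
x0 = np.array([1.0, 1.0, 1.0])
x = x0.copy()
for t in range(1, 201):
    x = x - eta * grad(x)
    bound = np.dot(x0, x0) / (2 * t * eta)   # ||x* - x0||^2 / (2 t eta), with x* = 0
    assert f(x) <= bound + 1e-12
print("bound of Theorem L12.6 holds for all t checked")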
L12.2.4 Convergence in iterates
One might wonder whether Theorem L12.6 also implies that the distance between $x_t$ and
$x^\star$ shrinks at the same rate, say $\|x_t - x^\star\|_2 \leq O(\frac{1}{t})$. Unfortunately, such a result is false
without a stronger assumption than $L$-smoothness. Indeed, as shown by Nesterov, Y. [Nes18]
(Theorem 2.1.7 in his book), the convergence to the optimal point might be arbitrarily slow.
L12.3 Faster convergence via Polyak-Łojasiewicz condition
We now show that under stronger assumptions on the function $f$, gradient descent can be
shown to converge in iterates to the optimal point; that is, not only does the function value
converge to the optimum, but so does $\|x_t - x^\star\|_2$ (to zero). Furthermore, this convergence happens
within order $\log(\frac{1}{\varepsilon})$ iterations, in contrast with what we saw in the previous section, where
$\mathrm{poly}(1/\varepsilon)$ iterations were needed to reach an approximation guarantee of $\varepsilon$.
L12.3.1 The PŁ condition
While some course material considers strong convexity (defined in Lecture 4) to give
accelerated rates for gradient descent, we instead focus on a much more general condition
called the Polyak-Łojasiewicz (PŁ) condition, introduced independently by Polyak [Pol63]
and Łojasiewicz [Łoj63] in 1963.
Fundamentally, the PŁ condition establishes a lower bound on the norm of the gradient,
which increases as the point becomes more and more suboptimal.
Definition L12.2 (PŁ condition). A lower-bounded function $f : \mathbb{R}^n \to \mathbb{R}$ satisfies the
Polyak-Łojasiewicz (PŁ) condition if there exists $\mu > 0$ such that
$$\frac{1}{2} \|\nabla f(x)\|_2^2 \geq \mu \left( f(x) - f^\star \right) \qquad \forall x \in \mathbb{R}^n,$$
where $f^\star := \inf f$.
The benefits of this approach compared to the standard approach of considering strongly
convex functions are many:
• Strong convexity implies the PŁ condition but not vice versa. In particular,

  Theorem L12.7. If $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable and $\nabla^2 f(x) \succeq \mu I$ for all $x \in \mathbb{R}^n$ (that is, $f$ is $\mu$-strongly convex on $\mathbb{R}^n$), then $f$ satisfies the PŁ condition with PŁ constant $\mu$.

  [You should try to prove this!] The converse is not true, since any constant function
  trivially satisfies the PŁ condition but not strong convexity.
• The proof of the exponential rate is actually simpler in this higher generality.
• The PŁ condition does not imply convexity (unlike strong convexity); see the numerical sketch after this list for a concrete example.
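To illustrate the last point, a standard example in this context (see [KNS16]) is $f(x) = x^2 + 3\sin^2(x)$, which is nonconvex but satisfies the PŁ condition. The sketch below checks the PŁ inequality numerically with $\mu = 1/32$ on a grid; both the constant and the grid range are working assumptions for this illustration, not a proof.

import numpy as np

# Nonconvex function satisfying the PL condition; f* = 0 is attained at x = 0.
f = lambda x: x**2 + 3 * np.sin(x)**2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)       # derivative of 3 sin^2(x) is 3 sin(2x)
xs = np.linspace(-10, 10, 100001)
mu = 1.0 / 32.0
print(np.all(0.5 * grad(xs)**2 >= mu * (f(xs) - 0.0)))  # expected: True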
L12.3.2 Gradient descent’s convergence rate with the PŁ condition
Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $L$-smooth lower-bounded function that satisfies the PŁ condition for
some PŁ constant $\mu > 0$. Using the gradient descent lemma (Theorem L12.3) and the PŁ
condition together, for $\eta = \frac{1}{L}$ we have
$$\begin{aligned}
f(x_{t+1}) &\leq f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|_2^2 \\
&\leq f(x_t) - \frac{\mu}{L} \left( f(x_t) - f^\star \right) \\
&= \left(1 - \frac{\mu}{L}\right) f(x_t) + \frac{\mu}{L} f^\star.
\end{aligned}$$
Subtracting $f^\star$ from both sides yields
$$f(x_{t+1}) - f^\star \leq \left(1 - \frac{\mu}{L}\right) \left( f(x_t) - f^\star \right).$$
Solving the recurrence, we have proved the following.
Theorem L12.8. Let $f : \mathbb{R}^n \to \mathbb{R}$ be an $L$-smooth function lower-bounded by $f^\star$ that
satisfies the PŁ condition for some PŁ constant $\mu > 0$, and let the stepsize be $\eta = \frac{1}{L}$. The $t$-th
point $x_t \in \mathbb{R}^n$ produced by gradient descent satisfies
$$f(x_t) - f^\star \leq \left(1 - \frac{\mu}{L}\right)^t \left( f(x_0) - f^\star \right).$$
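The sketch below illustrates Theorem L12.8 on a strongly convex quadratic, which satisfies the PŁ condition by Theorem L12.7; the matrix, starting point, and horizon are arbitrary choices.

import numpy as np

# f(x) = 1/2 x^T Q x with eigenvalues in [mu, L] is mu-strongly convex, hence PL with
# constant mu. We verify the geometric decay of the gap under eta = 1/L (here f* = 0).
Q = np.diag([0.5, 2.0, 8.0])
mu, L = 0.5, 8.0
f = lambda x: 0.5 * x @ Q @ x
x = np.array([1.0, -2.0, 3.0])
gap0 = f(x)
for t in range(1, 101):
    x = x - (1.0 / L) * (Q @ x)
    assert f(x) <= (1 - mu / L) ** t * gap0 + 1e-12
print("geometric decay of Theorem L12.8 verified for all t checked")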
L12.3.3 Convergence in iterates
We now establish that a function that satisfies the PŁ condition must attain a minimum
and that the iterates produced by gradient descent converge at a rate exponential in 𝑡.
We show this from first principles, following Polyak, B. T. [Pol63]‘s original proof.
Theorem L12.9. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth, lower bounded by $f^\star$, and satisfy the
PŁ condition with PŁ constant $\mu > 0$. Then, the sequence $\{x_t\}$ of iterates produced by
gradient descent instantiated with $0 < \eta \leq \frac{1}{L}$ converges to some point $x^\star \in \mathbb{R}^n$, with
$$\|x_t - x^\star\|_2^2 \leq \frac{8 \eta L^2}{\mu^2} \left(1 - \frac{\mu}{L}\right)^{t-1} \left( f(x_0) - f^\star \right).$$
Hence, by continuity and using Theorem L12.8, $\lim_{t \to \infty} f(x_t) = f(x^\star) = f^\star$.
Proof. At all times $t$, we have
$$\begin{aligned}
\|x_{t+1} - x_t\|_2^2 &= \eta^2 \|\nabla f(x_t)\|_2^2 \\
&\leq 2\eta \left( f(x_t) - f(x_{t+1}) \right) && \text{(gradient descent lemma)} \\
&\leq 2\eta \left( f(x_t) - f^\star \right) \\
&\leq 2\eta \left(1 - \frac{\mu}{L}\right)^t \left( f(x_0) - f^\star \right) && \text{(from Theorem L12.8)}.
\end{aligned}$$
To simplify notation, let $\alpha^2 := 1 - \frac{\mu}{L}$. Clearly, $0 \leq \alpha < 1$. To show convergence of the
sequence $\{x_t\}$, we will show that it is a Cauchy sequence and invoke the completeness of
$\mathbb{R}^n$. Specifically, for any $m < n$,
$$\begin{aligned}
\|x_m - x_n\|_2 &\leq \sum_{t=m}^{n-1} \|x_t - x_{t+1}\|_2 \\
&\leq \sqrt{2\eta \left( f(x_0) - f^\star \right)} \sum_{t=m}^{n-1} \alpha^t \\
&\leq \sqrt{2\eta \left( f(x_0) - f^\star \right)}\ \frac{\alpha^m}{1 - \alpha}. && \text{(geometric series of ratio $< 1$)}
\end{aligned}$$
This shows that $\|x_m - x_n\|_2 \to 0$ as $m, n \to \infty$; hence, $\{x_t\}$ is a Cauchy sequence (by
definition). By completeness of $\mathbb{R}^n$, it follows that $x_t$ converges to some $x^\star$. Letting $m = t$
and $n \to \infty$ above,
$$\|x_t - x^\star\|_2 \leq \sqrt{2\eta \left( f(x_0) - f^\star \right)}\ \frac{\alpha^t}{1 - \alpha}.$$
Squaring, we obtain
$$\begin{aligned}
\|x_t - x^\star\|_2^2 &\leq 2\eta \left( f(x_0) - f^\star \right) \frac{(\alpha^2)^t}{(1 - \alpha)^2} \\
&\leq 8\eta \left( f(x_0) - f^\star \right) \frac{(\alpha^2)^t}{\alpha^2 (1 - \alpha^2)^2}. && \left(\text{since } \tfrac{1}{(1-\alpha)^2} \leq \tfrac{4}{\alpha^2 (1 - \alpha^2)^2}\right)
\end{aligned}$$
Plugging in the definition $\alpha^2 = 1 - \frac{\mu}{L}$, we finally obtain
$$\|x_t - x^\star\|_2^2 \leq \frac{8 \eta L^2}{\mu^2} \left(1 - \frac{\mu}{L}\right)^{t-1} \left( f(x_0) - f^\star \right),$$
which is the statement.
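As with the previous results, the iterate bound of Theorem L12.9 can be checked numerically; the sketch below reuses the strongly convex (hence PŁ) quadratic from before, with arbitrary example data.

import numpy as np

# f(x) = 1/2 x^T Q x, minimizer x* = 0, mu = 0.5, L = 8, eta = 1/L.
Q = np.diag([0.5, 2.0, 8.0])
mu, L, eta = 0.5, 8.0, 1.0 / 8.0
f = lambda x: 0.5 * x @ Q @ x
x = np.array([1.0, -2.0, 3.0])
gap0 = f(x)
for t in range(1, 101):
    x = x - eta * (Q @ x)
    bound = (8 * eta * L**2 / mu**2) * (1 - mu / L) ** (t - 1) * gap0
    assert np.dot(x, x) <= bound + 1e-12   # ||x_t - x*||^2 against the stated bound
print("iterate bound of Theorem L12.9 verified for all t checked")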
Further readings
The book by Nesterov, Y. [Nes18] is an authoritative reference for a comprehensive treat-
ment of optimization algorithms for convex and nonconvex functions.
For properties of functions that satisfy the PŁ condition, the paper by Karimi, H., Nutini,
J., & Schmidt, M. [KNS16] is an approachable source.
[Nes18] Nesterov, Y. (2018). Lectures on Convex Optimization. Springer International
Publishing. https://link.springer.com/book/10.1007/978-3-319-91578-4
[Pol63] Polyak, B. T. (1963). Gradient methods for the minimisation of functionals. USSR
Computational Mathematics and Mathematical Physics, 3(4), 864–878. https://doi.org/10.1016/0041-5553(63)90382-3
[Łoj63] Łojasiewicz, S. (1963). A topological property of real analytic subsets. Coll. Du
CNRS, Les équations Aux Dérivées Partielles, 117(87–89), 2.
[KNS16] Karimi, H., Nutini, J., & Schmidt, M. (2016). Linear Convergence of Gradient and
Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition. Machine
Learning and Knowledge Discovery in Databases, 795–811. https://doi.org/10.1007/978-3-319-46128-1_50
Changelog
Mar 31, 2025: Fixed some typos (thanks Khizer Shahid for all the help!)
Apr 1, 2025: Applied more suggestions (thanks Khizer Shahid!).
These notes are class material that has not undergone formal peer review. The TAs and I are grateful
for any reports of typos.
