􀋂􀊘􀊘􀊘􀊘􀊘􀊘􀊘
􀊘􀊚􀋂􀋂􀊚􀋂􀊚􀄸􀋂􀋂􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀊘􀋂􀊘􀋂􀋂􀋂􀊘􀊚􀊘􀋂􀋂􀊚􀊘􀋂􀋂􀋂􀋂􀋂􀋂􀋂􀋂
􀋂􀋂􀋂􀊘􀋂􀋂􀋂􀊘􀋂􀋂􀋂􀋂􀋂􀋂􀊖􀋂􀊚􀋂􀋂􀋂􀊘􀊘􀋂􀋂􀄾􀊚􀊘􀋂􀊚􀋂􀋂􀋂􀋂􀋂􀋂􀋂􀄾􀋂􀊚􀋂

MIT 6.7220/15.084 — Nonlinear Optimization Tue, Mar 5th 2024
Lecture 7
Gradient descent and descent lemmas
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)
These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
With this lecture, we start exploring first-order optimization methods for nonlinear optimization.
We begin our exploration in the case of minimization of unconstrained differentiable functions, that
is, optimization problems of the form
min_{𝑥}  𝑓(𝑥)   s.t.   𝑥 ∈ ℝⁿ.
In this case, as we have mentioned several times (see, for example, Lecture 2), a point 𝑥 ∈ ℝⁿ solves
the optimization problem only if ∇𝑓(𝑥) = 0. Furthermore, when 𝑓 is convex, this necessary condition
also becomes sufficient (Lecture 3).
1 The fundamental idea of gradient descent
The function 𝑓 might be complicated. A fundamental idea for constructing an optimization
algorithm is to approximate 𝑓 with tractable simpler models and take steps assuming the model is locally
accurate at the current point 𝑥𝑡. For gradient descent and, more generally, first-order methods, the
local model of the function is obtained from truncating the Taylor approximation of 𝑓 around the
current point 𝑥𝑡 to its first order:
𝑓(𝑥) ≈ 𝑓(𝑥𝑡) + ⟨∇𝑓(𝑥𝑡), 𝑥 − 𝑥𝑡⟩   for all 𝑥 “close to” 𝑥𝑡.
The idea of gradient descent is then to move in the direction that minimizes the approximation
of the objective above, that is, move a certain amount 𝜂 > 0 in the direction −∇𝑓(𝑥𝑡) of steepest
descent of the function:
𝑥𝑡+1 ≔ 𝑥𝑡 − 𝜂∇𝑓(𝑥𝑡).   (1)
We will call 𝑥0 the initial point, and the parameter 𝜂 > 0 the “stepsize” or “learning rate”.
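To make the update rule (1) concrete, here is a minimal Python sketch of the iteration; the helper name `gradient_descent` and the toy quadratic objective are illustrative choices, not part of the notes.

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, num_steps):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t) starting from x0."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_steps):
        x = x - eta * grad_f(x)      # step of length eta against the gradient
        iterates.append(x.copy())
    return iterates

# Toy example: f(x) = 0.5 * ||x||^2 has gradient x and is 1-smooth,
# so any stepsize eta <= 1 works.
xs = gradient_descent(lambda x: x, x0=[3.0, -2.0], eta=0.1, num_steps=50)
print(xs[-1])   # close to the unique minimizer (0, 0)
```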
2 Analysis for 𝐿-smooth functions
In order to give concrete rates for the convergence of gradient descent, we will make some
assumptions regarding the smoothness of the function. In particular, we will require a bound on how fast
the function’s gradient can change by moving slightly in any direction. This implies that as we start
moving in the direction −∇𝑓(𝑥𝑡) from 𝑥𝑡, the new gradients we will encounter on the
way will still be mostly aligned with ∇𝑓(𝑥𝑡), and so −∇𝑓(𝑥𝑡) will continue to be a good descent
direction for a while.
2.1 𝐿-smoothness
Specifically, we will require that the gradient ∇𝑓(𝑥) be 𝐿-Lipschitz continuous for some constant
𝐿 ≥ 0. This condition is often called 𝐿-smoothness in the literature. We present it now for general
functions with arbitrary convex domains Ω; today, we will only care about the case Ω = ℝⁿ.
Definition 2.1 (𝐿-smoothness). A differentiable function 𝑓 : Ω → ℝ is 𝐿-smooth if its gradient
is 𝐿-Lipschitz continuous, that is,

‖∇𝑓(𝑥) − ∇𝑓(𝑦)‖₂ ≤ 𝐿‖𝑥 − 𝑦‖₂   ∀𝑥, 𝑦 ∈ Ω.
An immediate consequence of 𝐿-smoothness is that the function admits a quadratic upper bound.
This property will come in extremely handy in the analysis.
Theorem 2.1 (Quadratic upper bound). Let 𝑓 : Ω → ℝ be 𝐿-smooth on a convex domain Ω.
Then, we can upper bound the function 𝑓 as

𝑓(𝑦) ≤ 𝑓(𝑥) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩ + (𝐿/2)‖𝑦 − 𝑥‖₂²   ∀𝑥, 𝑦 ∈ Ω.   (2)
Proof. The idea is simple: we express the growth 𝑓(𝑦) − 𝑓(𝑥) as the integral of the gradient along
the line segment connecting 𝑥 to 𝑦, and then use the Lipschitz continuity of ∇𝑓 to bound how much
the gradient can grow along the way:

𝑓(𝑦) − 𝑓(𝑥) = ∫₀¹ ⟨∇𝑓(𝑥 + 𝑡 ⋅ (𝑦 − 𝑥)), 𝑦 − 𝑥⟩ d𝑡
            = (∫₀¹ ⟨∇𝑓(𝑥 + 𝑡 ⋅ (𝑦 − 𝑥)) − ∇𝑓(𝑥), 𝑦 − 𝑥⟩ d𝑡) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩
            ≤ (∫₀¹ ‖∇𝑓(𝑥 + 𝑡 ⋅ (𝑦 − 𝑥)) − ∇𝑓(𝑥)‖₂ ⋅ ‖𝑦 − 𝑥‖₂ d𝑡) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩
            ≤ (∫₀¹ 𝑡𝐿‖𝑦 − 𝑥‖₂² d𝑡) + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩
            = (𝐿/2)‖𝑦 − 𝑥‖₂² + ⟨∇𝑓(𝑥), 𝑦 − 𝑥⟩.

Rearranging, we obtain the statement.
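As a quick sanity check of the bound (2), the following snippet (an illustration, not part of the notes) verifies it numerically for the one-dimensional function 𝑓(𝑥) = log(1 + 𝑒ˣ), which is 𝐿-smooth with 𝐿 = 1/4 since 𝑓″(𝑥) = 𝜎(𝑥)(1 − 𝜎(𝑥)) ≤ 1/4, where 𝜎 is the logistic function.

```python
import numpy as np

f = lambda x: np.logaddexp(0.0, x)          # log(1 + e^x), computed stably
grad = lambda x: 1.0 / (1.0 + np.exp(-x))   # f'(x) = sigma(x)
L = 0.25                                    # bound on |f''|

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.uniform(-10.0, 10.0, size=2)
    upper = f(x) + grad(x) * (y - x) + 0.5 * L * (y - x) ** 2
    assert f(y) <= upper + 1e-12            # quadratic upper bound (2)
print("quadratic upper bound verified on random pairs")
```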
We also mention the following characterization.
Theorem 2.2. For twice-differentiable functions 𝑓 : Ω → ℝ defined on an open set Ω ⊆ ℝⁿ, an
equivalent condition to 𝐿-smoothness is

|𝑣⊤∇²𝑓(𝑥)𝑣| ≤ 𝐿   ∀𝑥 ∈ Ω, 𝑣 ∈ ℝⁿ : ‖𝑣‖₂ = 1.
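For instance (an illustrative sketch, not part of the notes), for a quadratic 𝑓(𝑥) = ½ 𝑥⊤𝑄𝑥 + 𝑏⊤𝑥 the Hessian is the constant matrix 𝑄, so Theorem 2.2 gives 𝐿 = maxᵢ |𝜆ᵢ(𝑄)|; the snippet below checks the Lipschitz condition of Definition 2.1 at random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = (A + A.T) / 2                          # symmetric Hessian of the quadratic
b = rng.standard_normal(n)
grad = lambda x: Q @ x + b                 # gradient of f(x) = 0.5 x^T Q x + b^T x
L = np.max(np.abs(np.linalg.eigvalsh(Q)))  # largest Hessian eigenvalue in magnitude

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
print("gradient is L-Lipschitz with L =", round(L, 3))
```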
2.2 Convergence in gradient norm: The gradient descent lemma
We can use the quadratic upper bound (Theorem 2.1) to quantify the improvement of one step of
the gradient descent algorithm. This bound on the improvement is often called the gradient descent
lemma.
Theorem 2.3 (Gradient descent lemma). Let 𝑓 : ℝⁿ → ℝ be 𝐿-smooth. Then, for any 0 < 𝜂 ≤ 1/𝐿,
each step of gradient descent (1) guarantees

𝑓(𝑥𝑡+1) ≤ 𝑓(𝑥𝑡) − (𝜂/2)‖∇𝑓(𝑥𝑡)‖₂².
Proof. We start by writing (2) for the choice 𝑥 = 𝑥𝑡, 𝑦 = 𝑥𝑡+1:

𝑓(𝑥𝑡+1) ≤ 𝑓(𝑥𝑡) + ⟨∇𝑓(𝑥𝑡), 𝑥𝑡+1 − 𝑥𝑡⟩ + (𝐿/2)‖𝑥𝑡+1 − 𝑥𝑡‖₂².

Plugging in the gradient descent step 𝑥𝑡+1 = 𝑥𝑡 − 𝜂∇𝑓(𝑥𝑡), we therefore obtain

𝑓(𝑥𝑡+1) ≤ 𝑓(𝑥𝑡) − 𝜂‖∇𝑓(𝑥𝑡)‖₂² + (𝐿/2)𝜂²‖∇𝑓(𝑥𝑡)‖₂²
        = 𝑓(𝑥𝑡) − 𝜂(1 − (𝐿/2)𝜂)‖∇𝑓(𝑥𝑡)‖₂².

If 𝜂 ≤ 1/𝐿, the parenthesis satisfies 1 − (𝐿/2)𝜂 ≥ 1/2, and the result follows.
The previous result shows that for 𝐿-smooth functions, there exists a good choice of learning rate
(namely, 𝜂 = 1/𝐿) such that each step of gradient descent is guaranteed to decrease the function value
whenever the current point does not have a zero gradient.
Assuming that 𝑓 is lower bounded, the gradient descent lemma also implies that the gradients of
the points produced by gradient descent must eventually become small. Indeed, imagine running the
algorithm for 𝑇 iterations. Then, by repeatedly using the gradient descent lemma, we would have
𝑓(𝑥𝑇) ≤ 𝑓(𝑥0) − (𝜂/2) ∑_{𝑡=0}^{𝑇−1} ‖∇𝑓(𝑥𝑡)‖₂².

Assume now that 𝑓 is lower bounded by 𝑓⋆ ≔ inf 𝑓. Since 𝑓(𝑥𝑇) ≥ 𝑓⋆, the above implies that

∑_{𝑡=0}^{𝑇−1} ‖∇𝑓(𝑥𝑡)‖₂² ≤ (2/𝜂)(𝑓(𝑥0) − 𝑓⋆).

So, at least one among the {‖∇𝑓(𝑥𝑡)‖₂²}_{𝑡=0}^{𝑇−1} must be upper bounded by (2/(𝜂𝑇))(𝑓(𝑥0) − 𝑓⋆). More
formally, we have the following.
Theorem 2.4. Let 𝑓 : ℝⁿ → ℝ be 𝐿-smooth and lower bounded by 𝑓⋆ ≔ inf 𝑓, and let 0 < 𝜂 ≤ 1/𝐿.
Then, by running gradient descent for 𝑇 iterations, at least one of the points {𝑥𝑡} encountered must satisfy

‖∇𝑓(𝑥𝑡)‖₂ ≤ √(2(𝑓(𝑥0) − 𝑓⋆)/(𝜂𝑇)).
We remark that the above result does not require convexity of 𝑓.
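As an illustration of this point (the example below is mine, not from the notes), the function 𝑓(𝑥) = ∑ᵢ (1 − cos 𝑥ᵢ) is nonconvex, lower bounded by 𝑓⋆ = 0, and 1-smooth by Theorem 2.2; the snippet checks that the smallest gradient norm along a gradient descent run obeys the bound of Theorem 2.4.

```python
import numpy as np

f = lambda x: np.sum(1.0 - np.cos(x))    # nonconvex, lower bounded by 0
grad = lambda x: np.sin(x)               # |f''| <= 1, so L = 1
eta, T = 1.0, 200                        # stepsize eta = 1/L

x = np.array([2.0, -1.5, 0.7])
f0, f_star = f(x), 0.0
grad_norms = []
for _ in range(T):
    grad_norms.append(np.linalg.norm(grad(x)))
    x = x - eta * grad(x)

bound = np.sqrt(2 * (f0 - f_star) / (eta * T))
print(min(grad_norms), "<=", bound)      # guarantee of Theorem 2.4
```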
2.3 Convergence in convex function value: The Euclidean mirror descent lemma
The result in Theorem 2.4 only guarantees the decrease of the objective’s gradient norm, not of the
function value. While a small gradient indicates approximate stationarity, it does not imply that the
function value is close to optimal. To see this, consider the function 𝑓(𝑥) = 𝜀 log(1 + 𝑒ˣ). The
gradient satisfies ∇𝑓(𝑥) ≤ 𝜀 for all 𝑥 ∈ ℝ, and so 𝑥 might be arbitrarily far from optimal
while the gradient is small.
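A quick numerical look at this example (with the arbitrary choice 𝜀 = 0.01): the gradient ∇𝑓(𝑥) = 𝜀 𝑒ˣ/(1 + 𝑒ˣ) never exceeds 𝜀, yet 𝑓(𝑥) − inf 𝑓 ≈ 𝜀𝑥 grows without bound as 𝑥 increases.

```python
import numpy as np

eps = 0.01
f = lambda x: eps * np.logaddexp(0.0, x)       # eps * log(1 + e^x), computed stably
grad = lambda x: eps / (1.0 + np.exp(-x))      # eps * sigmoid(x) <= eps; inf f = 0

for x in [0.0, 10.0, 1000.0]:
    print(f"x = {x:7.1f}   grad = {grad(x):.4f}   f(x) = {f(x):.2f}")
# The gradient stays below eps = 0.01 while the suboptimality keeps growing.
```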
Step I: The Euclidean mirror descent lemma. To give guarantees in function value, we will now
further assume that 𝑓 is convex, and we will leverage what we will call the Euclidean mirror descent
lemma. In a future lecture, we will see a more general version of this lemma, which we will call
mirror descent lemma without further qualifications.
Theorem 2.5 (Euclidean mirror descent lemma). Let 𝑓 : ℝⁿ → ℝ be a convex and differentiable
function. Then, for any choice of stepsize 𝜂, any two consecutive points (𝑥𝑡, 𝑥𝑡+1) produced by
the gradient descent algorithm (1) satisfy

𝑓(𝑥𝑡) ≤ 𝑓(𝑦) + (1/(2𝜂))(‖𝑦 − 𝑥𝑡‖₂² − ‖𝑦 − 𝑥𝑡+1‖₂² + ‖𝑥𝑡+1 − 𝑥𝑡‖₂²)   ∀𝑦 ∈ ℝⁿ.
Proof. The proof rests on the following critical observation, often called the three-point equality,
which can be checked by expanding the squared norms: for any 𝑦 ∈ ℝⁿ,

⟨𝑥𝑡 − 𝑥𝑡+1, 𝑦 − 𝑥𝑡⟩ = −(1/2)(‖𝑦 − 𝑥𝑡‖₂² − ‖𝑦 − 𝑥𝑡+1‖₂² + ‖𝑥𝑡+1 − 𝑥𝑡‖₂²).

Since 𝑓 is convex by hypothesis, it satisfies the linear lower bound we saw in Lecture 3. In
particular, we can lower bound the value of 𝑓(𝑦), for any 𝑦 ∈ ℝⁿ, with the linearization at the point
𝑥𝑡, obtaining

𝑓(𝑦) ≥ 𝑓(𝑥𝑡) + ⟨∇𝑓(𝑥𝑡), 𝑦 − 𝑥𝑡⟩.

Plugging in the relationship (1), that is, 𝑥𝑡+1 = 𝑥𝑡 − 𝜂∇𝑓(𝑥𝑡), we then obtain

𝑓(𝑦) ≥ 𝑓(𝑥𝑡) + (1/𝜂)⟨𝑥𝑡 − 𝑥𝑡+1, 𝑦 − 𝑥𝑡⟩.

Substituting the critical observation above and rearranging, we obtain

𝑓(𝑥𝑡) ≤ 𝑓(𝑦) + (1/(2𝜂))(‖𝑦 − 𝑥𝑡‖₂² − ‖𝑦 − 𝑥𝑡+1‖₂² + ‖𝑥𝑡+1 − 𝑥𝑡‖₂²)   ∀𝑦 ∈ ℝⁿ,

which is the statement.
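Since the three-point equality is the crux of the proof, here is a tiny numerical sanity check (illustrative only) that the identity holds for arbitrary points, with 𝑥𝑡+1 playing no special role.

```python
import numpy as np

rng = np.random.default_rng(1)
xt, xt1, y = rng.standard_normal((3, 4))   # three arbitrary points in R^4
lhs = np.dot(xt - xt1, y - xt)
rhs = -0.5 * (np.linalg.norm(y - xt) ** 2
              - np.linalg.norm(y - xt1) ** 2
              + np.linalg.norm(xt1 - xt) ** 2)
print(np.isclose(lhs, rhs))                # True: the three-point equality holds
```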
Step II: Telescoping. We can use the Euclidean mirror descent lemma above to get a rate of
decrease in terms of the objective function value 𝑓(𝑥𝑡).
Two steps are necessary: first, since 𝑥𝑡+1 − 𝑥𝑡 = −𝜂∇𝑓(𝑥𝑡) by definition of the gradient descent
algorithm (1), we can replace the term ‖𝑥𝑡+1 − 𝑥𝑡‖₂² in the statement of Theorem 2.5 with 𝜂²‖∇𝑓(𝑥𝑡)‖₂²,
obtaining

𝑓(𝑥𝑡) ≤ 𝑓(𝑦) + (1/(2𝜂))(‖𝑦 − 𝑥𝑡‖₂² − ‖𝑦 − 𝑥𝑡+1‖₂²) + (𝜂/2)‖∇𝑓(𝑥𝑡)‖₂²   ∀𝑦 ∈ ℝⁿ.

Assuming 𝑓 is 𝐿-smooth and 𝜂 ≤ 1/𝐿, we can then use the gradient descent lemma (Theorem 2.3)
to bound

(𝜂/2)‖∇𝑓(𝑥𝑡)‖₂² ≤ 𝑓(𝑥𝑡) − 𝑓(𝑥𝑡+1),

obtaining the following corollary.
Corollary 2.1. Let 𝑓 : ℝⁿ → ℝ be convex and 𝐿-smooth, and 0 < 𝜂 ≤ 1/𝐿. Then, any two consecutive
points (𝑥𝑡, 𝑥𝑡+1) produced by gradient descent satisfy

𝑓(𝑥𝑡+1) ≤ 𝑓(𝑦) + (1/(2𝜂))(‖𝑦 − 𝑥𝑡‖₂² − ‖𝑦 − 𝑥𝑡+1‖₂²)   ∀𝑦 ∈ ℝⁿ.
In particular, if 𝑓 has a minimizer 𝑥⋆, the above corollary implies that, whenever 𝜂 ≤ 1/𝐿,

𝑓(𝑥𝑡+1) ≤ 𝑓(𝑥⋆) + (1/(2𝜂))(‖𝑥⋆ − 𝑥𝑡‖₂² − ‖𝑥⋆ − 𝑥𝑡+1‖₂²).

Summing over 𝑡 = 0, ..., 𝑇 − 1, and noticing that the right-hand side telescopes, we have

∑_{𝑡=0}^{𝑇−1} 𝑓(𝑥𝑡+1) ≤ 𝑇𝑓(𝑥⋆) + (1/(2𝜂))‖𝑥⋆ − 𝑥0‖₂².

Since 𝑓(𝑥𝑡) is nonincreasing in 𝑡 by the gradient descent lemma, we have ∑_{𝑡=0}^{𝑇−1} 𝑓(𝑥𝑡+1) ≥ 𝑇𝑓(𝑥𝑇), and
so we can write

𝑇𝑓(𝑥𝑇) ≤ 𝑇𝑓(𝑥⋆) + (1/(2𝜂))‖𝑥⋆ − 𝑥0‖₂²   ⟹   𝑓(𝑥𝑇) ≤ 𝑓(𝑥⋆) + (1/(2𝑇𝜂))‖𝑥⋆ − 𝑥0‖₂².
So, we have proved the following.
Theorem 2.6. Let 𝑓 : ℝⁿ → ℝ be convex and 𝐿-smooth with minimizer 𝑥⋆ ∈ ℝⁿ, and 0 < 𝜂 ≤ 1/𝐿.
The 𝑡-th point 𝑥𝑡 ∈ ℝⁿ produced by gradient descent satisfies

𝑓(𝑥𝑡) − 𝑓(𝑥⋆) ≤ ‖𝑥⋆ − 𝑥0‖₂² / (2𝑡𝜂)   ∀𝑡 = 1, 2, 3, ...
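The O(1/𝑡) rate can be checked numerically; the sketch below (illustrative, with arbitrary data) runs gradient descent on the convex, 𝐿-smooth least-squares objective 𝑓(𝑥) = ½‖𝐴𝑥 − 𝑏‖₂², for which 𝐿 = ‖𝐴⊤𝐴‖₂, and verifies the bound of Theorem 2.6 at every iteration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A.T @ A, 2)             # largest eigenvalue of A^T A
eta = 1.0 / L

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizer of f
f_star = f(x_star)
x0 = np.zeros(5)
x = x0.copy()
for t in range(1, 101):
    x = x - eta * grad(x)
    bound = np.linalg.norm(x_star - x0) ** 2 / (2 * t * eta)
    assert f(x) - f_star <= bound + 1e-9   # guarantee of Theorem 2.6
print("f(x_100) - f* =", f(x) - f_star)
```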
2.4 Convergence in iterates
One might wonder whether Theorem 2.6 also implies that the distance between 𝑥𝑡 and 𝑥⋆ shrinks
at the same rate, say ‖𝑥𝑡 − 𝑥⋆‖₂ ≤ 𝑂(1/𝑡). Unfortunately, such a result is false without a stronger
assumption than 𝐿-smoothness. Indeed, as shown by Nesterov, Y. [Nes18] (Theorem 2.1.7 in his
book), the convergence to the optimal point might be arbitrarily slow.
3 Faster convergence under the Polyak-Łojasiewicz condition
We now show that under stronger assumptions on the function 𝑓, gradient descent can be shown to
converge to the optimal point exponentially fast, and in iterates. This contrasts with what we saw
in the previous section, where poly(1/𝜀) iterations were needed to reach an approximation guarantee
of 𝜀.
3.1 The PŁ condition
While most course material considers strong convexity (with which you played in the first problem
set), we instead focus on a much more general condition called the Polyak-Łojasiewicz (PŁ)
condition, introduced independently by Polyak [Pol63] and Łojasiewicz [Łoj63] in 1963.
Fundamentally, the PŁ condition establishes a lower bound on the norm of the gradient, which
increases as the point becomes more and more suboptimal.
Definition 3.1 (PŁ condition). A lower-bounded function 𝑓 : ℝⁿ → ℝ satisfies the Polyak-Łojasiewicz
(PŁ) condition if there exists 𝜇 > 0 such that

(1/2)‖∇𝑓(𝑥)‖₂² ≥ 𝜇(𝑓(𝑥) − 𝑓⋆)   ∀𝑥 ∈ ℝⁿ,

where 𝑓⋆ ≔ inf 𝑓.
The benefits of this approach compared to the standard approach of considering strongly convex
functions are many:
• strong convexity implies the PŁ condition but not vice versa.
• the proof of the exponential rate is simpler.
• the PŁ condition does not imply convexity (unlike strong convexity).
In particular, the following condition is sufficient (but not necessary) for the PŁ condition:
Theorem 3.1. If 𝑓 : ℝⁿ → ℝ is twice differentiable and ∇²𝑓(𝑥) ≽ 𝜇𝐼 for all 𝑥 ∈ ℝⁿ (that is, 𝑓 is
𝜇-strongly convex on ℝⁿ), then 𝑓 satisfies the PŁ condition with PŁ constant 𝜇.
[▷ You should try to prove this!]
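To see the gap between the two notions concretely, consider least squares with a rank-deficient matrix, 𝑓(𝑥) = ½‖𝐴𝑥 − 𝑏‖₂²: its Hessian 𝐴⊤𝐴 is singular, so 𝑓 is not strongly convex, yet one can show that it satisfies the PŁ condition with 𝜇 equal to the smallest nonzero singular value of 𝐴, squared. The sketch below (my own illustration, with arbitrary data) checks the PŁ inequality at random points.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 6))   # 10 x 6, rank 2
b = rng.standard_normal(10)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
sv = np.linalg.svd(A, compute_uv=False)
mu = np.min(sv[sv > 1e-10]) ** 2           # smallest nonzero singular value, squared
f_star = f(np.linalg.pinv(A) @ b)          # optimal value (minimizers are not unique)

for _ in range(1000):
    x = rng.standard_normal(6)
    assert 0.5 * np.linalg.norm(grad(x)) ** 2 >= mu * (f(x) - f_star) - 1e-9
print("PL condition verified with mu =", round(mu, 3))
```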
3.2 Gradient descent’s convergence rate with the PŁ condition
Let 𝑓 : ℝ𝑛 → ℝ be an 𝐿-smooth lower-bounded function that satisfies the PŁ condition for some PŁ
constant 𝜇 > 0. Using the gradient descent lemma (Theorem 2.3) and the PŁ condition together,
for 𝜂 = 1/𝐿 we have

𝑓(𝑥𝑡+1) ≤ 𝑓(𝑥𝑡) − (1/(2𝐿))‖∇𝑓(𝑥𝑡)‖₂²
        ≤ 𝑓(𝑥𝑡) − (𝜇/𝐿)(𝑓(𝑥𝑡) − 𝑓⋆)
        = (1 − 𝜇/𝐿)𝑓(𝑥𝑡) + (𝜇/𝐿)𝑓⋆.

Subtracting 𝑓⋆ from both sides yields

𝑓(𝑥𝑡+1) − 𝑓⋆ ≤ (1 − 𝜇/𝐿)(𝑓(𝑥𝑡) − 𝑓⋆).
Solving the recurrence, we have proved the following.
Theorem 3.2. Let 𝑓 : ℝⁿ → ℝ be an 𝐿-smooth function, lower bounded by 𝑓⋆, that satisfies the
PŁ condition for some PŁ constant 𝜇 > 0. Let the stepsize be 𝜂 = 1/𝐿. The 𝑡-th point 𝑥𝑡 ∈ ℝⁿ produced
by gradient descent satisfies

𝑓(𝑥𝑡) − 𝑓⋆ ≤ (1 − 𝜇/𝐿)^𝑡 (𝑓(𝑥0) − 𝑓⋆).
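The linear rate can again be checked numerically; the sketch below (illustrative, with arbitrary data) uses a least-squares objective with a full-column-rank 𝐴, so that 𝜇 = 𝜎_min(𝐴)² and 𝐿 = 𝜎_max(𝐴)², and verifies the bound of Theorem 3.2 along the run.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 3))                 # full column rank (generically)
b = rng.standard_normal(10)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

sv = np.linalg.svd(A, compute_uv=False)          # singular values, descending
L, mu = sv[0] ** 2, sv[-1] ** 2                  # smoothness and PL constants
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

x = np.zeros(3)
gap0 = f(x) - f_star
for t in range(1, 51):
    x = x - (1.0 / L) * grad(x)                  # stepsize eta = 1/L
    assert f(x) - f_star <= (1 - mu / L) ** t * gap0 + 1e-12
print("gap after 50 steps:", f(x) - f_star)
```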
3.3 Convergence in iterates
We now establish that a function that satisfies the PŁ condition must attain a minimum and that
the iterates produced by gradient descent must converge at a rate exponential in 𝑡.
We show this from first principles, following Polyak, B. T. [Pol63]’s original proof.
Theorem 3.3. Let 𝑓 : ℝⁿ → ℝ be 𝐿-smooth, lower bounded by 𝑓⋆, and satisfy the PŁ condition
with PŁ constant 𝜇 > 0. Then, the sequence {𝑥𝑡} of iterates produced by gradient descent instantiated
with 0 < 𝜂 ≤ 1/𝐿 converges to some point 𝑥⋆ ∈ ℝⁿ, with

‖𝑥𝑡 − 𝑥⋆‖₂² ≤ (8𝜂𝐿²/𝜇²)(1 − 𝜇/𝐿)^{𝑡−1} (𝑓(𝑥0) − 𝑓⋆).

Hence, by continuity and using Theorem 3.2, lim_{𝑡→∞} 𝑓(𝑥𝑡) = 𝑓(𝑥⋆) = 𝑓⋆.
Proof. At all times 𝑡, we have

‖𝑥𝑡+1 − 𝑥𝑡‖₂² = 𝜂²‖∇𝑓(𝑥𝑡)‖₂²
             ≤ 2𝜂(𝑓(𝑥𝑡) − 𝑓(𝑥𝑡+1))   (gradient descent lemma)
             ≤ 2𝜂(𝑓(𝑥𝑡) − 𝑓⋆)
             ≤ 2𝜂(1 − 𝜇/𝐿)^𝑡 (𝑓(𝑥0) − 𝑓⋆)   (from Theorem 3.2).

To simplify notation, let 𝛼² ≔ 1 − 𝜇/𝐿. Clearly, 0 ≤ 𝛼 < 1. To show convergence of the sequence
{𝑥𝑡}, we will show that it is a Cauchy sequence and invoke the completeness of ℝⁿ. Specifically,
for any 𝑚 < 𝑛,

‖𝑥𝑚 − 𝑥𝑛‖₂ ≤ ∑_{𝑡=𝑚}^{𝑛−1} ‖𝑥𝑡 − 𝑥𝑡+1‖₂
           ≤ √(2𝜂(𝑓(𝑥0) − 𝑓⋆)) ⋅ ∑_{𝑡=𝑚}^{𝑛−1} 𝛼^𝑡
           ≤ √(2𝜂(𝑓(𝑥0) − 𝑓⋆)) ⋅ 𝛼^𝑚/(1 − 𝛼).   (closed form for a geometric series of ratio < 1)

This shows that ‖𝑥𝑚 − 𝑥𝑛‖₂ → 0 as 𝑚, 𝑛 → ∞; hence, {𝑥𝑡} is a Cauchy sequence (by definition).
By completeness of ℝⁿ, it follows that 𝑥𝑡 converges to some 𝑥⋆. Letting 𝑚 = 𝑡 and 𝑛 → ∞ above,

‖𝑥𝑡 − 𝑥⋆‖₂ ≤ √(2𝜂(𝑓(𝑥0) − 𝑓⋆)) ⋅ 𝛼^𝑡/(1 − 𝛼).
Squaring, we obtain

‖𝑥𝑡 − 𝑥⋆‖₂² ≤ 2𝜂(𝑓(𝑥0) − 𝑓⋆) ⋅ (𝛼²)^𝑡/(1 − 𝛼)²
            ≤ 8𝜂(𝑓(𝑥0) − 𝑓⋆) ⋅ (𝛼²)^𝑡/(𝛼²(1 − 𝛼²)²).   (since 1/(1 − 𝛼)² ≤ 4/(𝛼²(1 − 𝛼²)²))

Plugging in the definition 𝛼² = 1 − 𝜇/𝐿, we finally obtain

‖𝑥𝑡 − 𝑥⋆‖₂² ≤ (8𝜂𝐿²/𝜇²)(1 − 𝜇/𝐿)^{𝑡−1} (𝑓(𝑥0) − 𝑓⋆),

which is the statement.
Further readings
The book by Nesterov, Y. [Nes18] is an authoritative reference for a comprehensive treatment of
optimization algorithms for convex and nonconvex functions.
For properties of functions that satisfy the PŁ condition, the paper by Karimi, H., Nutini, J., &
Schmidt, M. [KNS16] is an approachable source.
[Nes18] Y. Nesterov, Lectures on Convex Optimization. Springer International Publishing, 2018.
[Online]. Available: https://link.springer.com/book/10.1007/978-3-319-91578-4
[Pol63] B. T. Polyak, “Gradient methods for the minimisation of functionals,” USSR Computational Mathematics and Mathematical Physics, vol. 3, no. 4, pp. 864–878, Dec. 1963, doi: 10.1016/0041-5553(63)90382-3.
[Łoj63] S. Łojasiewicz, “A topological property of real analytic subsets,” Coll. du CNRS, Les équations aux dérivées partielles, vol. 117, pp. 87–89, 1963.
[KNS16] H. Karimi, J. Nutini, and M. Schmidt, “Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition,” in Machine Learning and Knowledge Discovery in Databases, Springer, Sep. 2016, pp. 795–811. doi: 10.1007/978-3-319-46128-1_50.