Gabriele Farina - Acceleration and momentum

MIT 6.7220/15.084 — Nonlinear Optimization Tue, Mar 7th 2024
Lecture 8
Acceleration and momentum
Instructor: Prof. Gabriele Farina ( gfarina@mit.edu)★
★These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
In a groundbreaking paper in 1983, Nesterov, Y. [Nes83] showed that a simple variant of gradient
descent—called accelerated gradient descent and applicable to any 𝐿-smooth convex function—pro-
duces iterates with optimality gap 𝑓(𝑥𝑡) − 𝑓⋆ of order 1/𝑡2, as opposed to the 1/𝑡 rate seen in the
previous lecture. The intuition behind accelerated gradient descent is notoriously hard to grasp. The
original proof, rife with algebraic manipulations, is notoriously elusive and has led several authors
to investigate what principles make acceleration possible at a deep level, hoping to generalize the
fundamental principles beyond just gradient descent. These efforts include at least the following
directions.
• Some authors have explained accelerated gradient descent as a reflection of a discretization of
specific ordinary differential equations. This includes the works [SBC16], [KBB15], [WWJ16].
• The work [BLS15] proposed a simple geometric explanation for the possibility of acceleration
based on certain properties of balls in Euclidean space.
• The works [WA18] and [CST21] proposed an interpretation leveraging in light of certain ad-
vancements in the theory of online optimization algorithm.
• The work [AS22] analyzed acceleration as an approximation of another—much more well-un-
derstood—method called the “proximal point method”.
Finally, Allen-Zhu, Z., & Orecchia, L. [AO17] proposed a different framing, whereby accelerated
gradient descent is simply an interpolation (“linear coupling”) between two fundamental descent
modes: gradient descent—fast for large gradients—and mirror descent—fast for small gradients. This
is the point of view we will adopt today.
1 A tale of two descent modes: a second look at the descent lemmas
Consider a convex and 𝐿-smooth function 𝑓 : ℝ𝑛 → ℝ, and let 𝑥⋆ ∈ ℝ𝑛 be a minimizer of 𝑓. Consider
the iterates 𝑥𝑡 produced by gradient descent run with step size 𝜂 > 0. The gradient descent lemma
and the Euclidean mirror descent lemma we saw in Lecture 7 provide two conceptually different
mechanisms for measuring progress at each gradient descent step.
• The gradient descent lemma asserts that the progress made in the function value in two con-
secutive iterates 𝑥𝑡 and 𝑥𝑡+1 is at least as big as the norm of the gradient of 𝑓 at 𝑥𝑡: provided
𝜂 ≤ 1
𝐿 , then
(𝑓(𝑥𝑡+1) − 𝑓⋆) ≤ (𝑓(𝑥𝑡) − 𝑓⋆) − 𝜂
2 ‖∇𝑓(𝑥𝑡)‖2
2.
• The Euclidean mirror descent lemma asserts that
‖𝑥𝑡+1 − 𝑥⋆‖2
2 ≤ ‖𝑥𝑡 − 𝑥⋆‖2
2 + 𝜂2‖∇𝑓(𝑥𝑡)‖2
2 − 2𝜂 ⋅ (𝑓(𝑥𝑡) − 𝑓⋆). (1)
Hence, the Euclidean mirror descent lemma establishes that the distance from the optimal
solution decreases fast when the optimality gap 𝑓(𝑥𝑡) − 𝑓⋆ is large and the gradient norm
‖∇𝑓(𝑥𝑡)‖2 is small.
Remark 1.1. When the stepsize 𝜂 is chosen so that 0 < 𝜂 ≤ 1
𝐿 , we can apply the gradient
descent lemma once in (1) and find that
‖𝑥𝑡+1 − 𝑥⋆‖2
2 ≤ ‖𝑥𝑡 − 𝑥⋆‖2
2 − 2𝜂 ⋅ (𝑓 (𝑥𝑡+1) − 𝑓⋆),
which implies a monotonic decrease in the Euclidean distance to optimality.
The two lemmas focus on two different performance metrics. The gradient descent lemma focuses
on progress in the function value. The Euclidean mirror descent lemma focuses on progress on the
Euclidean distance to optimality.
1.1 A thought experiments
We will consider a thought experiment to build intuition behind the construction of accelerated
gradient descent. Imagine running gradient descent on an 𝐿-smooth function using stepsize 𝜂. We
now consider two extreme cases.
■ Large gradients. As a first case, we consider the case of “large” gradient norms.
Theorem 1.1. Suppose that all the gradients of the points produced by gradient descent satisfy
‖∇𝑓(𝑥𝑡)‖2
2 ≥ 𝛾 at all 𝑡 = 0, 1, ...
for some constant 𝛾 > 0. In this case, using the stepsize 𝜂 = 1
𝐿 , after 𝑇half ≔ 𝐿
𝛾 (𝑓(𝑥0) − 𝑓⋆) iter-
ations the optimality gap will halve, that is,
𝑓(𝑥𝑇half ) − 𝑓⋆ ≤
𝑓(𝑥0) − 𝑓⋆
2 .
Proof . The gradient descent lemma implies that
𝑓(𝑥𝑡+1) − 𝑓 (𝑥𝑡) ≤ −
1
2𝐿‖∇𝑓(𝑥𝑡)‖2
2 ≤ −
𝛾
2𝐿 .
So, after 𝑇half iterations the function value is
𝑓(𝑥𝑇half ) − 𝑓 (𝑥0) ≤ −
𝛾
2𝐿 𝑇half = −
1
2(𝑓(𝑥0) − 𝑓⋆).
Rearranging yields the statement. □
■ Small gradients. On the other extreme, we consider the case in which all the gradients encountered
are “small”, in the sense that their square norm is below some threshold value 𝛾.
Theorem 1.2. Consider the case where
‖∇𝑓(𝑥𝑡)‖2
2 < 𝛾 at all 𝑡 = 0, 1, ...
for some constant 𝛾 > 0. In this case, using the stepsize 𝜂 ≔ (𝑓(𝑥0) − 𝑓⋆)/(2𝛾), after
𝑇half ≔ 4𝛾 ‖𝑥0 − 𝑥⋆‖2
2
(𝑓(𝑥0) − 𝑓⋆)2
iterations we will find a point 𝑥half ∈ {𝑥0, 𝑥1, ..., 𝑥𝑇half } with optimality gap
𝑓(𝑥half) − 𝑓⋆ ≤
𝑓(𝑥0) − 𝑓⋆
2 .
Proof . In this case, for every 𝜂 > 0, the Euclidean mirror descent lemma guarantees that
1
𝑇 ∑
𝑇
𝑡=1
𝑓(𝑥𝑡) ≤ 𝑓⋆ +
1
2𝜂𝑇 ‖𝑥⋆ − 𝑥0‖2
2 +
𝜂
2𝑇 ∑
𝑇 −1
𝑡=0
‖∇𝑓(𝑥𝑡)‖2
2
< 𝑓⋆ +
1
2𝜂𝑇 ‖𝑥⋆ − 𝑥0‖2
2 +
𝜂𝛾
2 .
Rearranging terms, we find
1
𝑇 ∑
𝑇
𝑡=1
(𝑓(𝑥𝑡) − 𝑓⋆) <
1
2𝜂𝑇 ‖𝑥⋆ − 𝑥0‖2
2 +
𝜂𝛾
2
Setting 𝜂 = 𝑓(𝑥0)−𝑓⋆
2𝛾 and plugging the value of 𝑇half given in the statement, we have
1
𝑇 ∑
𝑇
𝑡=1
(𝑓(𝑥𝑡) − 𝑓⋆) <
𝛾
𝑇 (𝑓(𝑥0) − 𝑓⋆)
‖𝑥⋆ − 𝑥0‖2
2 +
𝑓(𝑥0) − 𝑓⋆
4
= 𝑓(𝑥0) − 𝑓⋆
2 .
Finally, recognizing that the minimum is upper bounded by the average, we conclude that
min
𝑇
𝑡=1 𝑓(𝑥𝑡) − 𝑓⋆ <
𝑓(𝑥0) − 𝑓⋆
2 ,
which implies the statement. □
■ Balancing the two cases. In summary, the two cases above reveal that, assuming the stepsize is
chosen well:
• when ‖∇𝑓(𝑥𝑡)‖2
2 ≥ 𝛾 at all times 𝑡, we can halve the optimality gap within
𝐿
𝛾 (𝑓(𝑥0) − 𝑓⋆)
iterations; and
• when ‖∇𝑓(𝑥𝑡)‖2
2 ≤ 𝛾 at all times 𝑡, we can halve the optimality gap within
4𝛾 ‖𝑥0 − 𝑥⋆‖2
2
(𝑓(𝑥0) − 𝑓⋆)2
iterations.
In the first case, the number of required iterations decreases as 𝛾 increases, while in the second
case, it increases. The value of 𝛾 that minimizes the maximum halving time across the two cases is
therefore attained when
𝐿
𝛾 (𝑓(𝑥0) − 𝑓⋆) = 4𝛾 ‖𝑥0 − 𝑥⋆‖2
2
(𝑓(𝑥0) − 𝑓⋆)2 ⟹ 𝛾 =
√𝐿(𝑓(𝑥0) − 𝑓⋆)3/2
2‖𝑥0 − 𝑥⋆‖ .
For such a value of the threshold 𝛾, both cases require at most
𝑇half ≔ 2‖𝑥0 − 𝑥⋆‖2
√𝐿
√𝑓(𝑥0) − 𝑓⋆
iterations to halve the optimality gap.
This is in contrast with running gradient descent with the optimal stepsize 𝜂 = 1
𝐿 , which instead
requires a number of iterations at most (Theorem 2.6 of Lecture 7¹)
¹Theorem 2.6 of Lecture 7 asserts that the 𝑡-th iterate produced by the gradient descent algorithm run with stepsize
𝜂 = 1
𝐿 satisfies 𝑓(𝑥𝑡) − 𝑓⋆ ≤ 𝐿 ‖𝑥0−𝑥⋆‖2
2
2𝑡 . So, after 𝑡 = 𝐿‖𝑥0−𝑥⋆‖2
2
𝑓(𝑥0)−𝑓⋆
iterations, we have 𝑓(𝑥𝑡) − 𝑓⋆ ≤ 𝑓(𝑥0)−𝑓⋆
2 .
𝐿‖𝑥0 − 𝑥⋆‖2
2
𝑓(𝑥0) − 𝑓⋆
,
a quadratic slowdown compared to 𝑇half.
2 Allen-Zhu and Orecchia’s linear coupling
Let 𝑓 : ℝ𝑛 → ℝ be 𝐿-smooth and convex, with minimum value 𝑓⋆ attained in (at least one) point
𝑥⋆ ∈ ℝ𝑛. The discussion above reveals that there is hope to construct an accelerated gradient descent
method. However, the approach is not formal since it might be the case that neither ‖∇𝑓(𝑥𝑡)‖2
2 ≥ 𝛾
at all times, nor ‖∇𝑓(𝑥𝑡)‖2
2 ≤ 𝛾 at all times.
To fix the construction, Allen-Zhu and Orecchia propose an algorithm that performs both
• a step with learning rate 1/𝐿, corresponding to the “large gradient” case above (“short step”);
and
• a step using the larger stepsize corresponding to the “small gradients” case above (“long step”).
The “final” iterate produced by the algorithm is computed as a linear interpolation between these
two steps.
In formulas, the algorithm keeps track of three sequences {𝑥𝑡}, {𝑦𝑡}, {𝑧𝑡}. The sequence 𝑥𝑡 corre-
sponds to the “final” iterate, while the sequences 𝑦𝑡 and 𝑧𝑡 correspond to the short and long steps,
respectively. At the beginning,
𝑥0 = 𝑦0 = 𝑧0.
Then, at each iteration 𝑡, we let
𝑥𝑡+1 ≔ (1 − 𝜏 )𝑦𝑡 + 𝜏 𝑧𝑡 (interpolation with coupling rate 𝜏 )
𝑦𝑡+1 ≔ 𝑥𝑡+1 −
1
𝐿 ∇𝑓(𝑥𝑡+1) (“short” gradient step)
𝑧𝑡+1 ≔ 𝑧𝑡 − 𝛼∇𝑓(𝑥𝑡+1) (“long” gradient step, with stepsize 𝛼)
Remark 2.1. The quantity 𝑧𝑡 = 𝑧0 − 𝛼 ∑𝑡
𝑠=1 ∇𝑓 (𝑥𝑠) keeps track of the sum of past gradients,
which is then combined into the definition of 𝑥𝑡+1. This term is often called momentum.
Intuitively, when the optimization algorithm is unstable, and the gradients go back and forth,
the momentum is small, so the learning rate can be decreased to stabilize the algorithm. On the
other hand, when the algorithm is making steady progress in the same direction, the momentum
term increases the learning rate to accelerate convergence.
The analysis of the convergence rate of this interpolated variant of gradient descent follows the same
conceptual steps we saw last time: first, we will establish an interpolated version of the gradient
descent lemma, and then we will establish an interpolated version of the Euclidean mirror descent
lemma.
2.1 The coupled gradient descent lemma
Since 𝑦𝑡+1 is obtained from 𝑥𝑡+1 by taking a step in the direction −∇𝑓(𝑥𝑡+1) using the theoretically-
optimal stepsize of 1/𝐿, the proof of the gradient descent lemma seen in Lecture 7 applies verbatim,
yielding that all times 𝑡,
𝑓(𝑦𝑡+1) ≤ 𝑓 (𝑥𝑡+1) −
1
2𝐿‖∇𝑓(𝑥𝑡+1)‖2
2.
2.2 The coupled Euclidean mirror descent lemma
The derivation of an interpolated version of the Euclidean mirror descent lemma is significantly more
laborious but only involves elementary techniques. In particular, we have the following.
Theorem 2.1. At all times 𝑡,
𝑓(𝑥𝑡+1) − 𝑓⋆ ≤
1
2𝛼(‖𝑥⋆ − 𝑧𝑡‖2
2 − ‖𝑥⋆ − 𝑧𝑡+1‖2
2 + ‖𝑧𝑡 − 𝑧𝑡+1‖2
2) +
1 − 𝜏
𝜏 (𝑓(𝑦𝑡) − 𝑓 (𝑥𝑡+1))
Proof (Optional) . Like in Lecture 7, we start from the three-point equality to write
⟨𝑧𝑡 − 𝑧𝑡+1, 𝑧𝑡 − 𝑥⋆⟩ =
1
2 (‖𝑥⋆ − 𝑧𝑡‖2
2 − ‖𝑥⋆ − 𝑧𝑡+1‖2
2 + ‖𝑧𝑡 − 𝑧𝑡+1‖2
2).
Using the fact that 𝑧𝑡+1 = 𝑧𝑡 − 𝛼∇𝑓(𝑥𝑡+1), we can therefore write
𝛼⟨∇𝑓(𝑥𝑡+1), 𝑧𝑡 − 𝑥⋆⟩ =
1
2 (‖𝑥⋆ − 𝑧𝑡‖2
2 − ‖𝑥⋆ − 𝑧𝑡+1‖2
2 + ‖𝑧𝑡 − 𝑧𝑡+1‖2
2).
We now use the definition of linear coupling
𝑥𝑡+1 = (1 − 𝜏 )𝑦𝑡 + 𝜏 𝑧𝑡 ⟹ (𝑥𝑡+1 − 𝑥⋆) = (𝑧𝑡 − 𝑥⋆) +
1 − 𝜏
𝜏 (𝑦𝑡 − 𝑥𝑡+1)
to write
⟨∇𝑓(𝑥𝑡+1), 𝑥𝑡+1 − 𝑥⋆⟩
= ⟨∇𝑓(𝑥𝑡+1), 𝑧𝑡 − 𝑥⋆⟩ +
1 − 𝜏
𝜏 ⟨∇𝑓(𝑥𝑡+1), 𝑦𝑡 − 𝑥𝑡+1⟩
= 1
2𝛼(‖𝑥⋆ − 𝑧𝑡‖2
2 − ‖𝑥⋆ − 𝑧𝑡+1‖2
2 + ‖𝑧𝑡 − 𝑧𝑡+1‖2
2) +
1 − 𝜏
𝜏 ⟨∇𝑓(𝑥𝑡+1), 𝑦𝑡 − 𝑥𝑡+1⟩
≤ 1
2𝛼(‖𝑥⋆ − 𝑧𝑡‖2
2 − ‖𝑥⋆ − 𝑧𝑡+1‖2
2 + ‖𝑧𝑡 − 𝑧𝑡+1‖2
2) +
1 − 𝜏
𝜏 (𝑓(𝑦𝑡) − 𝑓 (𝑥𝑡+1)). (Convexity)
Since by convexity
𝑓⋆ ≥ 𝑓(𝑥𝑡+1) + ⟨∇𝑓 (𝑥𝑡+1), 𝑥⋆ − 𝑥𝑡+1⟩ ⟹ ⟨∇𝑓(𝑥𝑡+1), 𝑥𝑡+1 − 𝑥⋆⟩ ≥ 𝑓 (𝑥𝑡+1) − 𝑓⋆,
we obtain the statement. □
Just like in the proof in Lecture 7, we can use the fact that 𝑧𝑡+1 = 𝑧𝑡 − 𝛼∇𝑓(𝑥𝑡+1) in the coupled
Euclidean mirror descent lemma above to find that
𝑓(𝑥𝑡+1) − 𝑓⋆ ≤ 1
2𝛼(‖𝑥⋆ − 𝑧𝑡‖2
2 − ‖𝑥⋆ − 𝑧𝑡+1‖2
2) +
𝛼
2 ‖∇𝑓(𝑥𝑡+1)‖2
2 +
1 − 𝜏
𝜏 (𝑓(𝑦𝑡) − 𝑓(𝑥𝑡+1))
≤ 1
2𝛼(‖𝑥⋆ − 𝑧𝑡‖2
2 − ‖𝑥⋆ − 𝑧𝑡+1‖2
2) + 𝛼𝐿(𝑓 (𝑥𝑡+1) − 𝑓 (𝑦𝑡+1)) +
1 − 𝜏
𝜏 (𝑓(𝑦𝑡) − 𝑓(𝑥𝑡+1)),
where we used the coupled gradient descent lemma in the second inequality. Now, if 𝜏 is chosen
such that
1 − 𝜏
𝜏 = 𝛼𝐿, that is, 𝜏 = 1
1 + 𝛼𝐿 ,
the previous inequality simplifies into the following result.
Theorem 2.2 (Coupling, Lemma 3.2 in [AO17]). Pick 𝜏 such that 1−𝜏
𝜏 = 𝛼𝐿. Then, at all times 𝑡,
𝑓(𝑥𝑡+1) − 𝑓⋆ ≤
1
2𝛼(‖𝑥⋆ − 𝑧𝑡‖2
2 − ‖𝑥⋆ − 𝑧𝑡+1‖2
2) + 𝛼𝐿(𝑓 (𝑦𝑡) − 𝑓 (𝑦𝑡+1)).
2.3 Putting the pieces together
We are now ready to perform the telescoping step that we saw in Lecture 7 using the new coupled
variant of the Euclidean mirror descent lemma given in Theorem 2.2. Specifically, we can prove the
following.
Theorem 2.3. Let 𝛼 and 𝜏 be defined so that
𝛼 ≔ ‖𝑥⋆ − 𝑧0‖2
√2(𝑓(𝑥0) − 𝑓⋆)
, 𝜏 ≔ 1
1 + 𝛼𝐿.
Then, Allen-Zhu and Orecchia’s accelerated gradient descent finds at least one iterate 𝑥𝑡 such
that
𝑓(𝑥𝑡) − 𝑓⋆ ≤
1
2 (𝑓(𝑥0) − 𝑓⋆)
within
𝑇half ≔ 2‖𝑥⋆ − 𝑥0‖2
√2𝐿
√𝑓(𝑥0) − 𝑓⋆
iterations.
Proof . Averaging the inequality in Theorem 2.2 over 𝑡 = 0, 1, ..., 𝑇 − 1, we obtain
1
𝑇 ∑
𝑇
𝑡=1
(𝑓(𝑥𝑡) − 𝑓⋆) ≤
1
2𝑇 𝛼 (‖𝑥⋆ − 𝑧0‖2
2 − ‖𝑥⋆ − 𝑧𝑇 ‖2
2) +
𝛼𝐿
𝑇 (𝑓(𝑦0) − 𝑓 (𝑦𝑇 ))
≤ 1
2𝑇 𝛼 ‖𝑥⋆ − 𝑧0‖2
2 +
𝛼𝐿
𝑇 (𝑓(𝑦0) − 𝑓⋆).
Plugging in the proposed values of 𝛼 and 𝑇half verifies the statement. □
3 Final remarks
The idea of momentum, tracking the sum or average of all past gradients and using it when defining
the next point, is extremely useful in machine learning. It extends well past gradient descent.
While the algorithm above is theoretically safe and interpolates the momentum term, in practice,
people like to rewrite the update step to use momentum directly as
𝑥𝑡+1 ≈ 𝑥𝑡 − 𝜂𝑔𝑡, where 𝑔𝑡 ≔ 𝜇 ∑
𝑡
𝑠=1
(1 − 𝜇)𝑡−𝑠∇𝑓 (𝑥𝑡).
This is essentially what goes on when you call optim.SGD(nesterov=True) from pytorch.
Bibliography
[Nes83] Y. Nesterov, “A method for solving the convex programming problem with convergence
rate 𝑂(1/𝑘2)”, Proceedings of the USSR Academy of Sciences, 1983.
[SBC16] W. Su, S. Boyd, and E. J. Candès, “A Differential Equation for Modeling Nesterov's
Accelerated Gradient Method: Theory and Insights,” Journal of Machine Learning Re-
search, vol. 17, no. 153, pp. 1–43, 2016, [Online]. Available: https://jmlr.org/papers/v
17/15-084.html
[KBB15] W. Krichene, A. Bayen, and P. L. Bartlett, “Accelerated Mirror Descent in Continuous
and Discrete Time,” Advances in Neural Information Processing Systems, vol. 28, 2015.
[WWJ16] A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated
methods in optimization,” Proc. Natl. Acad. Sci. U.S.A., vol. 113, no. 47, p. E7351–
E7358, Nov. 2016, doi: 10.1073/pnas.1614734113.
[BLS15] S. Bubeck, Y. T. Lee, and M. Singh, “A geometric alternative to Nesterov's accelerated
gradient descent.”
[WA18] J.-K. Wang and J. D. Abernethy, “Acceleration through Optimistic No-Regret Dynam-
ics,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[CST21] M. B. Cohen, A. Sidford, and K. Tian, “Relative Lipschitzness in Extragradient Methods
and a Direct Recipe for Acceleration,” in 12th Innovations in Theoretical Computer Sci-
ence Conference (ITCS 2021), 2021.
[AS22] K. Ahn and S. Sra, “Understanding Nesterov's Acceleration via Proximal Point Method,”
in Symposium on Simplicity in Algorithms (SOSA), 2022, pp. 117–130.
[AO17] Z. Allen-Zhu and L. Orecchia, “Linear Coupling: An Ultimate Unification of Gradient
and Mirror Descent,” in 8th Innovations in Theoretical Computer Science Conference
(ITCS 2017), 2017.

Content

Typo or question?

Metadata