



MIT 6.7220/15.084 — Nonlinear Optimization (Spring ‘25) Thu, Apr 3rd 2025
Lecture 13
Acceleration and momentum
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)
In a groundbreaking paper in 1983, Nesterov [Nes83] showed that a simple variant
of gradient descent—called accelerated gradient descent and applicable to any L-smooth
convex function—produces iterates with optimality gap f(x_t) − f⋆ of order 1/t², as opposed
to the 1/t rate seen in the previous lecture. The intuition behind accelerated gradient
descent is notoriously hard to grasp. The original proof, rife with algebraic manipulations,
is famously elusive and has led several authors to investigate what principles make
acceleration possible at a deep level, hoping to generalize them beyond
just gradient descent. These efforts include at least the following directions.
• Some authors have explained accelerated gradient descent as arising from the discretization
of specific ordinary differential equations. This includes the works [SBC16],
[KBB15], and [WWJ16].
• The work [BLS15] proposed a simple geometric explanation for the possibility of
acceleration based on certain properties of balls in Euclidean space.
• The works [WA18] and [CST21] proposed an interpretation in light of
certain advances in the theory of online optimization algorithms.
• The work [AS22] analyzed acceleration as an approximation of another, much better
understood, method called the “proximal point method”.
• Finally, Allen-Zhu and Orecchia [AO17] proposed a different framing, whereby
accelerated gradient descent is simply an interpolation (“linear coupling”) between two
fundamental descent modes: gradient descent, which is fast when gradients are large, and mirror descent,
which is fast when gradients are small. This is the point of view we will adopt today.
L13.1 A second look at the descent lemmas
Consider a convex and L-smooth function f : ℝⁿ → ℝ, and let x⋆ ∈ ℝⁿ be a minimizer of f.
Consider the iterates 𝑥𝑡 produced by gradient descent run with step size 𝜂 > 0. The gradient
descent lemma and the Euclidean mirror descent lemma we saw in Lecture 12 provide two
conceptually different mechanisms for measuring progress at each gradient descent step.
• The gradient descent lemma asserts that the progress made in the function value between
two consecutive iterates x_t and x_{t+1} is at least proportional to the squared norm of the
gradient of f at x_t: provided η ≤ 1/L, then
(f(x_{t+1}) − f⋆) ≤ (f(x_t) − f⋆) − (η/2)‖∇f(x_t)‖₂².
• The Euclidean mirror descent lemma asserts that
‖x_{t+1} − x⋆‖₂² ≤ ‖x_t − x⋆‖₂² + η²‖∇f(x_t)‖₂² − 2η (f(x_t) − f⋆).   (1)
Hence, the Euclidean mirror descent lemma establishes that the distance to the
optimal solution decreases fast when the optimality gap f(x_t) − f⋆ is large and the
gradient norm ‖∇f(x_t)‖₂ is small.
Remark L13.1. When the stepsize η is chosen so that 0 < η ≤ 1/L, we can apply the
gradient descent lemma once in (1) and find that
‖x_{t+1} − x⋆‖₂² ≤ ‖x_t − x⋆‖₂² − 2η (f(x_{t+1}) − f⋆),
which implies a monotonic decrease in the Euclidean distance to optimality.
The two lemmas focus on two different performance metrics. The gradient descent lemma
focuses on progress in the function value. The Euclidean mirror descent lemma focuses on
progress on the Euclidean distance to optimality.
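To make the two inequalities concrete, here is a minimal numerical sanity check of both lemmas along a gradient descent trajectory; the quadratic test function and all names are illustrative choices, not part of the lecture.

```python
import numpy as np

# Illustrative check on f(x) = 0.5 * x^T A x, which is convex and L-smooth with
# L equal to the largest eigenvalue of A, and is minimized at x* = 0.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)              # positive definite Hessian
L = np.linalg.eigvalsh(A).max()      # smoothness constant
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_star, f_star = np.zeros(5), 0.0

eta = 1.0 / L
x = rng.standard_normal(5)
for t in range(20):
    x_next = x - eta * grad(x)
    g2 = grad(x) @ grad(x)
    # Gradient descent lemma: (f(x_{t+1}) - f*) <= (f(x_t) - f*) - (eta/2)*||grad f(x_t)||^2
    assert f(x_next) - f_star <= f(x) - f_star - 0.5 * eta * g2 + 1e-9
    # Euclidean mirror descent lemma (1):
    # ||x_{t+1} - x*||^2 <= ||x_t - x*||^2 + eta^2*||grad f(x_t)||^2 - 2*eta*(f(x_t) - f*)
    lhs = (x_next - x_star) @ (x_next - x_star)
    rhs = (x - x_star) @ (x - x_star) + eta**2 * g2 - 2 * eta * (f(x) - f_star)
    assert lhs <= rhs + 1e-9
    x = x_next
print("both descent lemmas hold along the trajectory")
```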
L13.1.1 A thought experiment
We will consider a thought experiment to build intuition behind the construction of accel-
erated gradient descent. Imagine running gradient descent on an 𝐿-smooth function using
stepsize 𝜂. We now consider two extreme cases.
Large gradients. First, we consider the case of “large” gradient norms.
Theorem L13.1. Suppose that all the gradients of the points produced by gradient
descent satisfy
‖∇f(x_t)‖₂² ≥ γ   at all t = 0, 1, ...
for some constant γ > 0. In this case, using the stepsize η = 1/L, after
T_half ≔ (L/γ)(f(x_0) − f⋆)
iterations the optimality gap will halve, that is,
f(x_{T_half}) − f⋆ ≤ (f(x_0) − f⋆)/2.
Proof. The gradient descent lemma implies that
f(x_{t+1}) ≤ f(x_t) − (1/(2L))‖∇f(x_t)‖₂² ≤ f(x_t) − γ/(2L).
So, after T_half iterations the function value is
f(x_{T_half}) ≤ f(x_0) − (γ/(2L)) · T_half = f(x_0) − (1/2)(f(x_0) − f⋆).
Rearranging yields the statement.
Small gradients. On the other extreme, we consider the case in which all the gradients
encountered are “small”, in the sense that their squared norm is below some threshold γ.
Theorem L13.2. Consider the case where
‖∇f(x_t)‖₂² < γ   at all t = 0, 1, ...
for some constant γ > 0. In this case, using the stepsize η ≔ (f(x_0) − f⋆)/(2γ), after
T_half ≔ 4γ‖x_0 − x⋆‖₂² / (f(x_0) − f⋆)²
iterations we will find a point x_half ∈ {x_0, x_1, ..., x_{T_half}} with optimality gap
f(x_half) − f⋆ ≤ (f(x_0) − f⋆)/2.
Proof. In this case, for every η > 0, the Euclidean mirror descent lemma guarantees that
(1/T) Σ_{t=1}^{T} f(x_t) ≤ f⋆ + (1/(2ηT))‖x⋆ − x_0‖₂² + (η/(2T)) Σ_{t=0}^{T−1} ‖∇f(x_t)‖₂²
  < f⋆ + (1/(2ηT))‖x⋆ − x_0‖₂² + ηγ/2.
Rearranging terms, we find
(1/T) Σ_{t=1}^{T} (f(x_t) − f⋆) < (1/(2ηT))‖x⋆ − x_0‖₂² + ηγ/2.
Setting η ≔ (f(x_0) − f⋆)/(2γ) and plugging in the value of T_half given in the statement, we have
(1/T_half) Σ_{t=1}^{T_half} (f(x_t) − f⋆) < γ‖x⋆ − x_0‖₂² / (T_half (f(x_0) − f⋆)) + (f(x_0) − f⋆)/4 = (f(x_0) − f⋆)/2.
Finally, recognizing that the minimum is upper bounded by the average, we conclude that
min_{t = 1, ..., T_half} (f(x_t) − f⋆) < (f(x_0) − f⋆)/2,
which implies the statement.
Balancing the two cases. In summary, the two cases above reveal that, assuming the
stepsize is chosen well:
• when ‖∇f(x_t)‖₂² ≥ γ at all times t, we can halve the optimality gap within
(L/γ)(f(x_0) − f⋆)
iterations; and
• when ‖∇f(x_t)‖₂² < γ at all times t, we can halve the optimality gap within
4γ‖x_0 − x⋆‖₂² / (f(x_0) − f⋆)²
iterations.
In the first case, the number of required iterations decreases as γ increases, while in the
second case, it increases. The value of γ that minimizes the maximum halving time across
the two cases is therefore attained when
(L/γ)(f(x_0) − f⋆) = 4γ‖x_0 − x⋆‖₂² / (f(x_0) − f⋆)²   ⟺   γ = √L · (f(x_0) − f⋆)^{3/2} / (2‖x_0 − x⋆‖₂).
For such a value of the threshold γ, both cases require at most
T_half ≤ 2‖x_0 − x⋆‖₂ √(L / (f(x_0) − f⋆))
iterations to halve the optimality gap.
This is in contrast with running gradient descent with the optimal stepsize η = 1/L, which
instead requires a number of iterations at most (Theorem L12.6 of Lecture 12¹)
L‖x_0 − x⋆‖₂² / (f(x_0) − f⋆),
a quadratic slowdown compared to T_half.
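As a quick back-of-the-envelope illustration of this gap, the following sketch plugs made-up values of L, f(x_0) − f⋆, and ‖x_0 − x⋆‖₂ into the formulas above; the numbers are arbitrary and only serve to show the square-root speedup.

```python
import math

# Arbitrary illustrative numbers: smoothness constant, initial optimality gap,
# and initial distance to the minimizer.
L, gap0, dist0 = 100.0, 1.0, 10.0

# Threshold gamma that balances the two halving times.
gamma = math.sqrt(L) * gap0**1.5 / (2 * dist0)

t_large = (L / gamma) * gap0                  # halving time, "large gradients" case
t_small = 4 * gamma * dist0**2 / gap0**2      # halving time, "small gradients" case
t_balanced = 2 * dist0 * math.sqrt(L / gap0)  # common value at the balanced gamma
t_vanilla = L * dist0**2 / gap0               # halving time of plain gradient descent

print(t_large, t_small, t_balanced)  # 200.0 200.0 200.0
print(t_vanilla)                     # 10000.0, i.e., t_balanced**2 / 4
```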
L13.2 Allen-Zhu and Orecchia’s linear coupling
Let f : ℝⁿ → ℝ be L-smooth and convex, with minimum value f⋆ attained at (at least) one
point x⋆ ∈ ℝⁿ. The discussion above reveals that there is hope to construct an accelerated
gradient descent method. However, the approach is not formal, since it might be the case
that neither ‖∇f(x_t)‖₂² ≥ γ at all times, nor ‖∇f(x_t)‖₂² < γ at all times.
To fix the construction, Allen-Zhu and Orecchia propose an algorithm that performs both
• a step with learning rate 1/L, corresponding to the “large gradient” case above (“short step”); and
• a step using the larger stepsize corresponding to the “small gradients” case above (“long step”).
The “final” iterate produced by the algorithm is computed as a linear interpolation between
these two steps.
In formulas, the algorithm keeps track of three sequences {𝑥𝑡}, {𝑦𝑡}, {𝑧𝑡}. The sequence 𝑥𝑡
corresponds to the “final” iterate, while the sequences 𝑦𝑡 and 𝑧𝑡 correspond to the short
and long steps, respectively. At the beginning,
𝑥0 = 𝑦0 = 𝑧0.
Then, at each iteration t, we let
x_{t+1} ≔ (1 − τ) y_t + τ z_t                  (interpolation with coupling rate τ)
y_{t+1} ≔ x_{t+1} − (1/L) ∇f(x_{t+1})          (“short” gradient step)
z_{t+1} ≔ z_t − α ∇f(x_{t+1})                  (“long” gradient step, with stepsize α)
Remark L13.2. The quantity
z_t = z_0 − α Σ_{s=1}^{t} ∇f(x_s)
keeps track of the sum of past gradients, which is then combined into the definition of
x_{t+1}. This term is often called momentum.
Intuitively, when the optimization algorithm is unstable and the gradients point back and
forth, the accumulated gradients partially cancel, the momentum term stays small, and the
resulting steps are short, which stabilizes the algorithm. On the other hand, when the
algorithm is making steady progress in the same direction, the accumulated gradients
reinforce each other and the momentum term produces longer steps, accelerating convergence.
A minimal implementation of the coupled update is sketched below.
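The following minimal sketch shows how the three sequences can be maintained in code; the function name, the test problem, and the parameter choices are mine and not from [AO17]. The coupling rate is tied to α as in the analysis below (τ = 1/(1 + αL)).

```python
import numpy as np

def linear_coupling(grad, x0, L, alpha, T):
    """Sketch of the linearly coupled updates (names are illustrative).

    grad  : callable returning the gradient of an L-smooth convex f
    x0    : starting point (x_0 = y_0 = z_0)
    L     : smoothness constant, used for the "short" 1/L step
    alpha : stepsize of the "long" step
    T     : number of iterations
    """
    tau = 1.0 / (1.0 + alpha * L)        # coupling rate: (1 - tau)/tau = alpha*L
    x = y = z = np.asarray(x0, dtype=float)
    iterates = []
    for _ in range(T):
        x = (1 - tau) * y + tau * z      # interpolation between the two modes
        g = grad(x)
        y = x - g / L                    # "short" gradient step
        z = z - alpha * g                # "long" step: accumulates momentum
        iterates.append(x)
    return iterates

# Illustrative usage on f(x) = 0.5 * ||A x||^2, which is convex and L-smooth
# with L equal to the largest eigenvalue of A^T A.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
L = np.linalg.eigvalsh(A.T @ A).max()
grad = lambda x: A.T @ (A @ x)
iterates = linear_coupling(grad, x0=np.ones(10), L=L, alpha=1.0 / L, T=100)
```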
The analysis of the convergence rate of this interpolated variant of gradient descent follows
the same conceptual steps we saw last time: first, we will establish an interpolated version
of the gradient descent lemma, and then we will establish an interpolated version of the
Euclidean mirror descent lemma.
L13.2.1 The coupled gradient descent lemma
Since y_{t+1} is obtained from x_{t+1} by taking a step in the direction −∇f(x_{t+1}) using the
theoretically-optimal stepsize of 1/L, the proof of the gradient descent lemma seen in
Lecture 12 applies verbatim, yielding that at all times t,
f(y_{t+1}) ≤ f(x_{t+1}) − (1/(2L))‖∇f(x_{t+1})‖₂².
L13.2.2 The coupled Euclidean mirror descent lemma
The derivation of an interpolated version of the Euclidean mirror descent lemma is signifi-
cantly more laborious but only involves elementary techniques. In particular, we have the
following.
Theorem L13.3. At all times t,
f(x_{t+1}) − f⋆ ≤ (1/(2α))(‖x⋆ − z_t‖₂² − ‖x⋆ − z_{t+1}‖₂² + ‖z_t − z_{t+1}‖₂²) + ((1 − τ)/τ)(f(y_t) − f(x_{t+1})).
Proof (Optional). Like in Lecture 12, we start from the three-point equality to write
⟨z_t − z_{t+1}, z_t − x⋆⟩ = (1/2)(‖x⋆ − z_t‖₂² − ‖x⋆ − z_{t+1}‖₂² + ‖z_t − z_{t+1}‖₂²).
Using the fact that z_{t+1} = z_t − α∇f(x_{t+1}), we can therefore write
⟨α∇f(x_{t+1}), z_t − x⋆⟩ = (1/2)(‖x⋆ − z_t‖₂² − ‖x⋆ − z_{t+1}‖₂² + ‖z_t − z_{t+1}‖₂²).
We now use the definition of linear coupling,
x_{t+1} = (1 − τ) y_t + τ z_t   ⟹   (x_{t+1} − x⋆) = (z_t − x⋆) + ((1 − τ)/τ)(y_t − x_{t+1}),
to write
⟨∇f(x_{t+1}), x_{t+1} − x⋆⟩
  = ⟨∇f(x_{t+1}), z_t − x⋆⟩ + ((1 − τ)/τ) ⟨∇f(x_{t+1}), y_t − x_{t+1}⟩
  = (1/(2α))(‖x⋆ − z_t‖₂² − ‖x⋆ − z_{t+1}‖₂² + ‖z_t − z_{t+1}‖₂²) + ((1 − τ)/τ) ⟨∇f(x_{t+1}), y_t − x_{t+1}⟩
  ≤ (1/(2α))(‖x⋆ − z_t‖₂² − ‖x⋆ − z_{t+1}‖₂² + ‖z_t − z_{t+1}‖₂²) + ((1 − τ)/τ)(f(y_t) − f(x_{t+1})),
where the inequality follows by convexity. Since, again by convexity,
f⋆ ≥ f(x_{t+1}) + ⟨∇f(x_{t+1}), x⋆ − x_{t+1}⟩   ⟺   ⟨∇f(x_{t+1}), x_{t+1} − x⋆⟩ ≥ f(x_{t+1}) − f⋆,
we obtain the statement.
Just like in the proof in Lecture 12, we can use the fact that z_{t+1} = z_t − α∇f(x_{t+1}) in the
coupled Euclidean mirror descent lemma above to find that
f(x_{t+1}) − f⋆
  ≤ (1/(2α))(‖x⋆ − z_t‖₂² − ‖x⋆ − z_{t+1}‖₂²) + (α/2)‖∇f(x_{t+1})‖₂² + ((1 − τ)/τ)(f(y_t) − f(x_{t+1}))
  ≤ (1/(2α))(‖x⋆ − z_t‖₂² − ‖x⋆ − z_{t+1}‖₂²) + αL(f(x_{t+1}) − f(y_{t+1})) + ((1 − τ)/τ)(f(y_t) − f(x_{t+1})),
where we used the coupled gradient descent lemma in the second inequality. Now, if τ is
chosen such that
(1 − τ)/τ = αL,   that is,   τ = 1/(1 + αL),
the previous inequality simplifies into the following result.
Theorem L13.4 (Coupling, Lemma 3.2 in [AO17]). Pick τ such that (1 − τ)/τ = αL. Then,
at all times t,
f(x_{t+1}) − f⋆ ≤ (1/(2α))(‖x⋆ − z_t‖₂² − ‖x⋆ − z_{t+1}‖₂²) + αL(f(y_t) − f(y_{t+1})).
L13.2.3 Putting the pieces together
We are now ready to perform the telescoping step that we saw in Lecture 12 using the new
coupled variant of the Euclidean mirror descent lemma given in Theorem L13.4. Specifically,
we can prove the following.
Theorem L13.5. Let α and τ be defined so that
α ≔ ‖x⋆ − z_0‖₂ / √(2L(f(x_0) − f⋆)),   τ ≔ 1/(1 + αL).
Then, Allen-Zhu and Orecchia's accelerated gradient descent finds at least one iterate
x_t such that
f(x_t) − f⋆ ≤ (1/2)(f(x_0) − f⋆)   within   T_half ≔ 2‖x⋆ − x_0‖₂ √(2L / (f(x_0) − f⋆))   iterations.
Proof. Averaging the inequality in Theorem L13.4 over t = 0, 1, ..., T − 1, we obtain
(1/T) Σ_{t=1}^{T} (f(x_t) − f⋆) ≤ (1/(2Tα))(‖x⋆ − z_0‖₂² − ‖x⋆ − z_T‖₂²) + (αL/T)(f(y_0) − f(y_T))
  ≤ (1/(2Tα))‖x⋆ − z_0‖₂² + (αL/T)(f(y_0) − f⋆).
Plugging in the proposed values of α and T_half (and recalling that y_0 = z_0 = x_0), each of the
two terms on the right-hand side evaluates to (f(x_0) − f⋆)/4, so the average, and hence the
minimum, of the gaps f(x_t) − f⋆ over t = 1, ..., T_half is at most (f(x_0) − f⋆)/2, which verifies
the statement.
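As an illustrative sanity check (on a simple least-squares objective, which is convex and L-smooth; all names and data are made up), the following self-contained script runs the coupled updates with the parameters of Theorem L13.5 and verifies that the optimality gap indeed halves within the stated number of iterations.

```python
import numpy as np

# Least-squares objective f(x) = 0.5 * ||A x - b||^2, minimized at the
# least-squares solution; L is the largest eigenvalue of A^T A.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f_star = f(x_star)

x0 = np.zeros(50)
gap0 = f(x0) - f_star
dist0 = np.linalg.norm(x0 - x_star)

# Parameters and halving-time bound from Theorem L13.5.
alpha = dist0 / np.sqrt(2 * L * gap0)
tau = 1 / (1 + alpha * L)
T_half = int(np.ceil(2 * dist0 * np.sqrt(2 * L / gap0)))

x = y = z = x0.copy()
first_halving = None
for t in range(1, T_half + 1):
    x = (1 - tau) * y + tau * z
    g = grad(x)
    y = x - g / L
    z = z - alpha * g
    if first_halving is None and f(x) - f_star <= gap0 / 2:
        first_halving = t

print("halving-time bound from the theorem:", T_half)
print("first iterate with halved gap:      ", first_halving)
```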
L13.3 Final remarks
The idea of momentum, that is, tracking the sum or average of all past gradients and using it
when defining the next iterate, is extremely useful in machine learning, and it extends well
beyond gradient descent.
While the algorithm above is theoretically sound and interpolates with the momentum term, in
practice, people like to rewrite the update step to use momentum directly as
x_{t+1} ≔ x_t − η g_t,   where   g_t ≔ μ Σ_{s=1}^{t} (1 − μ)^{t−s} ∇f(x_s).
This is essentially what happens when you call optim.SGD(nesterov=True) in PyTorch.
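For completeness, a minimal usage sketch (the objective, data, and hyperparameters below are arbitrary; only the optimizer call is the point):

```python
import torch

# Arbitrary least-squares objective 0.5 * ||A x - b||^2 on random data.
torch.manual_seed(0)
A, b = torch.randn(50, 10), torch.randn(50)
x = torch.zeros(10, requires_grad=True)

# Nesterov momentum in torch.optim.SGD requires momentum > 0 and dampening = 0.
opt = torch.optim.SGD([x], lr=1e-2, momentum=0.9, nesterov=True)

for step in range(200):
    opt.zero_grad()
    loss = 0.5 * torch.sum((A @ x - b) ** 2)
    loss.backward()   # accumulates the gradient into x.grad
    opt.step()        # momentum (Nesterov) update of x
```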
Bibliography for this lecture
[Nes83] Nesterov, Y. (1983). A method for solving the convex programming problem
with convergence rate O(1/k²). Proceedings of the USSR Academy of Sciences.
[SBC16] Su, W., Boyd, S., & Candès, E. J. (2016). A Differential Equation for Model-
ing Nesterov's Accelerated Gradient Method: Theory and Insights. Journal of
Machine Learning Research, 17(153), 1–43. https://jmlr.org/papers/v17/15-
084.html
[KBB15] Krichene, W., Bayen, A., & Bartlett, P. L. (2015). Accelerated Mirror Descent
in Continuous and Discrete Time. Advances in Neural Information Processing
Systems, 28.
[WWJ16] Wibisono, A., Wilson, A. C., & Jordan, M. I. (2016). A variational perspective
on accelerated methods in optimization. Proc. Natl. Acad. Sci. U.S.A., 113(47),
E7351–E7358. https://doi.org/10.1073/pnas.1614734113
[BLS15] Bubeck, S., Lee, Y. T., & Singh, M. (2015, June). A geometric alternative to
Nesterov's accelerated gradient descent. https://doi.org/10.48550/arXiv.1506.
08187
[WA18] Wang, J.-K., & Abernethy, J. D. (2018). Acceleration through Optimistic No-
Regret Dynamics. Advances in Neural Information Processing Systems, 31.
[CST21] Cohen, M. B., Sidford, A., & Tian, K. (2021). Relative Lipschitzness in Extra-
gradient Methods and a Direct Recipe for Acceleration. 12th Innovations in
Theoretical Computer Science Conference (ITCS 2021).
[AS22] Ahn, K., & Sra, S. (2022). Understanding Nesterov's Acceleration via Proximal
Point Method. Symposium on Simplicity in Algorithms (SOSA), 117–130.
[AO17] Allen-Zhu, Z., & Orecchia, L. (2017). Linear Coupling: An Ultimate Unification
of Gradient and Mirror Descent. 8th Innovations in Theoretical Computer
Science Conference (ITCS 2017).
These notes are class material that has not undergone formal peer review. The TAs and I are grateful
for any reports of typos.
¹Theorem L12.6 of Lecture 12 asserts that the t-th iterate produced by the gradient descent algorithm
run with stepsize η = 1/L satisfies f(x_t) − f⋆ ≤ L‖x_0 − x⋆‖₂² / (2t). So, after t = L‖x_0 − x⋆‖₂² / (f(x_0) − f⋆) iterations, we have
f(x_t) − f⋆ ≤ (f(x_0) − f⋆)/2.
