MIT 6.7220/15.084 — Nonlinear Optimization (Spring ‘25) Thu, Apr 24th 2025
Lecture 18
Adaptive preconditioning: AdaGrad and ADAM
Instructor: Prof. Gabriele Farina ( gfarina@mit.edu)★
The affine-invariance property of Hessian preconditioning is extremely desirable. For
example, consider the case of an optimization problem in which we have variables with
physical meaning: maybe we are trying to design a bridge, and we have variables that
denote lengths, wind strengths, weights, et cetera. In our modeling of the problem, we might
have decided all length variables are in meters. If we now were to decide to use centimeters
instead, we would like the optimization algorithm to be invariant to this change, that is,
produce the same sequence of iterates regardless of the units we use. While this would be
guaranteed by the Hessian preconditioner, the same cannot be said for gradient descent.
Indeed, let us consider what would happen if we were to move all of our length variables
from meters to centimeters. Since a centimeter is a hundredth of a meter, the objective is
now 100 times less sensitive to a change in length. This means that the gradient of the
objective with respect to the new parameterization of lengths would now be 100 times
smaller than before. And yet, all values of the length variables would be 100 times larger
than before, producing a net effect of slowing down the gradient descent update on each of
the variables by a factor of 10,000 if using the same learning rate! Adjusting the learning
rate might compensate for this, at the expense of now affecting the speed of change of
all other variables, even if they had been left untouched by the “meter → centimeter”
reparameterization.
Example. As a small numerical example, consider the objective function (say, parameterized in meters) $f(x, y) = \frac{1}{2}x^2 + \frac{1}{2}y^2$. After a change of units of $x$, consider now the reparameterized objective
$$g(x', y) = f\!\left(\frac{x'}{\sqrt{2}},\, y\right) = \frac{1}{4}x'^2 + \frac{1}{2}y^2,$$
[Figure: contour plots of $f(x, y)$ (left) and $g(x', y)$ (right) over $[-2, 2]^2$.]
plotted on the right. The two plots show contour lines for the respective objectives,
together with an identical initial point (up to reparameterization). The red arrows show
the gradient descent direction at the initial point. It is clear that after one step of
gradient descent, the two points will no longer be equivalent. ■
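The computation below makes this concrete: it is a minimal NumPy sketch of one gradient descent step on each parameterization. The initial point $(1, 1)$ and the stepsize $0.1$ are arbitrary choices for illustration.

```python
import numpy as np

eta = 0.1                      # learning rate (arbitrary, for illustration)
x, y = 1.0, 1.0                # initial point in the original ("meters") units
xp, yp = np.sqrt(2) * x, y     # same point after the reparameterization x' = sqrt(2) * x

# One gradient descent step on f(x, y) = 0.5 x^2 + 0.5 y^2
grad_f = np.array([x, y])
x_new, y_new = np.array([x, y]) - eta * grad_f

# One gradient descent step on g(x', y) = 0.25 x'^2 + 0.5 y^2
grad_g = np.array([0.5 * xp, yp])
xp_new, yp_new = np.array([xp, yp]) - eta * grad_g

# Map the second iterate back to the original units via x = x' / sqrt(2)
print("step on f:                 ", (x_new, y_new))                  # (0.9, 0.9)
print("step on g (back in meters):", (xp_new / np.sqrt(2), yp_new))   # (0.95, 0.9)
```

The $x$ coordinates of the two iterates already disagree after a single step, confirming the lack of invariance.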
Today’s lecture. In this lecture, we look at preconditioning algorithms that are not based
on using the Hessian. Rather, they are based on the idea of adapting the learning rate for
each parameter based on the historical gradients. Compared to the Hessian preconditioner,
the approach in this lecture is significantly more scalable, and it is particularly useful for
large-scale optimization problems.
In fact, the algorithms we will discuss today currently include the most-used first-order
optimization algorithm in machine learning, ADAM.
L18.1 The AdaGrad algorithm
The AdaGrad algorithm—introduced by Duchi, J., Hazan, E., & Singer, Y. [DHS11]—is a
gradient-based optimization algorithm that adapts the learning rate for each variable based
on the historical gradients.¹ The main idea behind AdaGrad is to scale the learning rate of
each variable based on the sum of the squared gradients accumulated over time. This allows
AdaGrad to give smaller learning rates to frequently updated variables and larger learning
rates to variables with infrequent updates. Going back to the example of the bridge design,
this means that if we were to change the units of the length variables, AdaGrad would
automatically adjust the learning rates to compensate for the change in scale.
In particular, at each iteration 𝑡, AdaGrad keeps a tally of the sum of the squared gradients
up to time 𝑡 for each variable. This is done by maintaining a vector 𝑠𝑡 of components
$$[s_t]_i \coloneqq \sqrt{\sum_{\tau=0}^{t} [\nabla f(x_\tau)]_i^2}\,,$$
where [∇𝑓(𝑥𝑡)]𝑖 is the 𝑖-th component of ∇𝑓(𝑥𝑡) at time 𝑡. AdaGrad’s update rule is then
$$x_{t+1} = x_t - \eta M_t^{-1} \nabla f(x_t), \qquad \text{where } M_t \coloneqq \operatorname{diag}\left(\sqrt{\sum_{\tau=0}^{t} [\nabla f(x_\tau)]_i^2} \,:\, i = 1, \ldots, n\right). \tag{AdaGrad}$$
We assume that [∇𝑓(𝑥0)]𝑖 ≠ 0 for all 𝑖, so that 𝑀𝑡 is invertible at all times 𝑡 = 0, 1, 2, ....
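The following is a minimal Python/NumPy sketch of the (deterministic, unconstrained) update above; the function name and the small test problem are illustrative choices rather than part of the algorithm's specification.

```python
import numpy as np

def adagrad(grad, x0, eta, num_iters):
    """Minimal AdaGrad sketch following the (AdaGrad) update above.

    grad: function returning the gradient of f at a point; x0: starting point;
    eta: stepsize. No epsilon smoothing is added, so, as in the text, the first
    gradient is assumed to be nonzero in every coordinate.
    """
    x = np.array(x0, dtype=float)
    sum_sq = np.zeros_like(x)          # running sum of squared gradients, per coordinate
    iterates = [x.copy()]
    for _ in range(num_iters):
        g = grad(x)
        sum_sq += g ** 2               # accumulate [grad f(x_tau)]_i^2
        s = np.sqrt(sum_sq)            # s_t, the diagonal of M_t
        x = x - eta * g / s            # x_{t+1} = x_t - eta * M_t^{-1} grad f(x_t)
        iterates.append(x.copy())
    return iterates

# Illustrative run on f(x, y) = 0.5 x^2 + 0.5 y^2, whose gradient is the identity map.
iterates = adagrad(lambda z: z, x0=[1.0, 1.0], eta=0.5, num_iters=100)
```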
Remark L18.1. The same algorithm can be used in the stochastic setting, where as usual
the gradient is replaced by a stochastic gradient,
$$x_{t+1} = x_t - \eta M_t^{-1} \tilde{\nabla} f(x_t), \qquad \text{where } M_t \coloneqq \operatorname{diag}\left(\sqrt{\sum_{\tau=0}^{t} [\tilde{\nabla} f(x_\tau)]_i^2} \,:\, i = 1, \ldots, n\right).$$
It can also be used in the projected setting, where the update is projected onto a feasible
set. For simplicity, in this lecture, we will focus on the deterministic, unconstrained
setting for our analysis.
L18.2 ADAM: AdaGrad with momentum
In practice, people often use a variant of AdaGrad called ADAM, introduced by Kingma,
D. P., & Ba, J. [KB15]. ADAM combines the adaptive learning rate of AdaGrad with the
idea of momentum we already saw in Lecture 13. In particular, at each iteration 𝑡, ADAM
keeps track of the momentum (discounted sum of past gradients)
$$g_t = \gamma g_{t-1} + (1 - \gamma)\nabla f(x_t); \qquad g_{-1} \coloneqq 0.$$
The scaling factors 𝑠𝑡 are also accumulated with a discount rate 𝛽 as
$$[s_t]_i^2 = \beta [s_{t-1}]_i^2 + (1 - \beta)[\nabla f(x_t)]_i^2, \qquad i = 1, \ldots, n; \qquad s_{-1} \coloneqq 0.$$
Finally, ADAM updates the iterate as follows:
$$x_{t+1} = x_t - \eta\, M_t^{-1} g_t, \qquad \text{where } M_t \coloneqq \operatorname{diag}(s_t). \tag{ADAM}$$
The hyperparameters $\eta$, $\gamma$, and $\beta$ in ADAM are typically set to 0.001, 0.9, and 0.999, respectively (this is PyTorch’s default behavior).
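Below is a short NumPy sketch of the update exactly as written above. It omits the bias-correction terms used in the original paper [KB15] and in PyTorch’s implementation, and the small constant `eps` is added only for numerical safety when dividing by $s_t$; both are deviations from the display above.

```python
import numpy as np

def adam(grad, x0, eta=0.001, gamma=0.9, beta=0.999, num_iters=1000, eps=1e-8):
    """Sketch of ADAM as presented in these notes (no bias correction)."""
    x = np.array(x0, dtype=float)
    g_momentum = np.zeros_like(x)      # g_t: discounted sum of past gradients
    s_sq = np.zeros_like(x)            # [s_t]_i^2: discounted sum of squared gradients
    for _ in range(num_iters):
        g = grad(x)
        g_momentum = gamma * g_momentum + (1 - gamma) * g
        s_sq = beta * s_sq + (1 - beta) * g ** 2
        x = x - eta * g_momentum / (np.sqrt(s_sq) + eps)   # x_{t+1} = x_t - eta * M_t^{-1} g_t
    return x
```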
Remark L18.2. The ADAM algorithm is widely used in practice and is known to work
well for a wide range of optimization problems. It is particularly useful for training deep
neural networks. However, ADAM does not have theoretical guarantees like AdaGrad.
It is even known to diverge in some cases, even with convex objectives [RKK18].
L18.3 Analysis of AdaGrad
In this section, we will analyze the AdaGrad algorithm. For simplicity, we will focus on
the non-stochastic version, though the analysis in the presence of stochastic gradients is
analogous.
The main result we will prove is that AdaGrad is competitive with the best possible
preconditioner in hindsight, as we make precise in the next theorem.
Theorem L18.1. Let 𝑓 : ℝ𝑛 → ℝ be a convex and differentiable function. AdaGrad is
competitive with the best preconditioner in hindsight. More precisely, for any choice of
coefficients 𝜆𝑖 ≥ 0, 𝑖 = 1, ..., 𝑛, the AdaGrad algorithm with stepsize 𝜂 = 𝐷/√2 satisfies
$$\frac{1}{T}\sum_{t=0}^{T-1} f(x_t) \;\le\; f(x^\star) + \frac{\sqrt{2n}\,D}{T}\sqrt{\min_{\lambda \in \mathbb{R}^n_{\ge 0},\, \|\lambda\|_1 = n}\; \sum_{t=0}^{T-1} \nabla f(x_t)^\top \operatorname{diag}(\lambda)^{-1} \nabla f(x_t)},$$
where $D \coloneqq \max_{t=0,\ldots,T} \|x_t - x^\star\|_\infty$ is the maximum distance from the optimal solution at all times $T$.
In the rest of this section, we prove the above result.
L18.3.1 AdaGrad as an instance of mirror descent
The main idea behind the proof is to show that AdaGrad is a form of mirror descent algo-
rithm (Lecture 14), with the twist that the distance-generating function is time-dependent.
In particular, we will use as our distance-generating function the (strongly convex) function
$$\varphi_t(x) \coloneqq \frac{1}{2} x^\top M_t x.$$
The induced Bregman divergence is
$$\mathrm{D}_{\varphi_t}(x \,\|\, y) = \varphi_t(x) - \varphi_t(y) - \langle \nabla\varphi_t(y), x - y\rangle = \frac{1}{2}(x - y)^\top M_t (x - y).$$
(The above is a generalization of the result we saw in Lecture 14, which established that
the Bregman divergence induced by the squared Euclidean norm is the squared Euclidean
distance.)
We make the connection formal in the following lemma.
Theorem L18.2. The AdaGrad update rule is equivalent to the mirror descent update rule with the distance-generating function $\varphi_t(x) = \frac{1}{2} x^\top M_t x$, where $M_t \coloneqq \operatorname{diag}(s_t)$.
Proof. Remember that the mirror descent update rule is
$$x_{t+1} = \arg\min_{x \in \mathbb{R}^n} \Bigl\{ \eta \langle \nabla f(x_t), x\rangle + \mathrm{D}_{\varphi_t}(x \,\|\, x_t) \Bigr\} = \arg\min_{x \in \mathbb{R}^n} \Bigl\{ \eta \langle \nabla f(x_t), x\rangle + \frac{1}{2}(x - x_t)^\top M_t (x - x_t) \Bigr\}.$$
Setting the gradient of the above objective to zero and solving for $x$ yields
$$\eta \nabla f(x_t) + M_t(x - x_t) = 0 \implies x = x_t - \eta M_t^{-1} \nabla f(x_t),$$
which is exactly the AdaGrad update rule. □
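As a quick numerical sanity check of this equivalence, one can solve the mirror descent subproblem with a generic solver and compare it against the closed-form AdaGrad step. The data below (point, gradient, diagonal of $M_t$) is random and purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, eta = 4, 0.1
x_t = rng.normal(size=n)
grad_t = rng.normal(size=n)
s_t = np.abs(rng.normal(size=n)) + 0.1   # positive diagonal of M_t (hypothetical values)
M_t = np.diag(s_t)

# Mirror descent step: argmin_x { eta*<grad, x> + 0.5*(x - x_t)^T M_t (x - x_t) }
objective = lambda x: eta * grad_t @ x + 0.5 * (x - x_t) @ M_t @ (x - x_t)
x_md = minimize(objective, x_t).x

# AdaGrad step in closed form: x_t - eta * M_t^{-1} grad
x_adagrad = x_t - eta * grad_t / s_t

print(np.allclose(x_md, x_adagrad, atol=1e-5))   # True, up to solver tolerance
```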
From the mirror descent lemma we saw in Lecture 14, we can write
$$f(x_t) \;\le\; f(x^\star) + \frac{1}{\eta}\Bigl( \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_t) - \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_{t+1}) + \mathrm{D}_{\varphi_t}(x_t \,\|\, x_{t+1}) \Bigr).$$
Summing the above inequalities over 𝑡 = 0, 1, ..., 𝑇 − 1 yields
$$\sum_{t=0}^{T-1} f(x_t) \;\le\; T f(x^\star) + \frac{1}{\eta}\left( \sum_{t=0}^{T-1} \Bigl( \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_t) - \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_{t+1}) \Bigr) + \sum_{t=0}^{T-1} \mathrm{D}_{\varphi_t}(x_t \,\|\, x_{t+1}) \right)$$
$$\le\; T f(x^\star) + \frac{1}{\eta}\Biggl( \underbrace{\sum_{t=0}^{T-1} \Bigl( \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_t) - \mathrm{D}_{\varphi_{t-1}}(x^\star \,\|\, x_t) \Bigr)}_{\text{A}} \;+\; \underbrace{\sum_{t=0}^{T-1} \mathrm{D}_{\varphi_t}(x_t \,\|\, x_{t+1})}_{\text{B}} \Biggr),$$
where the second inequality follows by reindexing the negative Bregman terms from $t$ to $t-1$ and dropping the leftover term $-\mathrm{D}_{\varphi_{T-1}}(x^\star \,\|\, x_T) \le 0$,
under the convention that 𝑠−1 ≔ 0 (that is, 𝑀−1 ≔ 0 and 𝜑−1 is the identically zero
function). We will now proceed, in the next two subsections, to bound the two summations
A and B separately.
L18.3.2 Bounding the “almost-telescopic” terms A
Theorem L18.3. Let $T$ be arbitrary and assume that $D \coloneqq \max_{t=0,\ldots,T} \|x_t - x^\star\|_\infty$ is finite. Then, at all times $T$, the sum A satisfies the inequality
$$\sum_{t=0}^{T-1} \Bigl( \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_t) - \mathrm{D}_{\varphi_{t-1}}(x^\star \,\|\, x_t) \Bigr) \;\le\; \frac{D^2}{2}\,\|s_{T-1}\|_1.$$
Proof. From direct verification,
$$\mathrm{D}_{\varphi_t}(x^\star \,\|\, x_t) - \mathrm{D}_{\varphi_{t-1}}(x^\star \,\|\, x_t) = \frac{1}{2}(x_t - x^\star)^\top (M_t - M_{t-1})(x_t - x^\star).$$
Using now the definition of 𝐷 together with the Cauchy-Schwarz inequality, we can write
$$\frac{1}{2}(x_t - x^\star)^\top (M_t - M_{t-1})(x_t - x^\star) \;\le\; \frac{1}{2}\,\|x_t - x^\star\|_\infty^2\, \|s_t - s_{t-1}\|_1 \;\le\; \frac{D^2}{2}\,\|s_t - s_{t-1}\|_1.$$
On the other hand, since 𝑠𝑡 ≥ 𝑠𝑡−1 ≥ 0 componentwise at all times 𝑡, then ‖𝑠𝑡 − 𝑠𝑡−1‖1 =
‖𝑠𝑡‖1 − ‖𝑠𝑡−1‖1 and we can write
$$\sum_{t=0}^{T-1} \Bigl( \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_t) - \mathrm{D}_{\varphi_{t-1}}(x^\star \,\|\, x_t) \Bigr) \;\le\; \frac{D^2}{2}\sum_{t=0}^{T-1}\bigl( \|s_t\|_1 - \|s_{t-1}\|_1 \bigr) = \frac{D^2}{2}\bigl( \|s_{T-1}\|_1 - \|s_{-1}\|_1 \bigr) = \frac{D^2}{2}\,\|s_{T-1}\|_1,$$
which is the statement. □
L18.3.3 Bounding the summation B
We now shift our attention to the summation B, where we establish the following bound.
Theorem L18.4. At all times 𝑇 , the sum B satisfies the inequality
$$\sum_{t=0}^{T-1} \mathrm{D}_{\varphi_t}(x_t \,\|\, x_{t+1}) \;\le\; \eta^2\,\|s_{T-1}\|_1.$$
Proof. Expanding the expression for the Bregman divergence $\mathrm{D}_{\varphi_t}(x \,\|\, y) = \frac{1}{2}(x - y)^\top M_t (x - y)$, we have
$$\mathrm{D}_{\varphi_t}(x_t \,\|\, x_{t+1}) = \frac{1}{2}(x_t - x_{t+1})^\top M_t (x_t - x_{t+1}) = \frac{\eta^2}{2}\,\nabla f(x_t)^\top M_t^{-1} \nabla f(x_t).$$
Using the definition of $M_t \coloneqq \operatorname{diag}(s_t)$, where $[s_t]_i \coloneqq \sqrt{\sum_{\tau=0}^{t} [\nabla f(x_\tau)]_i^2}$, we can then write
$$\nabla f(x_t)^\top M_t^{-1}\nabla f(x_t) = \sum_{i=1}^{n} \frac{[\nabla f(x_t)]_i^2}{[s_t]_i} = \sum_{i=1}^{n} \frac{[s_t]_i^2 - [s_{t-1}]_i^2}{[s_t]_i} \;\le\; 2\sum_{i=1}^{n} \frac{[s_t]_i^2 - [s_{t-1}]_i^2}{[s_t]_i + [s_{t-1}]_i} = 2\sum_{i=1}^{n} \bigl([s_t]_i - [s_{t-1}]_i\bigr) = 2\bigl(\|s_t\|_1 - \|s_{t-1}\|_1\bigr),$$
where the inequality uses $[s_t]_i + [s_{t-1}]_i \le 2[s_t]_i$, which holds since $0 \le [s_{t-1}]_i \le [s_t]_i$.
Summing over 𝑡 = 0, 1, ..., 𝑇 − 1 yields the result. □
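A small numerical check of the per-step bound $\nabla f(x_t)^\top M_t^{-1}\nabla f(x_t) \le 2(\|s_t\|_1 - \|s_{t-1}\|_1)$ used in the proof, with random vectors standing in for the gradients (purely illustrative data):

```python
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(size=(10, 5))        # T = 10 "gradients" in R^5

sum_sq = np.zeros(5)
s_prev = np.zeros(5)
for g in grads:
    sum_sq += g ** 2
    s = np.sqrt(sum_sq)                 # s_t
    lhs = np.sum(g ** 2 / s)            # grad^T M_t^{-1} grad
    rhs = 2 * (np.sum(s) - np.sum(s_prev))
    assert lhs <= rhs + 1e-12           # per-step bound from the proof
    s_prev = s
print("per-step bound verified on random data")
```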
L18.3.4 Finale: Bounding the norm of the scaling factors
The two bounds above show that
$$\frac{1}{T}\sum_{t=0}^{T-1} f(x_t) \;\le\; f(x^\star) + \frac{1}{T}\left(\frac{D^2}{2\eta} + \eta\right)\|s_{T-1}\|_1.$$
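The stepsize chosen next is the one minimizing the factor $\frac{D^2}{2\eta} + \eta$: by the arithmetic-geometric mean inequality,
$$\frac{D^2}{2\eta} + \eta \;\ge\; 2\sqrt{\frac{D^2}{2\eta}\cdot \eta} \;=\; \sqrt{2}\,D, \qquad \text{with equality exactly when } \frac{D^2}{2\eta} = \eta, \text{ that is, } \eta = \frac{D}{\sqrt{2}}.$$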
Hence, setting 𝜂 = 𝐷/√2 yields
$$\frac{1}{T}\sum_{t=0}^{T-1} f(x_t) \;\le\; f(x^\star) + \frac{\sqrt{2}\,D}{T}\,\|s_{T-1}\|_1. \tag{3}$$
So, to complete the proof, we only need to provide a bound on the norm of the scaling
factors 𝑠𝑇 −1. This is the content of the following theorem.
Theorem L18.5. The norm of the scaling factors $s_{T-1}$ satisfies the inequality
$$\|s_{T-1}\|_1 \;\le\; \sqrt{n}\cdot\sqrt{\min_{\lambda \in \mathbb{R}^n_{\ge 0},\, \|\lambda\|_1 = n}\; \sum_{t=0}^{T-1} \nabla f(x_t)^\top \operatorname{diag}(\lambda)^{-1}\nabla f(x_t)}.$$
Proof. Pick any $\lambda \in \mathbb{R}^n_{\ge 0}$ with $\|\lambda\|_1 = n$; we will use the Cauchy-Schwarz inequality to bound $\|s_{T-1}\|_1^2$ as follows:
$$\|s_{T-1}\|_1^2 = \left(\sum_{i=1}^{n} \sqrt{\sum_{t=0}^{T-1} [\nabla f(x_t)]_i^2}\right)^{\!2} = \left(\sum_{i=1}^{n} \sqrt{\lambda_i}\cdot\left(\frac{1}{\sqrt{\lambda_i}}\sqrt{\sum_{t=0}^{T-1} [\nabla f(x_t)]_i^2}\right)\right)^{\!2}$$
$$\le \left(\sum_{i=1}^{n} \lambda_i\right)\cdot\left(\sum_{i=1}^{n} \frac{1}{\lambda_i}\sum_{t=0}^{T-1} [\nabla f(x_t)]_i^2\right) \qquad \text{(Cauchy-Schwarz inequality)}$$
$$= n\cdot\left(\sum_{t=0}^{T-1} \nabla f(x_t)^\top \operatorname{diag}(\lambda)^{-1}\nabla f(x_t)\right).$$
Taking square roots and using the fact that 𝜆 was arbitrary yields the statement. □
Plugging the bound of Theorem L18.5 into (3) proves Theorem L18.1.
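A quick numerical illustration of Theorem L18.5, with random vectors standing in for $\nabla f(x_0), \ldots, \nabla f(x_{T-1})$ and a randomly drawn feasible $\lambda$ (all values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 20, 6
grads = rng.normal(size=(T, n))

s_final = np.sqrt(np.sum(grads ** 2, axis=0))          # s_{T-1}
lhs = np.sum(s_final)                                  # ||s_{T-1}||_1

lam = rng.uniform(0.1, 1.0, size=n)
lam = n * lam / lam.sum()                              # lambda >= 0 with ||lambda||_1 = n
rhs = np.sqrt(n) * np.sqrt(np.sum(grads ** 2 / lam))   # sqrt(n) * sqrt(sum_t grad^T diag(lam)^{-1} grad)

print(lhs <= rhs)   # True for every feasible lambda
```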
Bibliography for this lecture
[DHS11] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods
for online learning and stochastic optimization. Journal of Machine Learning
Research, 12(7).
[KB15] Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization.
International Conference on Learning Representations (ICLR). http://arxiv.
org/abs/1412.6980
[RKK18] Reddi, S. J., Kale, S., & Kumar, S. (2018). On the convergence of Adam and
beyond. International Conference on Learning Representations (ICLR).
Changelog
• Apr 28, 2025: Fixed typo 𝜆𝑗 → 𝜆𝑖 (Thanks Alina!)
• May 11, 2025: Fixed typos (thanks Khizer!)
★These notes are class material that has not undergone formal peer review. The TAs and I are grateful
for any reports of typos.
¹Today we will consider the version of AdaGrad that only uses diagonal preconditioners, which is the
version that is typically preferred in practice for large problems, including in machine learning applications
(we will still refer to the algorithm with the generic name “AdaGrad”).