MIT 6.7220/15.084 — Nonlinear Optimization (Spring ’25) Apr 8–10th 2025
Lecture 14
Projected gradient descent and mirror descent
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)★
We continue our exploration of first-order methods by considering the case of constrained optimization problems of the form
$$\min_{x} \;\; f(x) \quad \text{s.t.} \;\; x \in \Omega \subseteq \mathbb{R}^n,$$
where $f$ is differentiable and $\Omega$ is a closed and convex set (with only one technical exception that we will highlight later).
L14.1 Projected gradient descent
When applied without modifications to a constrained optimization problem, the gradient descent algorithm might quickly produce iterates $x_t$ that leave the feasible set. The idea behind projected gradient descent is very intuitive: to avoid infeasible iterates, project the iterates of gradient descent back onto $\Omega$ at every iteration. This leads to the update rule
$$x_{t+1} \coloneqq \Pi_\Omega(x_t - \eta \nabla f(x_t)), \tag{1}$$
where the operation $\Pi_\Omega$ denotes Euclidean projection onto $\Omega$. (Remember that projection onto a closed convex set exists and is unique.)
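To make update (1) concrete, here is a minimal Python sketch (our illustration, not part of the original notes). It assumes $\Omega$ is the Euclidean unit ball, for which the projection has the closed form $\Pi_\Omega(z) = z / \max\{1, \|z\|_2\}$, and runs the method on a small quadratic.

```python
import numpy as np

def project_ball(z, radius=1.0):
    # Euclidean projection onto {x : ||x||_2 <= radius}: rescale z
    # toward the origin whenever it lies outside the ball.
    norm = np.linalg.norm(z)
    return z if norm <= radius else (radius / norm) * z

def projected_gradient_descent(grad_f, project, x0, eta, num_iters):
    # Update rule (1): x_{t+1} = Pi_Omega(x_t - eta * grad f(x_t)).
    x = x0
    for _ in range(num_iters):
        x = project(x - eta * grad_f(x))
    return x

# Toy problem: minimize f(x) = 0.5 * ||x - c||_2^2 over the unit ball,
# with c outside the ball, so the minimizer is c / ||c||_2.
c = np.array([2.0, 1.0])
x = projected_gradient_descent(lambda v: v - c, project_ball,
                               np.zeros(2), eta=0.5, num_iters=100)
print(x, c / np.linalg.norm(c))  # the two vectors should nearly coincide
```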
As it turns out, the projected gradient descent algorithm behaves fundamentally like the gradient descent algorithm. In particular, the gradient descent lemma and the Euclidean mirror descent lemma can be generalized to projected gradient descent with little effort. As a corollary, the same convergence rate of $f(x_t) - f(x^\star) \le \frac{1}{\eta t}\|x_0 - x^\star\|_2^2$ for $L$-smooth functions when $0 < \eta \le \frac{1}{L}$ applies to projected gradient descent as well.
Instead of developing the results for projected gradient descent directly, we introduce a generalization of the algorithm that is more permissive in the notion of projection used. The correctness of projected gradient descent will then follow as a direct corollary of this generalization.
L14.2 Generalized projection: The proximal step
Depending on the feasible set $\Omega$, Euclidean projections onto $\Omega$ might be expensive to compute in practice. A generalization of projected gradient descent called mirror descent affords more flexibility in the notion of distance used in the projection.
L14.2.1 Distance-generating functions (DGFs)
An interesting generalization of the distance between two points is that of a Bregman divergence (also called Bregman distance). A Bregman divergence is not a distance in the technical sense—for example, it is not necessarily symmetric, and it need not satisfy the triangle inequality.
A Bregman divergence is constructed starting from a strongly convex function $\varphi$, called the distance-generating function (DGF) for the divergence.
Definition L14.1. Let $\varphi : \Omega \to \mathbb{R}$ be a differentiable and $\mu$-strongly convex ($\mu > 0$) function with respect to a norm $\|\cdot\|$, that is, satisfying
$$\varphi(x) \ge \varphi(y) + \langle \nabla \varphi(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2 \quad \forall x, y \in \Omega.$$
The Bregman divergence centered at $y \in \Omega$ is the function $D_\varphi(x \,\|\, y)$ defined as
$$D_\varphi(x \,\|\, y) \coloneqq \varphi(x) - \varphi(y) - \langle \nabla \varphi(y), x - y \rangle.$$
Note that from its very definition it is clear that
$$D_\varphi(x \,\|\, x) = 0 \;\; \forall x \in \Omega \qquad \text{and} \qquad D_\varphi(x \,\|\, y) \ge \frac{\mu}{2}\|x - y\|^2 \;\; \forall x, y \in \Omega, \tag{2}$$
which in particular implies that $D_\varphi(x \,\|\, y) = 0$ if and only if $x = y$. We now mention two very important special cases of Bregman divergences.
• When $\Omega$ is arbitrary and the DGF $\varphi$ is set to the squared Euclidean norm
$$\varphi(x) \coloneqq \frac{1}{2}\|x\|_2^2,$$
which is 1-strongly convex with respect to $\|\cdot\|_2$, the corresponding Bregman divergence is the squared Euclidean distance
$$D_\varphi(x \,\|\, y) = \frac{1}{2}\|x - y\|_2^2.$$
Indeed, from the definition,
$$D_\varphi(x \,\|\, y) = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle y, x - y \rangle = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle y, x \rangle + \|y\|_2^2 = \frac{1}{2}\|x - y\|_2^2.$$
• When $\Omega = \mathring{\Delta}^n$ is the set of full-support distributions over $n$ objects,¹ and the distance-generating function $\varphi$ is set to the negative entropy function
$$\varphi(x) \coloneqq \sum_{i=1}^{n} x_i \log x_i,$$
which is 1-strongly convex with respect to the $\ell_1$ norm $\|\cdot\|_1$, [▷ You should check this!] the corresponding Bregman divergence is the Kullback-Leibler (KL) divergence [▷ And this too!]
$$D_\varphi(x \,\|\, y) = \sum_{i=1}^{n} x_i \log \frac{x_i}{y_i},$$
a commonly used notion of distance between distributions in machine learning and statistics. (Both special cases are checked numerically in the sketch after this list.)
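Both claims in the bullets above are easy to test numerically. The following sketch (our addition; the final bound is Pinsker's inequality, the $\mu = 1$, $\ell_1$ instance of (2)) implements $D_\varphi$ for a generic DGF and checks the two special cases at random points.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x || y) = phi(x) - phi(y) - <grad phi(y), x - y>.
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# Special case 1: squared Euclidean DGF gives D(x||y) = 0.5 * ||x - y||_2^2.
phi_euc = lambda v: 0.5 * np.dot(v, v)
grad_euc = lambda v: v

# Special case 2: negative entropy DGF on the simplex gives the KL divergence.
phi_ent = lambda v: np.sum(v * np.log(v))
grad_ent = lambda v: np.log(v) + 1.0

rng = np.random.default_rng(0)
x, y = rng.random(4), rng.random(4)
assert np.isclose(bregman(phi_euc, grad_euc, x, y), 0.5 * np.sum((x - y) ** 2))

p, q = rng.random(4), rng.random(4)
p, q = p / p.sum(), q / q.sum()                 # two full-support distributions
kl = np.sum(p * np.log(p / q))
assert np.isclose(bregman(phi_ent, grad_ent, p, q), kl)
assert kl >= 0.5 * np.sum(np.abs(p - q)) ** 2   # the mu = 1, l1 instance of (2)
```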
A useful fact about Bregman divergences is that for any center $y$, they are as strongly convex in $x$ as the original distance-generating function $\varphi$. More precisely, we have the following.
Theorem L14.1. Let $\varphi : \Omega \to \mathbb{R}$ be differentiable and $\mu$-strongly convex with respect to a norm $\|\cdot\|$. For any $y \in \Omega$, the function $x \mapsto D_\varphi(x \,\|\, y)$ is $\mu$-strongly convex with respect to $\|\cdot\|$, that is,
$$D_\varphi(x' \,\|\, y) \ge D_\varphi(x \,\|\, y) + \langle \nabla_x D_\varphi(x \,\|\, y), x' - x \rangle + \frac{\mu}{2}\|x' - x\|^2 \quad \forall x, x' \in \Omega.$$
Proof. Using the definition of the Bregman divergence $D_\varphi(\cdot \,\|\, \cdot)$, we have
$$\nabla_x D_\varphi(x \,\|\, y) = \nabla \varphi(x) - \nabla \varphi(y),$$
so after expanding the definition of $D_\varphi(\cdot \,\|\, \cdot)$ in the inequality of the statement, the statement is
$$\varphi(x') \ge \varphi(x) + \langle \nabla \varphi(y), x' - x \rangle + \langle \nabla \varphi(x) - \nabla \varphi(y), x' - x \rangle + \frac{\mu}{2}\|x' - x\|^2 \quad \forall x, x' \in \Omega,$$
that is, $\varphi(x') \ge \varphi(x) + \langle \nabla \varphi(x), x' - x \rangle + \frac{\mu}{2}\|x' - x\|^2$ for all $x, x' \in \Omega$, which follows by the assumption of $\mu$-strong convexity of $\varphi$ with respect to $\|\cdot\|$. □
L14.2.2 Proximal steps
Proximal steps generalize the steps followed by the projected gradient descent algorithm (1). The key insight is the following: instead of interpreting (1) as the projection of the point $x_t - \eta \nabla f(x_t)$, we can interpret $x_{t+1}$ as just another manifestation of the key principle of gradient descent-type algorithms: hoping that the objective can be approximated well with its first-order Taylor expansion in a neighborhood of each point. It then follows naturally that each updated point $x_{t+1}$ produced by a gradient descent-type algorithm should trade off two competing objectives:
• moving as much as possible in the direction $-\nabla f(x_t)$; and
• staying within $\Omega$, in a neighborhood centered around the point $x_t$, so as not to move too far.
The stepsize parameter $\eta > 0$ controls the tradeoff between the competing objectives.
When using a generic Bregman divergence $D_\varphi(\cdot \,\|\, \cdot)$ as the notion of distance, the tradeoff between these two competing objectives can be formalized as the proximal step problem
$$\mathrm{Prox}_\varphi(\eta \nabla f(x_t), x_t) \coloneqq \arg\min_{x} \;\; \eta \langle \nabla f(x_t), x \rangle + D_\varphi(x \,\|\, x_t) \quad \text{s.t.} \;\; x \in \Omega.$$
We show in the next subsection that proximal steps are well-defined—that is, the solution to the optimization problem above exists and is unique. This leads to the mirror descent algorithm, defined by the update
$$x_{t+1} \coloneqq \mathrm{Prox}_\varphi(\eta \nabla f(x_t), x_t). \tag{3}$$
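In code, mirror descent is gradient descent with the projection swapped out for a proximal oracle. A minimal sketch of update (3) (ours; the argument name `prox` is a hypothetical interface), with the oracle left abstract so that any DGF can be plugged in:

```python
def mirror_descent(grad_f, prox, x0, eta, num_iters):
    # Update (3): x_{t+1} = Prox_phi(eta * grad f(x_t), x_t), where
    # prox(g, x) returns argmin over z in Omega of <g, z> + D_phi(z || x).
    x = x0
    iterates = [x0]
    for _ in range(num_iters):
        x = prox(eta * grad_f(x), x)
        iterates.append(x)
    return iterates
```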
■ The Euclidean DGF recovers Euclidean projection. As a sanity check to convince ourselves that the abstraction of the proximal step is reasonable, we can verify that it generalizes the steps of projected gradient descent (1)—and therefore also of gradient descent, which is just projected gradient descent in which $\Omega = \mathbb{R}^n$. We do so in the next theorem.
Theorem L14.2. Consider the squared Euclidean norm distance-generating function $\varphi(x) = \frac{1}{2}\|x\|_2^2$. Then, proximal steps and projected gradient steps (1) are equivalent:
$$\mathrm{Prox}_\varphi(\eta \nabla f(x), x) = \Pi_\Omega(x - \eta \nabla f(x)) \quad \forall x \in \Omega.$$
Proof. The Euclidean projection problem is given by
$$\min_{y} \;\; \frac{1}{2}\|y - x + \eta \nabla f(x)\|_2^2 \quad \text{s.t.} \;\; y \in \Omega.$$
Expanding the squared Euclidean norm in the objective and removing terms that do not depend on the optimization variable $y$, we can rewrite the problem as
$$\min_{y} \;\; \frac{1}{2}\|y - x\|_2^2 + \eta \langle \nabla f(x), y - x \rangle \quad \text{s.t.} \;\; y \in \Omega,$$
which is exactly the proximal step problem, since $D_\varphi(y \,\|\, x) = \frac{1}{2}\|y - x\|_2^2$ as observed in the previous section. □
■ The negative entropy DGF recovers the softmax update. Proximal steps are very useful when computing Euclidean projections is expensive. For example, in the case of the negative entropy distance-generating function for full-support distributions, we can use the result in Lecture 2 to show that the proximal step corresponds to the softmax update
$$x_{t+1} \propto x_t \odot \exp\{-\eta \nabla f(x_t)\},$$
where $\odot$ denotes elementwise product. [▷ Try to work out the details!] Such a generalized notion of projection is significantly more practical than the algorithm for Euclidean projection you developed in Homework 1.
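For the negative-entropy DGF, the proximal oracle from the sketch above becomes a two-liner; normalizing turns the proportionality into an equality. A sketch, under the same assumptions as before:

```python
import numpy as np

def prox_entropy(g, x):
    # Entropic proximal step on the full-support simplex:
    # returns the distribution proportional to x * exp(-g).
    z = x * np.exp(-g)
    return z / z.sum()
```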
L14.2.3 Properties of proximal steps
We now mention a few important properties of proximal steps.
■ Proximal steps exist. The argument is analogous to the one we used in Lecture 1 to argue the existence of Euclidean projections. Consider a generic proximal step $\mathrm{Prox}_\varphi(g, y)$, in which the objective function to be minimized over $x \in \Omega$ is accordingly
$$h(x) \coloneqq \langle g, x \rangle + D_\varphi(x \,\|\, y).$$
The infimum of the function over $\Omega$ must be less than or equal to the value of the objective at the valid choice $x = y$.
To apply the Weierstrass theorem, we must show that we can safely restrict the domain to a compact subset. To do so, we can use the knowledge, from (2), that $D_\varphi(x \,\|\, y) \ge \frac{\mu}{2}\|x - y\|^2$ for all $x \in \Omega$, where $\mu > 0$ and $\|\cdot\|$ are the strong convexity parameter and strong convexity norm of the underlying DGF $\varphi$. The value of the increment $h(x) - h(y)$ can therefore be lower-bounded using the generalized Cauchy-Schwarz inequality as
$$h(x) - h(y) \ge \langle g, x - y \rangle + \frac{\mu}{2}\|x - y\|^2 \ge \frac{\mu}{2}\|x - y\| \left( \|x - y\| - \frac{2}{\mu}\|g\|_* \right).$$
This shows that for any point $x$ such that $\|x - y\| \ge \frac{2}{\mu}\|g\|_*$ we have $h(x) \ge h(y)$. Therefore, we can restrict the minimization of $h(x)$ to the compact set defined by the intersection of $\Omega$ with the closed ball of radius $\frac{2}{\mu}\|g\|_*$ centered at $y$, and the Weierstrass theorem guarantees the existence of a minimizer of $h$ in this compact restriction of the domain.
■ Proximal steps are unique. The objective function minimized in proximal steps is defined as the sum of a linear function plus a Bregman divergence with a fixed center. Since Bregman divergences are strongly convex by Theorem L14.1, and linear terms do not affect strong convexity, the proximal step problem minimizes a strongly convex objective on a convex set. The uniqueness of the solution is therefore guaranteed (see Lecture 4).
■ The three-point inequality for proximal steps. The following property is key in many proofs involving proximal steps. For that reason, we give it a name.
Theorem L14.3 (Three-point inequality for proximal steps). Consider a generic proximal step
$$x' = \mathrm{Prox}_\varphi(g, x).$$
Then,
$$\langle -g, y - x' \rangle \le -D_\varphi(y \,\|\, x') + D_\varphi(y \,\|\, x) - D_\varphi(x' \,\|\, x) \quad \forall y \in \Omega.$$
Proof. The objective function of the proximal step problem is given by
$$h(z) \coloneqq \langle g, z \rangle + D_\varphi(z \,\|\, x), \quad z \in \Omega.$$
The first-order optimality conditions applied to the solution $z = x'$ are therefore
$$-\nabla h(x') \in \mathcal{N}_\Omega(x') \iff -g - \nabla \varphi(x') + \nabla \varphi(x) \in \mathcal{N}_\Omega(x')$$
$$\iff \langle -g - \nabla \varphi(x') + \nabla \varphi(x), \; y - x' \rangle \le 0 \quad \forall y \in \Omega$$
$$\iff \langle -g, y - x' \rangle \le \langle \nabla \varphi(x') - \nabla \varphi(x), \; y - x' \rangle \quad \forall y \in \Omega.$$
The statement now follows by using the identity
$$\langle \nabla \varphi(x') - \nabla \varphi(x), \; y - x' \rangle = -D_\varphi(y \,\|\, x') + D_\varphi(y \,\|\, x) - D_\varphi(x' \,\|\, x),$$
which can be checked directly from the definition of Bregman divergence. [▷ Verify this!] □
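As an alternative to verifying the identity by hand, here is a quick numerical spot check (our addition), using the negative-entropy DGF so that $D_\varphi$ is the KL divergence:

```python
import numpy as np

def kl(a, b):
    # KL divergence = Bregman divergence of the negative entropy DGF.
    return np.sum(a * np.log(a / b))

grad_phi = lambda v: np.log(v) + 1.0  # gradient of sum_i v_i log v_i

rng = np.random.default_rng(1)
x, xp, y = (v / v.sum() for v in rng.random((3, 5)))  # random full-support points
lhs = np.dot(grad_phi(xp) - grad_phi(x), y - xp)
rhs = -kl(y, xp) + kl(y, x) - kl(xp, x)
assert np.isclose(lhs, rhs)  # the three-point identity holds
```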
Corollary L14.1. An important corollary of the three-point inequality is obtained by setting $y = x$. In that case, the three-point inequality simplifies to
$$\langle -g, x - x' \rangle \le -D_\varphi(x \,\|\, x') - D_\varphi(x' \,\|\, x).$$
Corollary L14.2. Continuing from Corollary L14.1, using the strong convexity bound (see Theorem L14.1) $D_\varphi(x \,\|\, x') + D_\varphi(x' \,\|\, x) \ge \mu \|x - x'\|^2$ to bound the right-hand side, and the generalized Cauchy-Schwarz inequality to bound the left-hand side, we find a bound on the norm of the proximal step:
$$\|x' - x\| \le \frac{1}{\mu}\|g\|_*.$$
L14.3 Analysis of mirror descent
As we have discussed in Lectures 12 and 13, the analysis of gradient descent (and its many variants and generalizations) typically goes through two fundamental—and conceptually complementary—lemmas:
• the gradient descent lemma, stating that
$$f(x_{t+1}) \le f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|_2^2$$
when $f$ is $L$-smooth and $0 < \eta \le \frac{1}{L}$; and
• the Euclidean mirror descent lemma, which states that
$$f(x_t) \le f(y) + \frac{1}{2\eta}\left( \|y - x_t\|_2^2 - \|y - x_{t+1}\|_2^2 + \|x_{t+1} - x_t\|_2^2 \right) \quad \forall y \in \mathbb{R}^n$$
for convex $f$ and arbitrary stepsize $\eta > 0$.
We now show that with little effort, we can generalize those results to the case of the mirror
descent algorithm.
L14.3.1 Generalizing the gradient descent lemma
We start by generalizing the gradient descent lemma.
Theorem L14.4. Let $f : \Omega \to \mathbb{R}$ be $L$-smooth with respect to the norm $\|\cdot\|$ for which $\varphi$ is $\mu$-strongly convex, and let $0 < \eta \le \frac{\mu}{L}$. Each step of the mirror descent algorithm (3) satisfies
$$f(x_{t+1}) \le f(x_t) - \frac{\mu}{2\eta}\|x_t - x_{t+1}\|^2.$$
Proof. From the quadratic upper bound, we have
$$f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_t - x_{t+1}\|^2.$$
Using Corollary L14.1 (with $g = \eta \nabla f(x_t)$, $x = x_t$, and $x' = x_{t+1}$) we therefore find
$$f(x_{t+1}) \le f(x_t) - \frac{1}{\eta} D_\varphi(x_{t+1} \,\|\, x_t) - \frac{1}{\eta} D_\varphi(x_t \,\|\, x_{t+1}) + \frac{L}{2}\|x_t - x_{t+1}\|^2$$
$$\le f(x_t) + \left( \frac{L}{2} - \frac{\mu}{\eta} \right) \|x_t - x_{t+1}\|^2$$
$$\le f(x_t) - \frac{\mu}{2\eta}\|x_t - x_{t+1}\|^2,$$
which is the statement. □
As expected, we find that the decrease in function value is monotonic, just like in the
unconstrained case.
L14.3.2 The “full” mirror descent lemma
We continue by generalizing the Euclidean mirror descent lemma to its fully general version for arbitrary Bregman divergences. In particular, from Theorem L14.3 we have the following.
Theorem L14.5. Let $f : \Omega \to \mathbb{R}$ be convex. Each step of the mirror descent algorithm (3) satisfies, for all $y \in \Omega$,
$$f(x_t) \le f(y) + \langle \nabla f(x_t), x_t - x_{t+1} \rangle - \frac{1}{\eta} D_\varphi(y \,\|\, x_{t+1}) + \frac{1}{\eta} D_\varphi(y \,\|\, x_t) - \frac{1}{\eta} D_\varphi(x_{t+1} \,\|\, x_t).$$
Proof. Using the linear lower bound property of convex functions (Lecture 4), we can write
$$f(x_t) \le f(y) - \langle \nabla f(x_t), y - x_t \rangle = f(y) + \langle \nabla f(x_t), x_t - x_{t+1} \rangle - \langle \nabla f(x_t), y - x_{t+1} \rangle.$$
On the other hand, from Theorem L14.3 applied to the mirror descent step (that is, for the choices $g = \eta \nabla f(x_t)$, $x' = x_{t+1}$, $x = x_t$), we have
$$-\eta \langle \nabla f(x_t), y - x_{t+1} \rangle \le -D_\varphi(y \,\|\, x_{t+1}) + D_\varphi(y \,\|\, x_t) - D_\varphi(x_{t+1} \,\|\, x_t).$$
Hence, by dividing by $\eta$ and substituting into the previous inequality, we obtain the statement. □
Observe that when $\varphi$ is the squared Euclidean norm DGF, we recover exactly the Euclidean mirror descent lemma in the unconstrained case, upon substituting $\nabla f(x_t) = \frac{1}{\eta}(x_t - x_{t+1})$.
L14.3.3 Convergence guarantees for $L$-smooth functions
If the function is convex and $L$-smooth with respect to the norm $\|\cdot\|$ for which $\varphi$ is $\mu$-strongly convex, and $0 < \eta \le \frac{\mu}{L}$, we can substitute the quadratic upper bound
$$\langle \nabla f(x_t), x_t - x_{t+1} \rangle \le f(x_t) - f(x_{t+1}) + \frac{L}{2}\|x_t - x_{t+1}\|^2$$
into the mirror descent lemma (Theorem L14.5), obtaining
$$f(x_{t+1}) \le f(y) - \frac{1}{\eta} D_\varphi(y \,\|\, x_{t+1}) + \frac{1}{\eta} D_\varphi(y \,\|\, x_t) - \frac{1}{\eta} D_\varphi(x_{t+1} \,\|\, x_t) + \frac{L}{2}\|x_t - x_{t+1}\|^2$$
$$\le f(y) - \frac{1}{\eta} D_\varphi(y \,\|\, x_{t+1}) + \frac{1}{\eta} D_\varphi(y \,\|\, x_t) - \frac{\mu}{2\eta}\|x_{t+1} - x_t\|^2 + \frac{L}{2}\|x_t - x_{t+1}\|^2$$
$$\le f(y) - \frac{1}{\eta} D_\varphi(y \,\|\, x_{t+1}) + \frac{1}{\eta} D_\varphi(y \,\|\, x_t).$$
Following the same steps as in Lecture 12, telescoping and using the monotonicity of $f(x_t)$ proved in Theorem L14.4, we obtain the following guarantee.
Theorem L14.6. Let $f : \Omega \to \mathbb{R}$ be convex and $L$-smooth with respect to the norm $\|\cdot\|$ for which $\varphi$ is $\mu$-strongly convex. Furthermore, let $0 < \eta \le \frac{\mu}{L}$, and let $x^\star \in \Omega$ be a minimizer of the function. Then, at any time $t$, the iterate $x_t$ produced by the mirror descent algorithm satisfies
$$f(x_t) - f(x^\star) \le \frac{D_\varphi(x^\star \,\|\, x_0)}{\eta t}.$$
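As a closing illustration (our construction, reusing the hypothetical `mirror_descent` and `prox_entropy` sketches from earlier), consider entropic mirror descent on the simplex for $f(x) = \frac{1}{2}\|x - c\|_2^2$. Since $\|\nabla f(x) - \nabla f(y)\|_\infty = \|x - y\|_\infty \le \|x - y\|_1$, this $f$ is 1-smooth with respect to $\|\cdot\|_1$, so Theorem L14.6 permits any $0 < \eta \le \mu/L = 1$:

```python
import numpy as np

c = np.array([0.7, 0.2, 0.08, 0.02])  # f(x) = 0.5 * ||x - c||_2^2, minimized at c
grad_f = lambda x: x - c
x0 = np.full(4, 0.25)                  # uniform starting distribution
f = lambda x: 0.5 * np.sum((x - c) ** 2)

xs = mirror_descent(grad_f, prox_entropy, x0, eta=1.0, num_iters=200)
print([round(f(x), 6) for x in xs[::50]])  # values decrease monotonically toward 0
```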
L14.4 Further readings
More detailed treatments of the mirror descent algorithm can be found in several standard resources, including the nice monograph by Bubeck [Bub15].
[Bub15] Bubeck, S. (2015). Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3–4), 231–357. https://doi.org/10.1561/2200000050
Changelog
• Apr 10, 2025: Fixed reference to Lecture 4.
★ These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any reports of typos.
¹In this case, the set $\Omega$ is not closed, so the existence of the proximal step does not follow quite as directly. However, we can still show it using elementary arguments; we already encountered this in Lecture 2 and in PSet 1.