MIT 6.7220/15.084 — Nonlinear Optimization (Spring ’25) Apr 8–10th 2025
Lecture 14
Projected gradient descent and mirror descent
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)★
We continue our exploration of first-order methods by considering the case of constrained optimization problems of the form
$$\min_{x} \;\; f(x) \quad \text{s.t.} \;\; x \in \Omega \subseteq \mathbb{R}^n,$$
where $f$ is differentiable and $\Omega$ is a closed and convex set (with only one technical exception that we will highlight later).
L14.1 Projected gradient descent
When applied without modifications to a constrained optimization problem, the gradient descent algorithm might quickly produce iterates $x_t$ that leave the feasible set. The idea behind projected gradient descent is very intuitive: to avoid infeasible iterates, project the iterates of gradient descent back onto $\Omega$ at every iteration. This leads to the update rule
$$x_{t+1} \coloneqq \Pi_\Omega(x_t - \eta \nabla f(x_t)), \tag{1}$$
where the operation $\Pi_\Omega$ denotes Euclidean projection onto $\Omega$. (Remember that projection onto a closed convex set exists and is unique.)
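To make update (1) concrete, here is a minimal Python sketch (our illustration, not part of the original notes). It assumes $\Omega$ is the Euclidean unit ball, for which the projection has the closed form $\Pi_\Omega(z) = z / \max\{1, \|z\|_2\}$, and runs the method on a small quadratic.

```python
import numpy as np

def project_ball(z, radius=1.0):
    # Euclidean projection onto {x : ||x||_2 <= radius}: rescale z
    # toward the origin whenever it lies outside the ball.
    norm = np.linalg.norm(z)
    return z if norm <= radius else (radius / norm) * z

def projected_gradient_descent(grad_f, project, x0, eta, num_iters):
    # Update rule (1): x_{t+1} = Pi_Omega(x_t - eta * grad f(x_t)).
    x = x0
    for _ in range(num_iters):
        x = project(x - eta * grad_f(x))
    return x

# Toy problem: minimize f(x) = 0.5 * ||x - c||_2^2 over the unit ball,
# with c outside the ball, so the minimizer is c / ||c||_2.
c = np.array([2.0, 1.0])
x = projected_gradient_descent(lambda v: v - c, project_ball,
                               np.zeros(2), eta=0.5, num_iters=100)
print(x, c / np.linalg.norm(c))  # the two vectors should nearly coincide
```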
As it turns out, the projected gradient descent algorithm behaves fundamentally like the gradient descent algorithm. In particular, the gradient descent lemma and the Euclidean mirror descent lemma can be generalized to projected gradient descent with little effort. As a corollary, the same convergence rate of $f(x_t) - f(x^\star) \le \frac{1}{\eta t}\|x_0 - x^\star\|_2^2$ for $L$-smooth functions when $0 < \eta \le \frac{1}{L}$ applies to projected gradient descent as well.
Instead of developing the results for projected gradient descent directly, we introduce a generalization of the algorithm that is more permissive in the notion of projection used. The correctness of projected gradient descent will then follow as a direct corollary of this generalization.
L14.2 Generalized projection: The proximal step
Depending on the feasible set $\Omega$, Euclidean projections onto $\Omega$ might be expensive to compute in practice. A generalization of projected gradient descent called mirror descent affords more flexibility in the notion of distance used in the projection.
L14.2.1 Distance-generating functions (DGFs)
An interesting generalization of the distance between two points is that of a Bregman divergence (also called Bregman distance). A Bregman divergence is not a distance in the technical sense—for example, it is not necessarily symmetric, and it need not satisfy the triangle inequality.
A Bregman divergence is constructed starting from a strongly convex function $\varphi$, called the distance-generating function (DGF) for the divergence.
Definition L14.1. Let $\varphi : \Omega \to \mathbb{R}$ be a differentiable and $\mu$-strongly convex ($\mu > 0$) function with respect to a norm $\|\cdot\|$, that is, satisfying
$$\varphi(x) \ge \varphi(y) + \langle \nabla \varphi(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2 \quad \forall x, y \in \Omega.$$
The Bregman divergence centered at $y \in \Omega$ is the function $D_\varphi(x \,\|\, y)$ defined as
$$D_\varphi(x \,\|\, y) \coloneqq \varphi(x) - \varphi(y) - \langle \nabla \varphi(y), x - y \rangle.$$
Note that from its very definition it is clear that
$$D_\varphi(x \,\|\, x) = 0 \;\; \forall x \in \Omega \qquad \text{and} \qquad D_\varphi(x \,\|\, y) \ge \frac{\mu}{2}\|x - y\|^2 \;\; \forall x, y \in \Omega, \tag{2}$$
which in particular implies that $D_\varphi(x \,\|\, y) = 0$ if and only if $x = y$. We now mention two very important special cases of Bregman divergences.
• When $\Omega$ is arbitrary and the DGF $\varphi$ is set to the squared Euclidean norm
$$\varphi(x) \coloneqq \frac{1}{2}\|x\|_2^2,$$
which is 1-strongly convex with respect to $\|\cdot\|_2$, the corresponding Bregman divergence is the squared Euclidean distance
$$D_\varphi(x \,\|\, y) = \frac{1}{2}\|x - y\|_2^2.$$
Indeed, from the definition,
$$D_\varphi(x \,\|\, y) = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle y, x - y \rangle = \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|y\|_2^2 - \langle y, x \rangle + \|y\|_2^2 = \frac{1}{2}\|x - y\|_2^2.$$
• When $\Omega = \mathring{\Delta}^n$ is the set of full-support distributions over $n$ objects,¹ and the distance-generating function $\varphi$ is set to the negative entropy function
$$\varphi(x) \coloneqq \sum_{i=1}^{n} x_i \log x_i,$$
which is 1-strongly convex with respect to the $\ell_1$ norm $\|\cdot\|_1$, [▷ You should check this!] the corresponding Bregman divergence is the Kullback-Leibler (KL) divergence [▷ And this too!]
$$D_\varphi(x \,\|\, y) = \sum_{i=1}^{n} x_i \log \frac{x_i}{y_i},$$
a commonly used notion of distance between distributions in machine learning and statistics. (Both special cases are checked numerically in the sketch after this list.)
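Both claims in the bullets above are easy to test numerically. The following sketch (our addition; the final bound is Pinsker's inequality, the $\mu = 1$, $\ell_1$ instance of (2)) implements $D_\varphi$ for a generic DGF and checks the two special cases at random points.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x || y) = phi(x) - phi(y) - <grad phi(y), x - y>.
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# Special case 1: squared Euclidean DGF gives D(x||y) = 0.5 * ||x - y||_2^2.
phi_euc = lambda v: 0.5 * np.dot(v, v)
grad_euc = lambda v: v

# Special case 2: negative entropy DGF on the simplex gives the KL divergence.
phi_ent = lambda v: np.sum(v * np.log(v))
grad_ent = lambda v: np.log(v) + 1.0

rng = np.random.default_rng(0)
x, y = rng.random(4), rng.random(4)
assert np.isclose(bregman(phi_euc, grad_euc, x, y), 0.5 * np.sum((x - y) ** 2))

p, q = rng.random(4), rng.random(4)
p, q = p / p.sum(), q / q.sum()                 # two full-support distributions
kl = np.sum(p * np.log(p / q))
assert np.isclose(bregman(phi_ent, grad_ent, p, q), kl)
assert kl >= 0.5 * np.sum(np.abs(p - q)) ** 2   # the mu = 1, l1 instance of (2)
```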
A useful fact about Bregman divergences is that for any center $y$, they are as strongly convex in $x$ as the original distance-generating function $\varphi$. More precisely, we have the following.
Theorem L14.1. Let $\varphi : \Omega \to \mathbb{R}$ be differentiable and $\mu$-strongly convex with respect to a norm $\|\cdot\|$. For any $y \in \Omega$, the function $x \mapsto D_\varphi(x \,\|\, y)$ is $\mu$-strongly convex with respect to $\|\cdot\|$, that is,
$$D_\varphi(x' \,\|\, y) \ge D_\varphi(x \,\|\, y) + \langle \nabla_x D_\varphi(x \,\|\, y), x' - x \rangle + \frac{\mu}{2}\|x' - x\|^2 \quad \forall x, x' \in \Omega.$$
Proof. Using the definition of the Bregman divergence $D_\varphi(\cdot \,\|\, \cdot)$, we have
$$\nabla_x D_\varphi(x \,\|\, y) = \nabla \varphi(x) - \nabla \varphi(y),$$
so after expanding the definition of $D_\varphi(\cdot \,\|\, \cdot)$ in the inequality of the statement, the statement is
$$\varphi(x') \ge \varphi(x) + \langle \nabla \varphi(y), x' - x \rangle + \langle \nabla \varphi(x) - \nabla \varphi(y), x' - x \rangle + \frac{\mu}{2}\|x' - x\|^2 \quad \forall x, x' \in \Omega,$$
that is, $\varphi(x') \ge \varphi(x) + \langle \nabla \varphi(x), x' - x \rangle + \frac{\mu}{2}\|x' - x\|^2$ for all $x, x' \in \Omega$, which follows by the assumption of $\mu$-strong convexity of $\varphi$ with respect to $\|\cdot\|$. □
L14.2.2 Proximal steps
Proximal steps generalize the steps followed by the projected gradient descent algorithm (1). The key insight is the following: instead of interpreting (1) as the projection of the point $x_t - \eta \nabla f(x_t)$, we can interpret $x_{t+1}$ as just another manifestation of the key principle of gradient descent-type algorithms: hoping that the objective can be approximated well with its first-order Taylor expansion in a neighborhood of each point. It then follows naturally that each updated point $x_{t+1}$ produced by a gradient descent-type algorithm should trade off two competing objectives:
• moving as much as possible in the direction $-\nabla f(x_t)$; and
• staying within $\Omega$, in a neighborhood centered around the point $x_t$, so as not to move too far.
The stepsize parameter $\eta > 0$ controls the tradeoff between the competing objectives.
When using a generic Bregman divergence $D_\varphi(\cdot \,\|\, \cdot)$ as the notion of distance, the tradeoff between these two competing objectives can be formalized as the proximal step problem
$$\mathrm{Prox}_\varphi(\eta \nabla f(x_t), x_t) \coloneqq \arg\min_{x} \;\; \eta \langle \nabla f(x_t), x \rangle + D_\varphi(x \,\|\, x_t) \quad \text{s.t.} \;\; x \in \Omega.$$
We show in the next subsection that proximal steps are well-defined—that is, the solution to the optimization problem above exists and is unique. This leads to the mirror descent algorithm, defined by the update
$$x_{t+1} \coloneqq \mathrm{Prox}_\varphi(\eta \nabla f(x_t), x_t). \tag{3}$$
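In code, mirror descent is gradient descent with the projection swapped out for a proximal oracle. A minimal sketch of update (3) (ours; the argument name `prox` is a hypothetical interface), with the oracle left abstract so that any DGF can be plugged in:

```python
def mirror_descent(grad_f, prox, x0, eta, num_iters):
    # Update (3): x_{t+1} = Prox_phi(eta * grad f(x_t), x_t), where
    # prox(g, x) returns argmin over z in Omega of <g, z> + D_phi(z || x).
    x = x0
    iterates = [x0]
    for _ in range(num_iters):
        x = prox(eta * grad_f(x), x)
        iterates.append(x)
    return iterates
```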
■ The Euclidean DGF recovers Euclidean projection. As a sanity check to convince ourselves that the abstraction of the proximal step is reasonable, we can verify that it generalizes the steps of projected gradient descent (1)—and therefore also of gradient descent, which is just projected gradient descent in which $\Omega = \mathbb{R}^n$. We do so in the next theorem.
Theorem L14.2. Consider the squared Euclidean norm distance-generating function $\varphi(x) = \frac{1}{2}\|x\|_2^2$. Then, proximal steps and projected gradient steps (1) are equivalent:
$$\mathrm{Prox}_\varphi(\eta \nabla f(x), x) = \Pi_\Omega(x - \eta \nabla f(x)) \quad \forall x \in \Omega.$$
Proof. The Euclidean projection problem is given by
$$\min_{y} \;\; \frac{1}{2}\|y - x + \eta \nabla f(x)\|_2^2 \quad \text{s.t.} \;\; y \in \Omega.$$
Expanding the squared Euclidean norm in the objective and removing terms that do not depend on the optimization variable $y$, we can rewrite the problem as
$$\min_{y} \;\; \frac{1}{2}\|y - x\|_2^2 + \eta \langle \nabla f(x), y - x \rangle \quad \text{s.t.} \;\; y \in \Omega,$$
which is exactly the proximal step problem, since $D_\varphi(y \,\|\, x) = \frac{1}{2}\|y - x\|_2^2$ as observed in the previous section. □
■ The negative entropy DGF recovers the softmax update. Proximal steps are very useful when computing Euclidean projections is expensive. For example, in the case of the negative entropy distance-generating function for full-support distributions, we can use the result in Lecture 2 to show that the proximal step corresponds to the softmax update
$$x_{t+1} \propto x_t \odot \exp\{-\eta \nabla f(x_t)\},$$
where $\odot$ denotes elementwise product. [▷ Try to work out the details!] Such a generalized notion of projection is significantly more practical than the algorithm for Euclidean projection you developed in Homework 1.
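For the negative-entropy DGF, the proximal oracle from the sketch above becomes a two-liner; normalizing turns the proportionality into an equality. A sketch, under the same assumptions as before:

```python
import numpy as np

def prox_entropy(g, x):
    # Entropic proximal step on the full-support simplex:
    # returns the distribution proportional to x * exp(-g).
    z = x * np.exp(-g)
    return z / z.sum()
```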
L14.2.3 Properties of proximal steps
We now mention a few important properties of proximal steps.
■ Proximal steps exist. The argument is analogous to the one we used in Lecture 1 to argue the existence of Euclidean projections. Consider a generic proximal step $\mathrm{Prox}_\varphi(g, y)$, in which the objective function to be minimized over $x \in \Omega$ is accordingly
$$h(x) \coloneqq \langle g, x \rangle + D_\varphi(x \,\|\, y).$$
The infimum of the function over $\Omega$ must be less than or equal to the value of the objective at the valid choice $x = y$.
To apply the Weierstrass theorem, we must show that we can safely restrict the domain to a compact subset. To do so, we can use the knowledge, from (2), that $D_\varphi(x \,\|\, y) \ge \frac{\mu}{2}\|x - y\|^2$ for all $x \in \Omega$, where $\mu > 0$ and $\|\cdot\|$ are the strong convexity parameter and strong convexity norm of the underlying DGF $\varphi$. The value of the increment $h(x) - h(y)$ can therefore be lower-bounded using the generalized Cauchy-Schwarz inequality as
$$h(x) - h(y) \ge \langle g, x - y \rangle + \frac{\mu}{2}\|x - y\|^2 \ge \frac{\mu}{2}\|x - y\| \left( \|x - y\| - \frac{2}{\mu}\|g\|_* \right).$$
This shows that for any point $x$ such that $\|x - y\| \ge \frac{2}{\mu}\|g\|_*$ we have $h(x) \ge h(y)$. Therefore, we can restrict the minimization of $h(x)$ to the compact set defined by the intersection of $\Omega$ with the closed ball of radius $\frac{2}{\mu}\|g\|_*$ centered at $y$, and the Weierstrass theorem guarantees the existence of a minimizer of $h$ in this compact restriction of the domain.
■ Proximal steps are unique. The objective function minimized in proximal steps is defined as the sum of a linear function plus a Bregman divergence with a fixed center. Since Bregman divergences are strongly convex by Theorem L14.1, and linear terms do not affect strong convexity, the proximal step problem minimizes a strongly convex objective on a convex set. The uniqueness of the solution is therefore guaranteed (see Lecture 4).
■ The three-point inequality for proximal steps. The following property is key in many proofs involving proximal steps. For that reason, we give it a name.
Theorem L14.3 (Three-point inequality for proximal steps). Consider a generic proximal step
$$x' = \mathrm{Prox}_\varphi(g, x).$$
Then,
$$\langle -g, y - x' \rangle \le -D_\varphi(y \,\|\, x') + D_\varphi(y \,\|\, x) - D_\varphi(x' \,\|\, x) \quad \forall y \in \Omega.$$
Proof. The objective function of the proximal step problem is given by
$$h(z) \coloneqq \langle g, z \rangle + D_\varphi(z \,\|\, x), \quad z \in \Omega.$$
The first-order optimality conditions applied to the solution $z = x'$ are therefore
$$-\nabla h(x') \in \mathcal{N}_\Omega(x') \iff -g - \nabla \varphi(x') + \nabla \varphi(x) \in \mathcal{N}_\Omega(x')$$
$$\iff \langle -g - \nabla \varphi(x') + \nabla \varphi(x), \; y - x' \rangle \le 0 \quad \forall y \in \Omega$$
$$\iff \langle -g, y - x' \rangle \le \langle \nabla \varphi(x') - \nabla \varphi(x), \; y - x' \rangle \quad \forall y \in \Omega.$$
The statement now follows by using the identity
$$\langle \nabla \varphi(x') - \nabla \varphi(x), \; y - x' \rangle = -D_\varphi(y \,\|\, x') + D_\varphi(y \,\|\, x) - D_\varphi(x' \,\|\, x),$$
which can be checked directly from the definition of Bregman divergence. [▷ Verify this!] □
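As an alternative to verifying the identity by hand, here is a quick numerical spot check (our addition), using the negative-entropy DGF so that $D_\varphi$ is the KL divergence:

```python
import numpy as np

def kl(a, b):
    # KL divergence = Bregman divergence of the negative entropy DGF.
    return np.sum(a * np.log(a / b))

grad_phi = lambda v: np.log(v) + 1.0  # gradient of sum_i v_i log v_i

rng = np.random.default_rng(1)
x, xp, y = (v / v.sum() for v in rng.random((3, 5)))  # random full-support points
lhs = np.dot(grad_phi(xp) - grad_phi(x), y - xp)
rhs = -kl(y, xp) + kl(y, x) - kl(xp, x)
assert np.isclose(lhs, rhs)  # the three-point identity holds
```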
Corollary L14.1. An important corollary of the three-point inequality is obtained by setting $y = x$. In that case, the three-point inequality simplifies to
$$\langle -g, x - x' \rangle \le -D_\varphi(x \,\|\, x') - D_\varphi(x' \,\|\, x).$$
Corollary L14.2. Continuing from Corollary L14.1, using the strong convexity bound (see Theorem L14.1) $D_\varphi(x \,\|\, x') + D_\varphi(x' \,\|\, x) \ge \mu \|x - x'\|^2$ to bound the right-hand side, and the generalized Cauchy-Schwarz inequality to bound the left-hand side, we find a bound on the norm of the proximal step:
$$\|x' - x\| \le \frac{1}{\mu}\|g\|_*.$$
L14.3 Analysis of mirror descent
As we have discussed in Lectures 12 and 13, the analysis of gradient descent (and its many variants and generalizations) typically goes through two fundamental—and conceptually complementary—lemmas:
• the gradient descent lemma, stating that
$$f(x_{t+1}) \le f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|_2^2$$
when $f$ is $L$-smooth and $0 < \eta \le \frac{1}{L}$; and
• the Euclidean mirror descent lemma, which states that
$$f(x_t) \le f(y) + \frac{1}{2\eta}\left( \|y - x_t\|_2^2 - \|y - x_{t+1}\|_2^2 + \|x_{t+1} - x_t\|_2^2 \right) \quad \forall y \in \mathbb{R}^n$$
for convex $f$ and arbitrary stepsize $\eta > 0$.
We now show that with little effort, we can generalize those results to the case of the mirror
descent algorithm.
L14.3.1 Generalizing the gradient descent lemma
We start by generalizing the gradient descent lemma.
Theorem L14.4. Let $f : \Omega \to \mathbb{R}$ be $L$-smooth with respect to the norm $\|\cdot\|$ for which $\varphi$ is $\mu$-strongly convex, and let $0 < \eta \le \frac{\mu}{L}$. Each step of the mirror descent algorithm (3) satisfies
$$f(x_{t+1}) \le f(x_t) - \frac{\mu}{2\eta}\|x_t - x_{t+1}\|^2.$$
Proof. From the quadratic upper bound, we have
$$f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_t - x_{t+1}\|^2.$$
Using Corollary L14.1 (with $g = \eta \nabla f(x_t)$, $x = x_t$, and $x' = x_{t+1}$) we therefore find
$$f(x_{t+1}) \le f(x_t) - \frac{1}{\eta} D_\varphi(x_{t+1} \,\|\, x_t) - \frac{1}{\eta} D_\varphi(x_t \,\|\, x_{t+1}) + \frac{L}{2}\|x_t - x_{t+1}\|^2$$
$$\le f(x_t) + \left( \frac{L}{2} - \frac{\mu}{\eta} \right) \|x_t - x_{t+1}\|^2$$
$$\le f(x_t) - \frac{\mu}{2\eta}\|x_t - x_{t+1}\|^2,$$
which is the statement. □
As expected, we find that the decrease in function value is monotonic, just like in the
unconstrained case.
L14.3.2 The “full” mirror descent lemma
We continue by generalizing the Euclidean mirror descent lemma to its fully general version for arbitrary Bregman divergences. In particular, from Theorem L14.3 we have the following.
Theorem L14.5. Let $f : \Omega \to \mathbb{R}$ be convex. Each step of the mirror descent algorithm (3) satisfies, for all $y \in \Omega$,
$$f(x_t) \le f(y) + \langle \nabla f(x_t), x_t - x_{t+1} \rangle - \frac{1}{\eta} D_\varphi(y \,\|\, x_{t+1}) + \frac{1}{\eta} D_\varphi(y \,\|\, x_t) - \frac{1}{\eta} D_\varphi(x_{t+1} \,\|\, x_t).$$
Proof. Using the linear lower bound property of convex functions (Lecture 4), we can write
$$f(x_t) \le f(y) - \langle \nabla f(x_t), y - x_t \rangle = f(y) + \langle \nabla f(x_t), x_t - x_{t+1} \rangle - \langle \nabla f(x_t), y - x_{t+1} \rangle.$$
On the other hand, from Theorem L14.3 applied to the mirror descent step (that is, for the choices $g = \eta \nabla f(x_t)$, $x' = x_{t+1}$, $x = x_t$), we have
$$-\eta \langle \nabla f(x_t), y - x_{t+1} \rangle \le -D_\varphi(y \,\|\, x_{t+1}) + D_\varphi(y \,\|\, x_t) - D_\varphi(x_{t+1} \,\|\, x_t).$$
Hence, by dividing by $\eta$ and substituting into the previous inequality, we obtain the statement. □
Observe that when $\varphi$ is the squared Euclidean norm DGF, we recover exactly the Euclidean mirror descent lemma in the unconstrained case, upon substituting $\nabla f(x_t) = \frac{1}{\eta}(x_t - x_{t+1})$.
L14.3.3 Convergence guarantees for $L$-smooth functions
If the function is convex and $L$-smooth with respect to the norm $\|\cdot\|$ for which $\varphi$ is $\mu$-strongly convex, and $0 < \eta \le \frac{\mu}{L}$, we can substitute the quadratic upper bound
$$\langle \nabla f(x_t), x_t - x_{t+1} \rangle \le f(x_t) - f(x_{t+1}) + \frac{L}{2}\|x_t - x_{t+1}\|^2$$
into the mirror descent lemma (Theorem L14.5), obtaining
$$f(x_{t+1}) \le f(y) - \frac{1}{\eta} D_\varphi(y \,\|\, x_{t+1}) + \frac{1}{\eta} D_\varphi(y \,\|\, x_t) - \frac{1}{\eta} D_\varphi(x_{t+1} \,\|\, x_t) + \frac{L}{2}\|x_t - x_{t+1}\|^2$$
$$\le f(y) - \frac{1}{\eta} D_\varphi(y \,\|\, x_{t+1}) + \frac{1}{\eta} D_\varphi(y \,\|\, x_t) - \frac{\mu}{2\eta}\|x_{t+1} - x_t\|^2 + \frac{L}{2}\|x_t - x_{t+1}\|^2$$
$$\le f(y) - \frac{1}{\eta} D_\varphi(y \,\|\, x_{t+1}) + \frac{1}{\eta} D_\varphi(y \,\|\, x_t).$$
Following the same steps as in Lecture 12, telescoping and using the monotonicity of $f(x_t)$ proved in Theorem L14.4, we obtain the following guarantee.
Theorem L14.6. Let $f : \Omega \to \mathbb{R}$ be convex and $L$-smooth with respect to the norm $\|\cdot\|$ for which $\varphi$ is $\mu$-strongly convex. Furthermore, let $0 < \eta \le \frac{\mu}{L}$, and let $x^\star \in \Omega$ be a minimizer of the function. Then, at any time $t$, the iterate $x_t$ produced by the mirror descent algorithm satisfies
$$f(x_t) - f(x^\star) \le \frac{D_\varphi(x^\star \,\|\, x_0)}{\eta t}.$$
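As a closing illustration (our construction, reusing the hypothetical `mirror_descent` and `prox_entropy` sketches from earlier), consider entropic mirror descent on the simplex for $f(x) = \frac{1}{2}\|x - c\|_2^2$. Since $\|\nabla f(x) - \nabla f(y)\|_\infty = \|x - y\|_\infty \le \|x - y\|_1$, this $f$ is 1-smooth with respect to $\|\cdot\|_1$, so Theorem L14.6 permits any $0 < \eta \le \mu/L = 1$:

```python
import numpy as np

c = np.array([0.7, 0.2, 0.08, 0.02])  # f(x) = 0.5 * ||x - c||_2^2, minimized at c
grad_f = lambda x: x - c
x0 = np.full(4, 0.25)                  # uniform starting distribution
f = lambda x: 0.5 * np.sum((x - c) ** 2)

xs = mirror_descent(grad_f, prox_entropy, x0, eta=1.0, num_iters=200)
print([round(f(x), 6) for x in xs[::50]])  # values decrease monotonically toward 0
```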
L14.4 Further readings
More detailed treatments of the mirror descent algorithm can be found in several standard resources, including the nice monograph by Bubeck [Bub15].
[Bub15] Bubeck, S. (2015). Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 8(3–4), 231–357. https://doi.org/10.1561/2200000050
Changelog
• Apr 10, 2025: Fixed reference to Lecture 4.
★ These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any reports of typos.
¹In this case, the set $\Omega$ is not closed, so the existence of the proximal step does not follow quite as directly. However, we can still show it using elementary arguments; we already encountered this in Lecture 2 and in PSet 1.