􀃏􀃏􀈘􀃏􀈘􀃏􀈖􀃏􀈖􀃏􀃏􀈘􀃏􀈘􀃏􀃏􀈘􀈖􀈘􀃏􀈘􀃏􀃏􀈘􀂸􀃏􀈘􀂸􀈘􀈖􀈘􀃏􀈘􀃏􀈘􀈖􀃏􀈘􀃏􀈖􀃏􀃏􀃏􀃏􀈘􀈘􀈘􀃏􀈚􀃏􀃏􀃏􀈚􀃏􀈘􀃏􀈚􀃏􀃏􀃏􀃏􀈘􀃏
􀃏􀈘􀈘􀃏
MIT 6.7220/15.084 — Nonlinear Optimization Thu, May 2nd 2024
Lecture 18
Online mirror descent
Instructor: Prof. Gabriele Farina ( gfarina@mit.edu)
These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
We saw in Lecture 9 that the mirror descent algorithm is a generalization of projected gradient
descent that works for generalized notions of distance. In this lecture, we will see how to apply mirror
descent to online learning problems. The resulting algorithm is called online mirror descent, often
abbreviated with the acronym OMD.
1 From mirror descent to online mirror descent
Recall that the mirror descent algorithm works with generalized notions of distances called Bregman
divergences. Bregman divergences are “generated” by a 1-strongly convex function 𝜑 : Ω → ℝ called
a distance-generating function. In turn, the Bregman divergence defines the proximal operator:
Prox𝜑(𝜂∇𝑓(𝑥𝑡), 𝑥𝑡) ≔ arg min_{𝑥 ∈ Ω} { 𝜂⟨∇𝑓(𝑥𝑡), 𝑥⟩ + D𝜑(𝑥 ‖ 𝑥𝑡) }.
When 𝜑 is the squared Euclidean norm 𝜑(𝑥) = (1/2)‖𝑥‖₂², the proximal operator reduces to the usual Euclidean projection operator ΠΩ(𝑥𝑡 − 𝜂∇𝑓(𝑥𝑡)), so that mirror descent reduces to regular (Euclidean) projected gradient descent.
With this notation under our belt, the regular (offline) version of mirror descent picks iterates ac-
cording to
𝑥𝑡+1 ≔ Prox𝜑(𝜂∇𝑓 (𝑥𝑡), 𝑥𝑡).
The online version of mirror descent is equally straightforward: simply replace the function 𝑓 with
the time-varying objective 𝑓𝑡:
𝑥𝑡+1 ≔ Prox𝜑(𝜂∇𝑓𝑡(𝑥𝑡), 𝑥𝑡) .
So, when 𝜑 is the squared Euclidean norm, the online mirror descent algorithm reduces to an online
version of projected gradient descent, called online projected gradient descent. As we remarked in
Lecture 9, other choices of distance-generating functions 𝜑 are possible depending on the domain
Ω of interest. In Section 2, we will for example see how the negative entropy distance-generating
function 𝜑(𝑥) = 𝑥1 log 𝑥1 + ⋯ + 𝑥𝑛 log 𝑥𝑛 leads to an especially appealing algorithm in the context
of games, where Ω is the set of randomized strategies of the players, that is, the set of probability
distributions over the actions available to each player.
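Before moving on, here is a minimal sketch of the generic online mirror descent loop in Python. The `prox` oracle and the `gradient_feedback` callback are placeholders not defined in these notes: they stand in for the proximal operator of whatever 𝜑 is chosen and for however the losses 𝑓𝑡 are revealed.

```python
def online_mirror_descent(prox, x1, gradient_feedback, eta, T):
    """Generic online mirror descent loop (sketch).

    prox(g, x):              returns argmin_{y in Omega} <g, y> + D_phi(y || x)
    x1:                      initial point in Omega
    gradient_feedback(t, x): returns the gradient of the loss f_t at the played point x
    eta:                     stepsize
    """
    x = x1
    iterates = [x1]
    for t in range(1, T + 1):
        g = gradient_feedback(t, x)   # observe grad f_t(x_t) after playing x_t
        x = prox(eta * g, x)          # x_{t+1} = Prox_phi(eta * grad f_t(x_t), x_t)
        iterates.append(x)
    return iterates
```

The two instantiations in Section 2 correspond to plugging specific prox oracles into this loop.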
1.1 Regret analysis
Theorem 1.1. Let ‖⋅‖ be the norm with respect to which the DGF 𝜑 is 1-strongly convex, and ‖⋅‖∗ be the dual norm. If all functions 𝑓𝑡 : Ω → ℝ are convex, the regret of online mirror descent is bounded by
𝑅𝑇 ≔ ∑_{𝑡=1}^{𝑇} 𝑓𝑡(𝑥𝑡) − min_{𝑥∈Ω} ∑_{𝑡=1}^{𝑇} 𝑓𝑡(𝑥) ≤ (1/𝜂) D𝜑(𝑥 ‖ 𝑥1) + (𝜂/2) ∑_{𝑡=1}^{𝑇} ‖∇𝑓𝑡(𝑥𝑡)‖∗².

In particular, assuming that all dual gradient norms are upper bounded by 𝐺, and setting

𝜂 ≔ √(2 D𝜑(𝑥 ‖ 𝑥1) / (𝐺²𝑇)),

we find

𝑅𝑇 ≤ 𝐺√(2 D𝜑(𝑥 ‖ 𝑥1) ⋅ 𝑇).
Proof. Since 𝑓𝑡 is convex, we can use the mirror descent lemma (see Lecture 9), which states that

𝑓𝑡(𝑥𝑡) ≤ 𝑓𝑡(𝑥) + ⟨∇𝑓𝑡(𝑥𝑡), 𝑥𝑡 − 𝑥𝑡+1⟩ − (1/𝜂) D𝜑(𝑥 ‖ 𝑥𝑡+1) + (1/𝜂) D𝜑(𝑥 ‖ 𝑥𝑡) − (1/𝜂) D𝜑(𝑥𝑡+1 ‖ 𝑥𝑡)   ∀𝑥 ∈ Ω.
Using the Cauchy-Schwarz inequality, we can bound the right-hand side by
𝑓𝑡(𝑥𝑡) ≤ 𝑓𝑡(𝑥) + ‖∇𝑓𝑡(𝑥𝑡)‖∗ ⋅ ‖𝑥𝑡 − 𝑥𝑡+1‖ − (1/𝜂) D𝜑(𝑥 ‖ 𝑥𝑡+1) + (1/𝜂) D𝜑(𝑥 ‖ 𝑥𝑡) − (1/𝜂) D𝜑(𝑥𝑡+1 ‖ 𝑥𝑡).
Using Young’s inequality, as well as the 1-strong convexity of 𝜑, which implies (1/𝜂) D𝜑(𝑥𝑡+1 ‖ 𝑥𝑡) ≥ (1/(2𝜂)) ‖𝑥𝑡+1 − 𝑥𝑡‖², we therefore have

𝑓𝑡(𝑥𝑡) ≤ 𝑓𝑡(𝑥) + (𝜂/2) ‖∇𝑓𝑡(𝑥𝑡)‖∗² + (1/(2𝜂)) ‖𝑥𝑡 − 𝑥𝑡+1‖² − (1/𝜂) D𝜑(𝑥 ‖ 𝑥𝑡+1) + (1/𝜂) D𝜑(𝑥 ‖ 𝑥𝑡) − (1/(2𝜂)) ‖𝑥𝑡+1 − 𝑥𝑡‖²
      ≤ 𝑓𝑡(𝑥) + (𝜂/2) ‖∇𝑓𝑡(𝑥𝑡)‖∗² − (1/𝜂) D𝜑(𝑥 ‖ 𝑥𝑡+1) + (1/𝜂) D𝜑(𝑥 ‖ 𝑥𝑡).
Summing over 𝑡 = 1, ..., 𝑇, we find

∑_{𝑡=1}^{𝑇} 𝑓𝑡(𝑥𝑡) ≤ ∑_{𝑡=1}^{𝑇} 𝑓𝑡(𝑥) + (𝜂/2) ∑_{𝑡=1}^{𝑇} ‖∇𝑓𝑡(𝑥𝑡)‖∗² − (1/𝜂) D𝜑(𝑥 ‖ 𝑥𝑇+1) + (1/𝜂) D𝜑(𝑥 ‖ 𝑥1).
Ignoring the negative term yields the first part of the statement.
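The second part of the statement follows by substituting the given value of 𝜂 and upper bounding each ‖∇𝑓𝑡(𝑥𝑡)‖∗² by 𝐺²: the right-hand side is then at most

(1/𝜂) D𝜑(𝑥 ‖ 𝑥1) + (𝜂/2) 𝐺²𝑇 = √(𝐺²𝑇 D𝜑(𝑥 ‖ 𝑥1)/2) + √(𝐺²𝑇 D𝜑(𝑥 ‖ 𝑥1)/2) = 𝐺√(2 D𝜑(𝑥 ‖ 𝑥1) ⋅ 𝑇).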
1.2 Coping with an unknown time horizon
One might object that in an online learning setting, the time horizon 𝑇 is not known in advance.
Luckily, it turns out that the bound we just established can be adapted to this case. We start with
a black-box reduction, called the doubling trick.
Remark 1.1. If 𝑇 is not known in advance, we can use the doubling trick to obtain a bound of 𝑂(𝐺√(D𝜑(𝑥 ‖ 𝑥1) ⋅ 𝑇)). The idea is to change the value of 𝜂 every time we reach an iteration count that is a power of 2. So, we set 𝜂 = √(2 D𝜑(𝑥 ‖ 𝑥1)/(𝐺² 2^𝑘)) for 2^𝑘 ≤ 𝑡 < 2^{𝑘+1}.
Proof. The regret incurred by the algorithm is upper bounded by the sum of the regrets incurred in each of the intervals 2^𝑘 ≤ 𝑡 < 2^{𝑘+1}. Suppose the algorithm has been run until time 𝑇, with 2^𝐾 ≤ 𝑇 < 2^{𝐾+1}. Hence, the regret is upper bounded by
Reg𝑇 ≤ 𝐺√(D𝜑(𝑥 ‖ 𝑥1)/2) ∑_{𝑘=0}^{𝐾} (√2)^𝑘 + ∑_{𝑘=0}^{𝐾−1} √(D𝜑(𝑥 ‖ 𝑥1)/(2𝐺² 2^𝑘)) 𝐺² 2^𝑘 + √(D𝜑(𝑥 ‖ 𝑥1)/(2𝐺² 2^𝐾)) 𝐺² (𝑇 − 2^𝐾)

      = 𝐺√(D𝜑(𝑥 ‖ 𝑥1)/2) (∑_{𝑘=0}^{𝐾} (√2)^𝑘 + ∑_{𝑘=0}^{𝐾−1} (√2)^𝑘) + √(D𝜑(𝑥 ‖ 𝑥1)/(2𝐺² 2^𝐾)) 𝐺² (𝑇 − 2^𝐾).

In particular, since 𝑇 < 2^{𝐾+1},

Reg𝑇 ≤ 𝐺√(D𝜑(𝑥 ‖ 𝑥1)/2) (∑_{𝑘=0}^{𝐾} (√2)^𝑘 + ∑_{𝑘=0}^{𝐾−1} (√2)^𝑘) + √(D𝜑(𝑥 ‖ 𝑥1)/(2𝐺² 2^𝐾)) 𝐺² 2^𝐾

      = 𝐺√(D𝜑(𝑥 ‖ 𝑥1)/2) (∑_{𝑘=0}^{𝐾} (√2)^𝑘 + ∑_{𝑘=0}^{𝐾} (√2)^𝑘)

      = 𝐺√(2 D𝜑(𝑥 ‖ 𝑥1)) ∑_{𝑘=0}^{𝐾} (√2)^𝑘

      = 𝐺√(2 D𝜑(𝑥 ‖ 𝑥1)) ⋅ ((√2)^{𝐾+1} − 1)/(√2 − 1)

      ≤ 𝐺√(2 D𝜑(𝑥 ‖ 𝑥1) ⋅ 𝑇) ⋅ √2/(√2 − 1).
Hence, the doubling trick guarantees a degradation in the regret bound by a factor of at most √2/(√2 − 1) ≈ 3.41.
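To illustrate the mechanics, here is a hedged Python sketch (not part of the original notes, and glossing over whether the iterate should also be restarted at the beginning of each interval): the doubling trick only changes how the stepsize is selected at each round. The bound `D_upper` on D𝜑(𝑥 ‖ 𝑥1) and the oracle names are placeholders.

```python
import math

def omd_with_doubling_trick(prox, x1, gradient_feedback, D_upper, G, T):
    """OMD with the doubling-trick stepsize schedule (sketch).

    D_upper: upper bound on D_phi(x || x_1) for the comparator x
    G:       upper bound on the dual norms of all observed gradients
    """
    x = x1
    for t in range(1, T + 1):
        k = int(math.log2(t))                          # interval index: 2^k <= t < 2^{k+1}
        eta = math.sqrt(2.0 * D_upper / (G**2 * 2**k)) # stepsize tuned for horizon 2^k
        g = gradient_feedback(t, x)
        x = prox(eta * g, x)
    return x
```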
Remark 1.2. An alternative approach to not knowing the time horizon is to dynamically change
the stepsize 𝜂 at all times 𝑡, by simply setting it to
𝜂𝑡 ≔ √(2 D𝜑(𝑥 ‖ 𝑥1) / (𝐺²𝑡))
(that is, replace 𝑇 with 𝑡 in the formula for 𝜂). This approach is sound, but working out the
details requires being more careful when using the telescoping argument enabled by the mirror
descent lemma used in the proof of Theorem 1.1.
2 Two notable examples
Let’s investigate applying the online mirror descent algorithm with two standard choices of distance-
generating functions: the squared Euclidean norm and negative entropy.
2.1 Online gradient descent
When the distance-generating function is the squared Euclidean norm, proximal steps are Euclidean projections. In this case, we are left with the algorithm

𝑥𝑡+1 ≔ ΠΩ(𝑥𝑡 − 𝜂∇𝑓𝑡(𝑥𝑡)),

which is known as online gradient descent (OGD).
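As an illustration, here is a minimal Python sketch, assuming for concreteness that Ω is a Euclidean ball of radius 𝑟 centered at the origin (a choice made only so that the projection has a simple closed form; it is not prescribed by the notes):

```python
import numpy as np

def project_onto_ball(x, r=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= r}."""
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

def online_gradient_descent(x1, gradient_feedback, eta, T, r=1.0):
    """Online projected gradient descent on a Euclidean ball (sketch)."""
    x = x1
    for t in range(1, T + 1):
        g = gradient_feedback(t, x)
        x = project_onto_ball(x - eta * g, r)   # x_{t+1} = Pi_Omega(x_t - eta * grad f_t(x_t))
    return x
```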
2.2 Multiplicative weights update
When Ω = Δ𝑛 is a probability simplex (this is the case when playing a normal-form game), the negative entropy is a natural choice for the distance-generating function (see also Lecture 9). In this case, we have the algorithm

𝑥𝑡+1 ∝ 𝑥𝑡 ⊙ exp{−𝜂∇𝑓𝑡(𝑥𝑡)},

where ⊙ denotes componentwise multiplication, and the exponential is again meant componentwise. Starting from the uniform distribution 𝑥1 = (1/𝑛, ..., 1/𝑛) and applying the general bound derived in Theorem 1.1, we find that the regret of this algorithm is bounded by

𝑅𝑇 ≤ 𝐺√(2 D𝜑(𝑥 ‖ 𝑥1) ⋅ 𝑇) ≤ 𝐺√(2 log(𝑛) ⋅ 𝑇),

where we used that D𝜑(𝑥 ‖ 𝑥1) ≤ log 𝑛 for every 𝑥 ∈ Δ𝑛 when 𝑥1 is uniform, and where again 𝐺 ≔ max𝑡 ‖∇𝑓𝑡(𝑥𝑡)‖∗.
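A minimal Python sketch of this update follows; the `gradient_feedback` oracle is a placeholder for however the loss gradients ∇𝑓𝑡(𝑥𝑡) are revealed (in a game, for example, these would be the expected payoff vectors).

```python
import numpy as np

def multiplicative_weights_update(n, gradient_feedback, eta, T):
    """Multiplicative weights update on the simplex Delta_n (sketch)."""
    x = np.full(n, 1.0 / n)              # x_1: uniform distribution
    for t in range(1, T + 1):
        g = gradient_feedback(t, x)      # grad f_t(x_t), e.g. a loss vector
        x = x * np.exp(-eta * g)         # componentwise multiplicative update
        x = x / x.sum()                  # renormalize onto the simplex
    return x
```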
Remark 2.1. In games, 𝐺 is bounded by the maximum absolute value of any payoff in the game.
Hence, the regret of the multiplicative weights update algorithm scales only with the logarithm
of the number of actions of the player!