􀈚
􀈠

MIT 6.7220/15.084 — Nonlinear Optimization Tue, Apr 30th 2024
Lecture 17
Online optimization and regret
Instructor: Prof. Gabriele Farina ( gfarina@mit.edu)
These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
Online optimization has been a central topic in machine learning and optimization for the past two
decades. The main idea is to design algorithms that can make decisions in an online fashion, in a
nonstationary environment.
1 Online optimization
Online optimization is an abstraction for repeated decision-making. In particular, at each time step
𝑡, we want to select a point 𝑥𝑡 ∈ Ω, where Ω is a specified feasible set. Based on the decision 𝑥𝑡,
the algorithm then receives a loss function 𝑓𝑡 : Ω → ℝ that reveals the quality of the decision. Both
the decision-maker and the environment have memory: the decision-maker can take into account the
past losses 𝑓1, ..., 𝑓𝑡−1 when choosing the next decision 𝑥𝑡 ∈ Ω, and the environment can generate losses 𝑓𝑡, possibly adversarially, based on all decisions 𝑥1, ..., 𝑥𝑡 observed so far.
Depending on the specific assumptions regarding how adversarial the environment is, and how much
is revealed regarding 𝑓𝑡 to the online optimizer (that is, the decision-maker), people have special
names for some settings. For example:
• The environment is said to be stochastic if the 𝑓𝑡 are drawn from an unknown fixed distribution.
• The problem is said to be bandit if the decision-maker only observes the loss 𝑓𝑡(𝑥𝑡) of the
decision 𝑥𝑡 made at time 𝑡, instead of the function 𝑓𝑡 itself.
• Sometimes people like to refer to the setting where the decision-maker has full knowledge of
the loss functions 𝑓𝑡 as full-information online optimization.
Overall, we will focus in the next couple of lectures on the full-information, adversarial case. Most
of the other settings can often be reduced to this one, including the challenging adversarial bandit
setting.
The performance of an online optimization algorithm is typically measured by its regret.
Definition 1.1 (Regret). The regret incurred by a sequence of decisions 𝑥1, 𝑥2, ..., 𝑥𝑇 ∈ Ω given
the revealed loss functions 𝑓1, 𝑓2, ..., 𝑓𝑇 is defined as
$$\mathrm{Reg}_T := \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \Omega} \sum_{t=1}^{T} f_t(x),$$
that is, the difference between the loss accumulated and the minimum loss that could have been
accumulated in hindsight by the best, fixed decision in Ω.
The goal of an online optimization algorithm is to guarantee that the regret is as small as possible.
In particular, we are interested in algorithms that have sublinear regret as a function of the number
of iterations 𝑇 of the algorithm.
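For concreteness, here is a minimal sketch of how regret can be computed, assuming linear losses 𝑓𝑡(𝑥) = ⟨ℓ𝑡, 𝑥⟩ over the probability simplex so that the best fixed decision in hindsight has a closed form. The setup and function names are illustrative choices, not part of the lecture material.

```python
import numpy as np

def regret_linear_losses(loss_vectors, decisions):
    """Regret (Definition 1.1) for linear losses f_t(x) = <l_t, x> over the simplex.

    loss_vectors: array of shape (T, n), one loss vector per time step
    decisions:    array of shape (T, n), one point of the simplex per time step
    """
    loss_vectors = np.asarray(loss_vectors, dtype=float)
    decisions = np.asarray(decisions, dtype=float)
    accumulated = np.einsum("ti,ti->", loss_vectors, decisions)
    # For linear losses over the simplex, the best fixed decision in hindsight
    # is a pure action: the one with the smallest cumulative loss.
    best_in_hindsight = loss_vectors.sum(axis=0).min()
    return accumulated - best_in_hindsight
```

Sublinear regret then means that this quantity, divided by 𝑇, vanishes as 𝑇 grows.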
Remark 1.1. Regret can be negative! This is because the decision-maker, who can make time-dependent decisions against the unknown loss functions, is being compared against the best fixed decision chosen in hindsight.
Other performance metrics have been proposed, such as dynamic regret, switching regret, swap regret, and more. However, the standard notion of regret defined above is (i) the most common, (ii) already nontrivial to keep sublinear, and (iii) already extremely useful in applications.¹ More specifically, as we will see next, the notion of regret defined above is related to equilibrium approximation in game theory.

¹To distinguish the notion of regret defined in Definition 1.1 from these other notions, sometimes people like to refer to the notion in Definition 1.1 using the terms static regret (common in the online/reinforcement learning theory literature) or external regret (common in the game theory literature).
Today, it is known how to construct online optimization algorithms that achieve sublinear regret
in a wide range of settings. Specifically, when Ω is closed and convex and the losses 𝑓𝑡 are convex
and Lipschitz-continuous, it is well-understood that sublinear regret can be achieved via an online
generalization of the mirror descent algorithm that we will discuss in the next lecture.
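As a preview of the next lecture, the following is a minimal sketch of the simplest member of that family, online projected (sub)gradient descent on a Euclidean ball, which already guarantees 𝑂(√𝑇) regret for convex Lipschitz losses. The feasible set, step size, and interface below are my own illustrative choices, not the algorithm as it will be presented in class.

```python
import numpy as np

def project_to_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def online_gradient_descent(gradient_oracle, dim, T, radius=1.0, lipschitz=1.0):
    """Online projected gradient descent.

    gradient_oracle(t, x) returns a (sub)gradient of the loss f_t at x, and is
    queried only after x_t has been played. With step size eta = R / (G sqrt(T)),
    the regret against convex G-Lipschitz losses is O(R G sqrt(T)), i.e., sublinear.
    """
    eta = radius / (lipschitz * np.sqrt(T))
    x = np.zeros(dim)
    played = []
    for t in range(T):
        played.append(x.copy())                     # commit to x_t
        g = gradient_oracle(t, x)                   # then observe feedback on f_t
        x = project_to_ball(x - eta * g, radius)    # descend and project
    return played
```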
2 Connections to decision-making and game theory
Online optimization has important applications to decision-making and game theory. Techniques
based on online optimization are used in a wide range of applications, including reinforcement learn-
ing, online advertising, online auctions, and game solving. In particular, landmark AI results such
as superhuman performance in poker have been achieved using online optimization techniques.
In this section, we will see how online optimization can be useful in solving nonsequential, simulta-
neous-action games. In a later class, we will see how to use online optimization to solve imperfect-
information sequential games like poker.
2.1 Normal-form games
A normal-form game is a game in which players simultaneously choose actions and receive payoffs.
The prototypical example of a normal-form game is rock-paper-scissors.
Actions and payoffs. In a normal-form game, each player 𝑖 has a set of actions 𝐴𝑖, and each player
receives a payoff that depends on the actions chosen by all players. For this class, we will focus on
two-player games. Placing the actions of Player 1 on the rows and the actions of Player 2 on the columns, we can arrange the payoff obtained by each player 𝑖 ∈ {1, 2} into a matrix 𝑈𝑖 ∈ ℝ^{𝐴1×𝐴2}.
Example 2.1. In rock-paper-scissors, we have
$$U_1 := \begin{pmatrix} 0 & -1 & 1 \\ 1 & 0 & -1 \\ -1 & 1 & 0 \end{pmatrix}, \qquad U_2 := -U_1.$$
When the payoffs are opposite for the players, that is, 𝑈2 = −𝑈1, the game is aptly called zero-sum.
Definition 2.1 (Zero-sum game). A two-player game is said to be zero-sum if 𝑈2 = −𝑈1.
Mixed strategies. Players can randomize their choice of actions, that is, play according to a probability distribution. A mixed strategy for Player 1 is a vector 𝑥 ∈ Δ(𝐴1), where Δ(𝐴1) is the set of
probability distributions over 𝐴1. Similarly, a mixed strategy for Player 2 is a vector 𝑦 ∈ Δ(𝐴2).
Given a choice of mixed strategies for each player, the expected utility for each player is then given by
$$u_1(x, y) = x^\top U_1 y, \qquad u_2(x, y) = x^\top U_2 y.$$
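To make the bilinear form concrete, here is a small illustration (mine, not from the notes) that evaluates both expected utilities in rock-paper-scissors, using the payoff matrix from Example 2.1 and assuming the action order rock, paper, scissors.

```python
import numpy as np

# Payoff matrix of Player 1 in rock-paper-scissors (Example 2.1); U2 = -U1.
U1 = np.array([[ 0, -1,  1],
               [ 1,  0, -1],
               [-1,  1,  0]], dtype=float)
U2 = -U1

def expected_utilities(x, y):
    """u_1(x, y) = x^T U1 y and u_2(x, y) = x^T U2 y for mixed strategies x, y."""
    return x @ U1 @ y, x @ U2 @ y

# Player 1 mixes uniformly; Player 2 always plays the first action (rock).
x = np.array([1/3, 1/3, 1/3])
y = np.array([1.0, 0.0, 0.0])
print(expected_utilities(x, y))  # both expected utilities are 0 here
```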
Nash equilibrium. The Nash equilibrium is a central concept in game theory. It is a strategy
profile in which no player has an incentive to deviate from their strategy.
Definition 2.2 (Nash equilibrium in two-player games). A strategy profile (𝑥, 𝑦) is a Nash equi-
librium in a two-player game if
$$x^\top U_1 y = \max_{\hat{x} \in \Delta(A_1)} \hat{x}^\top U_1 y \quad \text{(Player 1 is playing optimally against Player 2); and}$$
$$x^\top U_2 y = \max_{\hat{y} \in \Delta(A_2)} x^\top U_2 \hat{y} \quad \text{(Player 2 is playing optimally against Player 1).}$$
A strategy profile (𝑥, 𝑦) is an 𝜀-approximate Nash equilibrium (or simply 𝜀-Nash equilibrium) if
$$\max_{\hat{x} \in \Delta(A_1)} \{\hat{x}^\top U_1 y\} - x^\top U_1 y \le \varepsilon \quad\text{and}\quad \max_{\hat{y} \in \Delta(A_2)} \{x^\top U_2 \hat{y}\} - x^\top U_2 y \le \varepsilon.$$
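The two left-hand sides above are easy to evaluate: against a fixed opponent strategy, a best response over the simplex can always be taken to be a pure action. The sketch below (my own; the name nash_gap is not from the notes) computes the two deviation gains, so a profile is an 𝜀-Nash equilibrium for 𝜀 equal to their maximum.

```python
import numpy as np

def nash_gap(U1, U2, x, y):
    """Deviation gains (eps1, eps2) of a profile (x, y) in a two-player game.

    A best response to a fixed opponent strategy is attained at a pure action,
    so each max over the simplex reduces to a max over a vector of payoffs.
    (x, y) is an eps-Nash equilibrium with eps = max(eps1, eps2).
    """
    eps1 = np.max(U1 @ y) - x @ U1 @ y      # how much Player 1 could gain by deviating
    eps2 = np.max(U2.T @ x) - x @ U2 @ y    # how much Player 2 could gain by deviating
    return eps1, eps2
```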
In two-player zero-sum games, where by definition 𝑈2 = −𝑈1,
$$x^\top U_2 y = \max_{\hat{y} \in \Delta(A_2)} x^\top U_2 \hat{y} \quad\text{is equivalent to}\quad x^\top U_1 y = \min_{\hat{y} \in \Delta(A_2)} x^\top U_1 \hat{y}.$$
Hence, Player 2 is selecting any strategy that minimizes the expected utility of Player 1. No matter the strategy 𝑥 selected by Player 1, Player 1's expected utility will therefore be min_{𝑦∈Δ(𝐴2)} 𝑥⊤𝑈1𝑦.
Since Player 1 is playing optimally, their strategy should be a solution to
$$\max_{x \in \Delta(A_1)} \left( \min_{y \in \Delta(A_2)} x^\top U_1 y \right).$$
Hence, a Nash equilibrium in a two-player zero-sum game is the best possible strategy 𝑥 against a
maximally strong adversary. In this light, it is perhaps not surprising that in two-player zero-sum games, Nash
equilibrium is the preferred solution concept for designing AI agents that play against top professionals
in go, chess, poker, et cetera.
2.2 Online optimization in games
Conceptually, players in games would like to compute randomized strategies that maximize their
expected utility given what the other players are playing. However, the simultaneous learning of all agents makes
the problem inherently nonstationary. This is where online optimization saves the day.
The fundamental idea is to simply apply online optimization from the point of view of each player. At each
time 𝑡, each player selects a randomized strategy 𝑥𝑡 ∈ Δ(𝐴1), 𝑦𝑡 ∈ Δ(𝐴2). Based on the selected
strategies, the loss functions that the players receive are then set to the opposite of their expected
utility:
$$f_t : \Delta(A_1) \to \mathbb{R}, \quad f_t(x) := -x^\top U_1 y_t = (-U_1 y_t)^\top x,$$
$$g_t : \Delta(A_2) \to \mathbb{R}, \quad g_t(y) := -x_t^\top U_2 y = (-U_2^\top x_t)^\top y.$$
This process, called self-play, is repeated for a number of iterations.
[Figure: at each time 𝑡 = 1, 2, ..., Player 1 plays 𝑥𝑡 and observes the loss function 𝑓𝑡(𝑥), while Player 2 plays 𝑦𝑡 and observes 𝑔𝑡(𝑦).]
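In code, a round of self-play only requires forming the two loss vectors −𝑈1𝑦𝑡 and −𝑈2⊤𝑥𝑡 and handing them to some no-regret learner. The sketch below (my own illustration) uses multiplicative weights purely as a placeholder learner; concrete no-regret algorithms for the simplex are the subject of the upcoming lectures.

```python
import numpy as np

def multiplicative_weights(weights, loss_vector, eta=0.1):
    """One step of a standard no-regret learner on the probability simplex."""
    weights = weights * np.exp(-eta * loss_vector)
    return weights / weights.sum()

def self_play(U1, U2, T=10_000, eta=0.1):
    """Run T rounds of self-play and return the average strategies."""
    n1, n2 = U1.shape
    x = np.full(n1, 1 / n1)   # Player 1's current mixed strategy
    y = np.full(n2, 1 / n2)   # Player 2's current mixed strategy
    x_sum, y_sum = np.zeros(n1), np.zeros(n2)
    for _ in range(T):
        x_sum += x
        y_sum += y
        loss_x = -U1 @ y        # gradient of f_t(x) = (-U1 y_t)^T x
        loss_y = -U2.T @ x      # gradient of g_t(y) = (-U2^T x_t)^T y
        x = multiplicative_weights(x, loss_x, eta)
        y = multiplicative_weights(y, loss_y, eta)
    return x_sum / T, y_sum / T
```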
As it turns out, there is a deep connection between the regret accumulated by the players in self-
play and equilibrium approximation in the game. For two-player zero-sum games in particular, the
following strong result holds.
Theorem 2.1. Let
$$\mathrm{Reg}^{(1)}_T := \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \Delta(A_1)} \sum_{t=1}^{T} f_t(x), \qquad \mathrm{Reg}^{(2)}_T := \sum_{t=1}^{T} g_t(y_t) - \min_{y \in \Delta(A_2)} \sum_{t=1}^{T} g_t(y)$$
be the regrets accumulated by Players 1 and 2, respectively, after playing 𝑇 rounds of the game
in self-play. Then, the average strategy profile
$$(\bar{x}_T, \bar{y}_T) := \left( \frac{1}{T} \sum_{t=1}^{T} x_t, \; \frac{1}{T} \sum_{t=1}^{T} y_t \right) \in \Delta(A_1) \times \Delta(A_2)$$
is an 𝜀-Nash equilibrium of the game, where
$$\varepsilon := \frac{\mathrm{Reg}^{(1)}_T + \mathrm{Reg}^{(2)}_T}{T}.$$
Proof. By definition of 𝑓𝑡 and 𝑔𝑡, we have
$$\mathrm{Reg}^{(1)}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \Delta(A_1)} \sum_{t=1}^{T} f_t(x) = -\sum_{t=1}^{T} x_t^\top U_1 y_t - \min_{x \in \Delta(A_1)} \left\{ -\sum_{t=1}^{T} x^\top U_1 y_t \right\},$$
$$\mathrm{Reg}^{(2)}_T = \sum_{t=1}^{T} g_t(y_t) - \min_{y \in \Delta(A_2)} \sum_{t=1}^{T} g_t(y) = +\sum_{t=1}^{T} x_t^\top U_1 y_t - \min_{y \in \Delta(A_2)} \left\{ +\sum_{t=1}^{T} x_t^\top U_1 y \right\}.$$
Hence, by summing the two expressions and dividing by 𝑇, we get
$$\frac{\mathrm{Reg}^{(1)}_T + \mathrm{Reg}^{(2)}_T}{T} = -\min_{x \in \Delta(A_1)} \left\{ -\frac{1}{T} \sum_{t=1}^{T} x^\top U_1 y_t \right\} - \min_{y \in \Delta(A_2)} \left\{ \frac{1}{T} \sum_{t=1}^{T} x_t^\top U_1 y \right\}$$
$$= -\min_{x \in \Delta(A_1)} \left\{ -x^\top U_1 \bar{y}_T \right\} - \min_{y \in \Delta(A_2)} \left\{ \bar{x}_T^\top U_1 y \right\}$$
$$= \max_{x \in \Delta(A_1)} \left\{ x^\top U_1 \bar{y}_T \right\} - \min_{y \in \Delta(A_2)} \left\{ \bar{x}_T^\top U_1 y \right\}$$
$$= \left( \max_{x \in \Delta(A_1)} \left\{ x^\top U_1 \bar{y}_T \right\} - \bar{x}_T^\top U_1 \bar{y}_T \right) + \left( \bar{x}_T^\top U_1 \bar{y}_T - \min_{y \in \Delta(A_2)} \left\{ \bar{x}_T^\top U_1 y \right\} \right).$$
Each of the two parenthesized terms is nonnegative (the maximum is at least the value at 𝑥̄𝑇, and the minimum is at most the value at 𝑦̄𝑇), so each term is bounded above by their sum. Hence, we must have
$$\max_{\hat{x} \in \Delta(A_1)} \left\{ \hat{x}^\top U_1 \bar{y}_T \right\} - \bar{x}_T^\top U_1 \bar{y}_T \le \frac{\mathrm{Reg}^{(1)}_T + \mathrm{Reg}^{(2)}_T}{T}, \qquad \bar{x}_T^\top U_1 \bar{y}_T - \min_{\hat{y} \in \Delta(A_2)} \left\{ \bar{x}_T^\top U_1 \hat{y} \right\} \le \frac{\mathrm{Reg}^{(1)}_T + \mathrm{Reg}^{(2)}_T}{T},$$
which is the statement (recall that 𝑈2 = −𝑈1, so the second inequality is exactly Player 2's condition in the definition of 𝜀-Nash equilibrium). □
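Theorem 2.1 is easy to check numerically: run self-play, compute the two regrets, and compare (Reg(1)_T + Reg(2)_T)/T with the Nash gap of the average strategies. The following self-contained sketch (my own; multiplicative weights again serves only as a placeholder no-regret learner) performs this check on rock-paper-scissors.

```python
import numpy as np

U1 = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)   # rock-paper-scissors
T, eta = 5000, 0.1
x, y = np.full(3, 1 / 3), np.full(3, 1 / 3)
xs, ys, lx, ly = [], [], [], []

for _ in range(T):
    loss_x, loss_y = -U1 @ y, U1.T @ x     # f_t and g_t as loss vectors (here U2 = -U1)
    xs.append(x); ys.append(y); lx.append(loss_x); ly.append(loss_y)
    x = x * np.exp(-eta * loss_x); x = x / x.sum()   # multiplicative-weights updates
    y = y * np.exp(-eta * loss_y); y = y / y.sum()

xs, ys, lx, ly = map(np.array, (xs, ys, lx, ly))
reg1 = np.einsum("ti,ti->", lx, xs) - lx.sum(axis=0).min()    # Reg^(1)_T
reg2 = np.einsum("ti,ti->", ly, ys) - ly.sum(axis=0).min()    # Reg^(2)_T
x_bar, y_bar = xs.mean(axis=0), ys.mean(axis=0)
gap = max(np.max(U1 @ y_bar) - x_bar @ U1 @ y_bar,            # Player 1's deviation gain
          x_bar @ U1 @ y_bar - np.min(U1.T @ x_bar))          # Player 2's deviation gain
print(f"Nash gap of averages: {gap:.5f}  <=  (Reg1 + Reg2)/T: {(reg1 + reg2) / T:.5f}")
```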
Remark 2.1. The previous theorem relies fundamentally on the game being two-player zero-sum.
In fact, from the PPAD-completeness of Nash equilibrium in more general settings [CDT09;
DGP09], we know that finding Nash equilibria beyond two-player zero-sum games is in general
computationally hard.
Beyond two-player zero-sum games, regret is known to be linked to several notions of correlated
equilibrium, a generalization of Nash equilibrium that allows for coordination between players.
Bibliography
[CDT09] X. Chen, X. Deng, and S.-H. Teng, “Settling the complexity of computing two-player Nash
equilibria,” Journal of the ACM (JACM), vol. 56, no. 3, pp. 1–57, 2009.
[DGP09] C. Daskalakis, P. W. Goldberg, and C. H. Papadimitriou, “The complexity of computing
a Nash equilibrium,” Communications of the ACM, vol. 52, no. 2, pp. 89–97, 2009.