



MIT 6.S890 — Topics in Multiagent Learning (F23) Mon, September 18th 2023
Lecture 4
Learning in games: Foundations
Instructor: Gabriele Farina
With this class we begin to explore what it means to “learn” in a game, and how that “learning”, which is
intrinsically a dynamic and local (per-player) concept, relates to the much more static and global concept of
game-theoretic equilibrium.
1 Hindsight rationality and Φ-regret
What does it mean to “learn” in games? The answer to this question is delicate. The study of learning in
games has historically spanned several subfields. A powerful definition of what “learning in games” means is
through the concept of hindsight rationality.
Take the point of view of one player in a game, and let X be their set of available strategies. In normal-form
games, we have seen that a strategy is just a distribution over the set of available actions S, so X = ∆S . At
each time t = 1, 2, . . . , the player will play some strategy x(t) ∈ X , receive some form of feedback, and will
incorporate that feedback to formulate a “better” strategy x(t+1) ∈ X for the next repetition of the game. A
typical (and natural) choice of “feedback” is just the utility of the player, given what all the other agents
played.
Now suppose that the game is played infinitely many times, and that, looking back at what the player did, we
realize that every single time the player played a certain strategy x, they would have been strictly better off by
consistently playing a different strategy x′ instead. Can we really say that the player has “learnt” how to play?
Perhaps not.
That leads to the idea of hindsight rationality:
Definition 1.1 (Hindsight rationality, informal). The player has “learnt” to play the game when, looking
back at the history of play, they cannot think of any transformation φ : X → X of their strategies that,
when applied to the whole history of play, would have given strictly better utility to the player.
This leads to the following definition.
Definition 1.2 (Φ-regret minimizer). Given a set X of points and a set Φ of linear transformations
φ : X → X , a Φ-regret minimizer for the set X is a model for a decision maker that repeatedly interacts
with a black-box environment. At each time t, the regret minimizer interacts with the environment
through two operations:
• NextStrategy has the effect that the regret minimizer will output an element x^{(t)} ∈ X;
• ObserveUtility(u^{(t)}) provides the environment’s feedback to the regret minimizer, in the form of a
linear utility function u^{(t)} : X → ℝ that evaluates how good the last-output point x^{(t)} was. The
utility function can depend adversarially on the outputs x^{(1)}, . . . , x^{(t)} if the regret minimizer is
deterministic (i.e., does not use randomness internally^a).
Its quality metric is its cumulative Φ-regret, defined as the quantity
\[
R^{(T)}_\Phi := \max_{\hat\phi \in \Phi} \left\{ \sum_{t=1}^{T} \Big( u^{(t)}\big(\hat\phi(x^{(t)})\big) - u^{(t)}\big(x^{(t)}\big) \Big) \right\}. \tag{1}
\]
The goal for a Φ-regret minimizer is to guarantee that its Φ-regret grows asymptotically sublinearly as
time T increases.
^a When randomness is involved, the utility function cannot depend adversarially on x^{(t)}, or guaranteeing sublinear regret
would be impossible. Rather, u^{(t)} must be conditionally independent of x^{(t)}, given all past random outcomes.
Calls to NextStrategy and ObserveUtility keep alternating: first, the regret minimizer outputs a point x^{(1)}, then it receives feedback u^{(1)} from the environment, then it outputs a new point
x^{(2)}, and so on. The decision making encoded by the regret minimizer is online, in the sense that at each
time t, the output of the regret minimizer can depend on the prior outputs x^{(1)}, . . . , x^{(t−1)} and corresponding
observed utility functions u^{(1)}, . . . , u^{(t−1)}, but no information about future utilities is available.
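To make the interaction protocol concrete, here is a minimal sketch of the interface a regret minimizer exposes, written in Python with hypothetical class and method names mirroring NextStrategy and ObserveUtility; it only fixes the protocol, not any specific algorithm.

```python
from abc import ABC, abstractmethod

class RegretMinimizer(ABC):
    """Abstract interface mirroring the two operations NextStrategy / ObserveUtility."""

    @abstractmethod
    def next_strategy(self):
        """Return the next point x^(t) in the set X."""

    @abstractmethod
    def observe_utility(self, u):
        """Receive the (linear) utility feedback u^(t) for the last-output point."""

# The interaction protocol alternates the two calls:
#     x1 = R.next_strategy(); R.observe_utility(u1)
#     x2 = R.next_strategy(); R.observe_utility(u2)
#     ...
```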
1.1 Some notable choices for the set of transformations Φ considered
The size of the set of transformations Φ considered by the player defines a natural notion of how “rational”
the agent is. There are several choices of interest for Φ for a normal-form strategy space X = ∆S .
1. Φ = set of all stochastic matrices, mapping ∆_S → ∆_S. This notion of Φ-regret is known under the name
swap regret.
2. Φ = set of all “probability mass transports” on X, defined as Φ = {φ_{a→b}}_{a,b ∈ S}, where
\[
\varphi_{a\to b}(x)[s] := \begin{cases} 0 & \text{if } s = a \quad \text{(remove mass from } a\text{...)}\\ x[b] + x[a] & \text{if } s = b \quad \text{(... and give it to } b\text{)}\\ x[s] & \text{otherwise.} \end{cases}
\]
This is known as internal regret. (A small code sketch after this list illustrates these two types of deviations.)
Theorem 1.1 (Informal, formal version in Appendix B). When all agents in a multiplayer general-
sum normal-form game play so that their internal or swap regret grows sublinearly, their average
correlated distribution of play converges to the set of correlated equilibria of the game.
3. Sneak peek: in sequential games, the above concept extends to Φ = a particular set of linear transformations
called trigger deviation functions. It is known that in this case the Φ-regret can be efficiently
bounded with a polynomial dependence on the size of the game tree. The reason why this choice of
deviation functions is important is given by the following fact.
Theorem 1.2 (Informal). When all agents in a multiplayer general-sum extensive-form game play so
that their Φ-regret relative to trigger deviation functions grows sublinearly, their average correlated
distribution of play converges to the set of extensive-form correlated equilibria of the game.
4. Φ = constant transformations. In this case, we are only requiring that the player not regret substituting
all of the strategies they played with the same strategy x̂ ∈ X. Φ-regret according to this set of
transformations Φ is usually called external regret, or more simply just regret. While this seems like an
extremely restricted notion of rationality, it actually turns out to be extremely powerful. We
will spend the rest of this class seeing why.
Theorem 1.3 (Informal). When all agents in a multiplayer general-sum normal-form game play so
that their external regret grows sublinearly, their average correlated distribution of play converges
to the set of coarse correlated equilibria of the game.
Theorem 1.4 (Informal). When all agents in a two-player zero-sum normal-form game play so that
their external regret grows sublinearly, their average strategies converge to the set of Nash equilibria
of the game.
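As promised in item 2 above, here is a small sketch (in Python, with a made-up strategy and matrix) of how the first two transformation sets act on a normal-form strategy x ∈ ∆_S: a (column-)stochastic matrix implements a swap deviation, and φ_{a→b} moves all probability mass from action a to action b.

```python
import numpy as np

x = np.array([0.5, 0.3, 0.2])          # a mixed strategy over S = {0, 1, 2}

# Swap deviation: a column-stochastic matrix M maps the simplex into itself via x -> M x.
M = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.5],
              [0.0, 0.0, 0.5]])        # every column sums to 1
print(M @ x)                           # [0.3, 0.6, 0.1], again a point of the simplex

# Internal deviation phi_{a->b}: move all probability mass from action a to action b.
def internal_deviation(x, a, b):
    y = x.copy()
    y[b] += y[a]
    y[a] = 0.0
    return y

print(internal_deviation(x, a=0, b=2))  # [0.0, 0.3, 0.7]
```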
1.2 A very important special case: regret minimization
The special case where Φ is chosen to be the set of constant transformations is so important that it warrants
its own special definition and notation.
Definition 1.3 (Regret minimizer). Let X be a set. An external regret minimizer for X—or simply “regret
minimizer for X”—is a Φ_const-regret minimizer for the special set of constant transformations
\[
\Phi_{\text{const}} := \{\phi_{\hat x} : x \mapsto \hat x\}_{\hat x \in \mathcal{X}}.
\]
Its corresponding Φ_const-regret is called “external regret” or simply “regret”, and it is indicated with the
symbol
\[
R^{(T)} := \max_{\hat x \in \mathcal{X}} \left\{ \sum_{t=1}^{T} \Big( u^{(t)}(\hat x) - u^{(t)}(x^{(t)}) \Big) \right\}. \tag{2}
\]
Once again, the goal for a regret minimizer is to have its cumulative regret R^{(T)} grow sublinearly in T.
An important result in the subfield of online linear optimization asserts the existence of algorithms that
guarantee sublinear regret for any convex and compact domain X, typically of the order R^{(T)} = O(√T)
asymptotically.
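As one concrete instance of such an algorithm for the simplex X = ∆_S, the sketch below implements regret matching, a standard choice that guarantees R^{(T)} = O(√T); the lecture itself does not commit to any particular algorithm, so treat this as an illustrative example.

```python
import numpy as np

class RegretMatching:
    """Regret matching on the simplex Delta_S; its external regret grows as O(sqrt(T))."""

    def __init__(self, n_actions):
        self.n = n_actions
        self.cum_regret = np.zeros(n_actions)    # cumulative regret against each pure action
        self.last = np.full(n_actions, 1.0 / n_actions)

    def next_strategy(self):
        pos = np.maximum(self.cum_regret, 0.0)
        self.last = pos / pos.sum() if pos.sum() > 0 else np.full(self.n, 1.0 / self.n)
        return self.last

    def observe_utility(self, u):
        # u is the vector of per-action utilities, so the linear utility function is x -> u @ x.
        self.cum_regret += u - u @ self.last
```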
As it turns out, external regret minimization alone is enough to guarantee convergence to Nash equilibrium
in two-player zero-sum games, to coarse correlated equilibrium in multiplayer general-sum games, to best
responses to static stochastic opponents in multiplayer general-sum games, and much more. Before we delve
into those aspects, however, we first show another important property of regret minimization: general Φ-regret
minimization can be reduced to it, in a precise sense.
1.3 From regret minimization to Φ-regret minimization
As we have seen, regret minimization is a very narrow instantiation of Φ-regret minimization—perhaps
the smallest sensible instantiation. Then, clearly, the problem of coming up with a regret minimizer for
a set X cannot be harder than the problem of coming up with a Φ-regret minimizer for X for richer sets
of transformation functions Φ. It might then seem surprising that there exists a construction that reduces
Φ-regret minimization to regret minimization.
More precisely, a result by Gordon et al. [2008] gives a way to construct a Φ-regret minimizer for X starting
from any regret minimizer for the set of functions Φ. The result goes as follows.
Theorem 1.5 (Gordon et al. [2008]). Let R be a deterministic regret minimizer for the set of transformations
Φ whose (external) cumulative regret R^{(T)} grows sublinearly in T, and assume that every φ ∈ Φ admits a
fixed point x = φ(x) ∈ X. Then, a Φ-regret minimizer R_Φ can be constructed starting from R as follows:
• Each call to R_Φ.NextStrategy first calls NextStrategy on R to obtain the next transformation φ^{(t)}.
Then, a fixed point x^{(t)} = φ^{(t)}(x^{(t)}) is computed and output.
• Each call to R_Φ.ObserveUtility(u^{(t)}) with linear utility function u^{(t)} constructs the linear utility
function L^{(t)} : φ ↦ u^{(t)}(φ(x^{(t)})), where x^{(t)} is the last-output strategy, and passes it to R by calling
R.ObserveUtility(L^{(t)}).^b
Furthermore, the Φ-regret R^{(T)}_Φ cumulated up to time T by the R_Φ we have just defined is exactly equal to
the (external) cumulative regret R^{(T)} cumulated by R:
\[
R^{(T)}_\Phi = R^{(T)} \qquad \forall\, T = 1, 2, \ldots.
\]
So, because the regret cumulated by R grows sublinearly by hypothesis, then so does the Φ-regret
cumulated by R_Φ.
^b On the surface, it might look like L^{(t)} is independent of the last output φ^{(t)} of the regret minimizer R, and thus that
it trivially satisfies the requirements of Definition 1.3. However, that is not true: x^{(t)} is a fixed point of φ^{(t)}, and since x^{(t)}
enters into the definition of L^{(t)}, if R picks φ^{(t)} randomly, it might very well be that L^{(t)} is not conditionally independent of φ^{(t)}.
We sidestep this issue by requiring that R is deterministic (cf. Footnote a).
Proof. The proof of correctness of the above construction is deceptively simple. Since R outputs transformations
φ^{(1)}, φ^{(2)}, · · · ∈ Φ and receives utilities φ ↦ u^{(1)}(φ(x^{(1)})), φ ↦ u^{(2)}(φ(x^{(2)})), . . . , its cumulative
regret R^{(T)} is by definition
\[
R^{(T)} = \max_{\hat\phi \in \Phi} \left\{ \sum_{t=1}^{T} \Big( u^{(t)}\big(\hat\phi(x^{(t)})\big) - u^{(t)}\big(\phi^{(t)}(x^{(t)})\big) \Big) \right\}.
\]
Now, since by construction x^{(t)} is a fixed point of φ^{(t)}, we have φ^{(t)}(x^{(t)}) = x^{(t)}, and therefore we can write
\[
R^{(T)} = \max_{\hat\phi \in \Phi} \left\{ \sum_{t=1}^{T} \Big( u^{(t)}\big(\hat\phi(x^{(t)})\big) - u^{(t)}\big(x^{(t)}\big) \Big) \right\}, \tag{3}
\]
where the right-hand side is exactly the cumulative Φ-regret R^{(T)}_Φ incurred by R_Φ, as defined in (1).
2 Applications of regret minimization
In order to establish regret minimization as a meaningful abstraction for learning in games, we must check
that regret minimizing and Φ-regret minimizing dynamics indeed lead to “interesting” or expected behavior in
common situations.
2.1 Learning a best response against stochastic opponents
As a first smoke test, let’s verify that over time a regret minimizer would learn how to best respond to static,
stochastic opponents. Specifically, consider this scenario. We are playing a repeated n-player general-sum game
with multilinear utilities (this captures normal-form and extensive-form games alike), where Players
i = 1, . . . , n − 1 play stochastically, that is, at each t they independently sample a strategy x^{(t)}_i ∈ X_i from the
same fixed distribution (which is unknown to any other player). Formally, this means that
\[
\mathbb{E}\big[x^{(t)}_i\big] = \bar x_i \qquad \forall\, i = 1, \ldots, n-1, \;\; t = 1, 2, \ldots.
\]
Player n, on the other hand, is learning in the game, picking strategies according to some algorithm that
guarantees sublinear external regret, where the feedback observed by Player n at each time t is their own
linear utility function:
\[
u^{(t)} : \mathcal{X}_n \ni x_n \mapsto u_n\big(x^{(t)}_1, \ldots, x^{(t)}_{n-1}, x_n\big).
\]
Then, the average of the strategies played by Player n converges almost surely to a best response to
\bar x_1, \ldots, \bar x_{n-1}, that is,
\[
\frac{1}{T} \sum_{t=1}^{T} x^{(t)}_n \;\xrightarrow{\text{a.s.}}\; \arg\max_{\hat x_n \in \mathcal{X}_n} \Big\{ u_n\big(\bar x_1, \ldots, \bar x_{n-1}, \hat x_n\big) \Big\}.
\]
(You should try to prove this!)
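For a quick numerical sanity check of this claim (not a proof), one can let a regret-matching learner face an opponent who samples from a fixed mixed strategy; the payoff matrix below is a hypothetical example, and RegretMatching is the sketch from Section 1.2.

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical two-player game: U[s_learner, s_opponent] is the learner's payoff.
U = np.array([[1.0, 0.0, 3.0],
              [2.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])
opp_mix = np.array([0.2, 0.5, 0.3])      # fixed distribution the opponent samples from

learner = RegretMatching(n_actions=3)    # the sketch from Section 1.2
avg, T = np.zeros(3), 20_000
for t in range(T):
    x = learner.next_strategy()
    s_opp = rng.choice(3, p=opp_mix)         # opponent samples independently each round
    learner.observe_utility(U[:, s_opp])     # feedback: utility of each of the learner's actions
    avg += x / T

print("average strategy:", avg)                          # concentrates on the best response...
print("best response index:", np.argmax(U @ opp_mix))    # ...which here is action 2
```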
2.2 Self-play convergence to bilinear saddle points (such as a Nash equilibrium in a
two-player zero-sum game)
It turns out that regret minimization can be used to converge to bilinear saddle points, that is, solutions to
problems of the form
\[
\max_{x \in \mathcal{X}} \; \min_{y \in \mathcal{Y}} \; x^\top A y, \tag{4}
\]
where X and Y are convex compact sets and A is a matrix. These types of optimization problems are pervasive
in game theory. The canonical prototype of a bilinear saddle-point problem is the computation of Nash equilibria
in two-player zero-sum games (either normal-form or extensive-form). There, a Nash equilibrium is the solution
to (4) where X and Y are the strategy spaces of Player 1 and Player 2 respectively (probability simplexes for
normal-form games or sequence-form polytopes for extensive-form games), and A is the payoff matrix for
Player 1. Other examples include social-welfare-maximizing correlated equilibria and optimal strategies in
two-team zero-sum adversarial team games.
The idea behind using regret minimization to converge to bilinear saddle points is to use self-play.
We instantiate two regret minimization algorithms, R_X and R_Y, for the domains of the maximization and
minimization problems, respectively. At each time t the two regret minimizers output strategies x^{(t)} and y^{(t)},
respectively. Then, they receive feedback u^{(t)}_X, u^{(t)}_Y defined as
\[
u^{(t)}_{\mathcal{X}} : x \mapsto (A y^{(t)})^\top x, \qquad u^{(t)}_{\mathcal{Y}} : y \mapsto -(A^\top x^{(t)})^\top y.
\]
We summarize the process pictorially in Figure 1.
[Figure: the two regret minimizers R_X and R_Y repeatedly output strategies x^{(t)}, y^{(t)} and receive the constructed utility functions u^{(t)}_X, u^{(t)}_Y.]
Figure 1: The flow of strategies and utilities in regret minimization for games.
A well-known folk theorem establishes that the pair of average strategies produced by the regret minimizers
up to any time T converges to a saddle point of (4), where convergence is measured via the saddle point gap
\[
0 \le \gamma(x, y) := \Big( \max_{\hat x \in \mathcal{X}} \{ \hat x^\top A y \} - x^\top A y \Big) + \Big( x^\top A y - \min_{\hat y \in \mathcal{Y}} \{ x^\top A \hat y \} \Big) = \max_{\hat x \in \mathcal{X}} \{ \hat x^\top A y \} - \min_{\hat y \in \mathcal{Y}} \{ x^\top A \hat y \}.
\]
A point (x, y) ∈ X × Y has zero saddle point gap if and only if it is a solution to (4).
Theorem 2.1. Consider the self-play setup summarized in Figure 1, where R_X and R_Y are regret
minimizers for the sets X and Y, respectively. Let R^{(T)}_X and R^{(T)}_Y be the (sublinear) regret cumulated
by R_X and R_Y, respectively, up to time T, and let \bar x^{(T)} and \bar y^{(T)} denote the average of the strategies
produced up to time T. Then, the saddle point gap γ(\bar x^{(T)}, \bar y^{(T)}) of (\bar x^{(T)}, \bar y^{(T)}) satisfies
\[
\gamma(\bar x^{(T)}, \bar y^{(T)}) \;\le\; \frac{R^{(T)}_{\mathcal{X}} + R^{(T)}_{\mathcal{Y}}}{T} \;\to\; 0 \quad \text{as } T \to \infty.
\]
Proof. By definition of regret,
\[
\frac{R^{(T)}_{\mathcal{X}} + R^{(T)}_{\mathcal{Y}}}{T} = \frac{1}{T}\max_{\hat x \in \mathcal{X}} \left\{ \sum_{t=1}^{T} u^{(t)}_{\mathcal{X}}(\hat x) \right\} - \frac{1}{T}\sum_{t=1}^{T} u^{(t)}_{\mathcal{X}}(x^{(t)}) + \frac{1}{T}\max_{\hat y \in \mathcal{Y}} \left\{ \sum_{t=1}^{T} u^{(t)}_{\mathcal{Y}}(\hat y) \right\} - \frac{1}{T}\sum_{t=1}^{T} u^{(t)}_{\mathcal{Y}}(y^{(t)})
\]
\[
= \frac{1}{T}\max_{\hat x \in \mathcal{X}} \left\{ \sum_{t=1}^{T} u^{(t)}_{\mathcal{X}}(\hat x) \right\} + \frac{1}{T}\max_{\hat y \in \mathcal{Y}} \left\{ \sum_{t=1}^{T} u^{(t)}_{\mathcal{Y}}(\hat y) \right\} \qquad \text{(since } u^{(t)}_{\mathcal{X}}(x^{(t)}) + u^{(t)}_{\mathcal{Y}}(y^{(t)}) = 0\text{)}
\]
\[
= \frac{1}{T}\max_{\hat x \in \mathcal{X}} \left\{ \sum_{t=1}^{T} \hat x^\top A y^{(t)} \right\} + \frac{1}{T}\max_{\hat y \in \mathcal{Y}} \left\{ -\sum_{t=1}^{T} (x^{(t)})^\top A \hat y \right\}
\]
\[
= \max_{\hat x \in \mathcal{X}} \big\{ \hat x^\top A \bar y^{(T)} \big\} - \min_{\hat y \in \mathcal{Y}} \big\{ (\bar x^{(T)})^\top A \hat y \big\} = \gamma(\bar x^{(T)}, \bar y^{(T)}).
\]
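A minimal self-play sketch matching Figure 1, using the RegretMatching class sketched in Section 1.2 and a small hypothetical payoff matrix A, tracks the saddle-point gap of the average strategies and illustrates Theorem 2.1 numerically.

```python
import numpy as np

# Small hypothetical zero-sum game: A[i, j] is the payoff to the x-player.
A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  1.0,  3.0],
              [ 0.0,  2.0, -2.0]])

rx, ry = RegretMatching(3), RegretMatching(3)   # the sketch from Section 1.2
x_sum, y_sum, T = np.zeros(3), np.zeros(3), 10_000

for t in range(T):
    x, y = rx.next_strategy(), ry.next_strategy()
    rx.observe_utility(A @ y)        # u_X^(t): x -> x^T A y^(t)
    ry.observe_utility(-A.T @ x)     # u_Y^(t): y -> -(x^(t))^T A y
    x_sum += x
    y_sum += y

x_bar, y_bar = x_sum / T, y_sum / T
gap = np.max(A @ y_bar) - np.min(A.T @ x_bar)    # saddle-point gap of the average strategies
print("saddle-point gap:", gap)                  # decreases toward 0, as Theorem 2.1 predicts
```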
A Appendix: Recap of notation from past classes
Definition A.1. A finite n-player normal-form game is described by:
• a set of actions (aka. pure strategies) for each player i, denoted S_i;
• a utility/payoff function for each player i: u_i : ×_j S_j → ℝ (Remark: this can be thought of as a tensor).
Definition A.2. A randomized/mixed strategy for player i is any x_i ∈ ∆_{S_i}.
Definition A.3. Player i’s expected utility is
\[
u_i(x_1, \ldots, x_n) = \mathbb{E}_{\substack{s_1 \sim x_1 \\ \ldots \\ s_n \sim x_n}} \big[ u_i(s_1, \ldots, s_n) \big] = \sum_{s_1 \in S_1} \cdots \sum_{s_n \in S_n} x_1[s_1] \cdots x_n[s_n] \cdot u_i(s_1, \ldots, s_n).
\]
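Since the expected utility is a multilinear form in the mixed strategies, it can be evaluated as a tensor contraction; the following sketch (with a made-up 3-player utility tensor) is one way to do it.

```python
import numpy as np

# Made-up 3-player example: player i's utility tensor, indexed by (s_1, s_2, s_3).
u_i = np.arange(2 * 3 * 2, dtype=float).reshape(2, 3, 2)

x1 = np.array([0.5, 0.5])            # mixed strategies of the three players
x2 = np.array([0.2, 0.3, 0.5])
x3 = np.array([0.9, 0.1])

# Sum over all pure profiles of x1[s1] * x2[s2] * x3[s3] * u_i(s1, s2, s3).
expected_utility = np.einsum('abc,a,b,c->', u_i, x1, x2, x3)
print(expected_utility)
```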
Definition A.4 (Correlated equilibrium). A correlated equilibrium is a joint distribution D over S_1 × · · · × S_n
such that for every player i, and every pair of pure strategies a, b ∈ S_i such that s_i = a is sampled with nonzero
probability,
\[
\mathbb{E}_{s_{-i} \sim D(\cdot \mid s_i = a)} \big[ u_i(a, s_{-i}) \big] \;\ge\; \mathbb{E}_{s_{-i} \sim D(\cdot \mid s_i = a)} \big[ u_i(b, s_{-i}) \big],
\]
or equivalently (by multiplying by ℙ_D[s_i = a]),
\[
\mathbb{E}_{s \sim D} \Big[ \big( u_i(b, s_{-i}) - u_i(a, s_{-i}) \big) \cdot \mathbb{1}[s_i = a] \Big] \le 0.
\]
Definition A.5 (Approximate correlated equilibrium). D is an ε-approximate correlated equilibrium if, for
every player i and every pair of pure strategies a, b ∈ S_i,
\[
\mathbb{E}_{s \sim D} \Big[ \big( u_i(b, s_{-i}) - u_i(a, s_{-i}) \big) \cdot \mathbb{1}[s_i = a] \Big] \le \varepsilon.
\]
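Given explicit access to the joint distribution, the condition in Definitions A.4 and A.5 is a finite set of linear inequalities and can be checked directly; the sketch below does so for one player of a hypothetical two-player game (player 1’s check would use its own utility matrix analogously).

```python
import numpy as np

# Hypothetical two-player example: player 0's utility matrix u0[s0, s1] and a joint distribution D.
u0 = np.array([[6.0, 2.0],
               [7.0, 0.0]])
D = np.array([[1/3, 1/3],
              [1/3, 0.0]])

def deviation_benefit(u, D, player, a, b):
    """E_{s~D}[(u(b, s_-i) - u(a, s_-i)) * 1[s_i = a]] for the given player and deviation a -> b."""
    if player == 0:
        return np.sum(D[a, :] * (u[b, :] - u[a, :]))
    return np.sum(D[:, a] * (u[:, b] - u[:, a]))   # player 1: condition on its own action (a column)

# D is an eps-approximate correlated equilibrium if, for every player, all such benefits are <= eps.
eps0 = max(deviation_benefit(u0, D, 0, a, b) for a in range(2) for b in range(2))
print("largest deviation benefit for player 0:", eps0)
```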
B Appendix: Proof of Theorem 1.1
Theorem B.1 (Formal version of Theorem 1.1). Let x^{(t)}_1, . . . , x^{(t)}_n be the strategies played by the players at
any time t, and let IntReg^{(t)}_i denote the internal regret incurred by Player i up to time t. Consider now
the average correlated distribution of play up to any time T, that is, the distribution D^{(T)} that selects a
time \bar t uniformly at random from the set {1, . . . , T}, and then selects actions (s_1, . . . , s_n) independently
according to the strategies x^{(\bar t)}_i. This distribution is an ε^{(T)}-approximate correlated equilibrium, where
\[
\varepsilon^{(T)} := \max_{i \in \{1, \ldots, n\}} \frac{\mathrm{IntReg}^{(T)}_i}{T}.
\]
Proof. Pick any player i and strategies a, b ∈ S_i. The maximum benefit that player i can obtain by deviating
from action a to action b is given by
\[
\mathbb{E}_{s \sim D^{(T)}} \Big[ \big( u_i(b, s_{-i}) - u_i(a, s_{-i}) \big) \cdot \mathbb{1}[s_i = a] \Big].
\]
Expanding the specific structure of D^{(T)}, we can decompose the expectation as
\[
\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{s \sim x^{(t)}_1 \otimes \cdots \otimes x^{(t)}_n} \Big[ \big( u_i(b, s_{-i}) - u_i(a, s_{-i}) \big) \cdot \mathbb{1}[s_i = a] \Big]
= \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{s_i \sim x^{(t)}_i} \, \mathbb{E}_{s_{-i} \sim \otimes_{j \ne i} x^{(t)}_j} \Big[ \big( u_i(b, s_{-i}) - u_i(a, s_{-i}) \big) \cdot \mathbb{1}[s_i = a] \Big]
\]
\[
= \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{s_i \sim x^{(t)}_i} \Big[ \big( u_i(b, x^{(t)}_{-i}) - u_i(a, x^{(t)}_{-i}) \big) \cdot \mathbb{1}[s_i = a] \Big]
= \frac{1}{T} \sum_{t=1}^{T} \Big( \big( u_i(b, x^{(t)}_{-i}) - u_i(a, x^{(t)}_{-i}) \big) \cdot x^{(t)}_i[a] \Big).
\]
At this point, the proof is concluded if we can show that the sum over t appearing on the right-hand side is upper
bounded by IntReg^{(T)}_i. This is easy by considering the deviation φ_{a→b} defined above:
\[
\varphi_{a\to b}(x)[s] := \begin{cases} 0 & \text{if } s = a \quad \text{(remove mass from } a\text{...)}\\ x[b] + x[a] & \text{if } s = b \quad \text{(... and give it to } b\text{)}\\ x[s] & \text{otherwise.} \end{cases}
\]
In particular, by definition of internal regret,
\[
\mathrm{IntReg}^{(T)}_i \;\ge\; \sum_{t=1}^{T} \Big( u_i\big(\varphi_{a\to b}(x^{(t)}_i), x^{(t)}_{-i}\big) - u_i\big(x^{(t)}_i, x^{(t)}_{-i}\big) \Big)
= \sum_{t=1}^{T} \sum_{s_i \in S_i} \big( \varphi_{a\to b}(x^{(t)}_i)[s_i] - x^{(t)}_i[s_i] \big) \cdot u_i\big(s_i, x^{(t)}_{-i}\big)
\]
\[
= \sum_{t=1}^{T} \Big( -x^{(t)}_i[a] \cdot u_i\big(a, x^{(t)}_{-i}\big) + x^{(t)}_i[a] \cdot u_i\big(b, x^{(t)}_{-i}\big) \Big)
= \sum_{t=1}^{T} \Big( \big( u_i(b, x^{(t)}_{-i}) - u_i(a, x^{(t)}_{-i}) \big) \cdot x^{(t)}_i[a] \Big).
\]
So, the maximum benefit of any deviation of player i is bounded above by \frac{1}{T}\mathrm{IntReg}^{(T)}_i. It follows that the
maximum deviation benefit across all players is at most ε^{(T)} as defined in the statement.
References
Geoffrey J Gordon, Amy Greenwald, and Casey Marks. No-regret learning in convex games. In Proceedings of
the International Conference on Machine Learning (ICML), pages 360–367. ACM, 2008.
