



15-888 Computational Game Solving (Fall 2021) Tue, September 14th 2021
Lecture 3
Hindsight rationality and regret minimization
Instructor: Gabriele Farina
With this class we begin to explore what it means to “learn” from repeated play in a game, and how that dynamic concept relates to the much more static concept of game-theoretic equilibria.
1 Hindsight rationality and Φ-regret
What does it mean to “learn” in games? The answer to this question is delicate. The study of learning in games has historically spanned several subfields. A powerful definition for what “learning in games” means is
through the concept of hindsight rationality.
Take the point of view of one player in a game, and let X be their set of available strategies. At each
time t = 1, 2, . . . , the player will play some strategy xt ∈ X , receive some form of feedback (for example, the
gradient of the player’s utility, or maybe just the utility itself), and will incorporate that feedback to formulate
a “better” strategy xt+1 ∈ X for the next repetition of the game.
Now suppose that the game is played infinitely many times, and that, looking back at what was played by the player, we realize that every single time the player played a certain strategy a, they would have been strictly better off by consistently playing a different strategy b instead. Can we really say that the player has “learnt” how to play?
Perhaps not.
That leads to the idea of hindsight rationality: the player has “learnt” to play the game when, looking back at the history of play, they cannot think of any transformation φ : X → X of their strategies that, when applied to the whole history of play, would have given strictly better utility to the player.
This leads to the following definition.
Definition 1.1 (Φ-regret minimizer). Given a set X of points and a set Φ of linear transformations
φ : X → X , a Φ-regret minimizer for the set X is a model for a decision maker that repeatedly interacts
with a black-box environment. At each time t, the regret minimizer interacts with the environment
through two operations:
• NextStrategy has the effect that the regret minimizer will output an element x^t ∈ X;
• ObserveUtility(ℓ^t) provides the environment’s feedback to the regret minimizer, in the form of a linear utility function ℓ^t : X → ℝ that evaluates how good the last-output point x^t was. The utility function can depend adversarially on the outputs x^1, . . . , x^t if the regret minimizer is deterministic (i.e., does not use randomness internally(a)).
Its quality metric is its cumulative Φ-regret, defined as the quantity

    R^T_\Phi := \max_{\hat{\phi} \in \Phi} \left\{ \sum_{t=1}^{T} \left( \ell^t(\hat{\phi}(x^t)) - \ell^t(x^t) \right) \right\}.    (1)
The goal for a Φ-regret minimizer is to guarantee that its Φ-regret grows asymptotically sublinearly as
time T increases.
(a) When randomness is involved, the utility function cannot depend adversarially on x^t, or guaranteeing sublinear regret would be impossible. Rather, ℓ^t must be conditionally independent of x^t, given all past random outcomes.
Calls to NextStrategy and ObserveUtility keep alternating: first, the regret minimizer will output a point x^1, then it will receive feedback ℓ^1 from the environment, then it will output a new point x^2, and so on. The decision making encoded by the regret minimizer is online, in the sense that at each time t, the output of the regret minimizer can depend on the prior outputs x^1, . . . , x^{t−1} and corresponding observed utility functions ℓ^1, . . . , ℓ^{t−1}, but no information about future utilities is available.
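As a concrete rendering of this protocol, here is a minimal Python sketch of the interface from Definition 1.1. The names RegretMinimizer, next_strategy, and observe_utility are illustrative placeholders, not the API of any particular library.

    # A minimal sketch of the regret-minimizer interface of Definition 1.1.
    from abc import ABC, abstractmethod

    class RegretMinimizer(ABC):
        @abstractmethod
        def next_strategy(self):
            """Return the next point x^t in the set X (NextStrategy)."""

        @abstractmethod
        def observe_utility(self, utility):
            """Receive the linear utility feedback l^t (ObserveUtility)."""

    # The interaction with the environment then alternates the two calls:
    #   x1 = R.next_strategy(); R.observe_utility(l1); x2 = R.next_strategy(); ...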
1.1 Some notable choices for the set of transformations Φ considered
The size of the set of transformations Φ considered by the player defines a natural notion of how “rational”
the agent is. There are several choices of interest for Φ.
1. Φ = set of all mappings from X to X . This is the maximum level of hindsight rationality. The
corresponding notion of Φ-regret in this case is known as swap regret.
2. Φ = set of all “single-point deviations” on X , defined as
\Phi = \{\phi_{ab}\}_{a,b \in \mathcal{X}}, \quad \text{where} \quad \phi_{ab} : x \mapsto \begin{cases} x & \text{if } x \neq a \\ b & \text{if } x = a. \end{cases}

This is known as internal regret. (A sketch of the natural linear extension of these deviations to the probability simplex is given right after this list.)
Fact 1.1. When all agents in a multiplayer general-sum game (normal-form or extensive-form) play
so that their internal or swap regret grows sublinearly, their empirical frequency of play converges
to a correlated equilibrium of the game.
3. Φ = a particular set of linear transformations called trigger deviation functions. It is known that in this
case the Φ-regret can be efficiently bounded with a polynomial dependence on the size of the game tree.
The reason why this choice of deviation functions is important is given by the following fact.
Fact 1.2. When all agents in a multiplayer general-sum extensive-form game play so that their
Φ-regret relative to trigger deviation functions grows sublinearly, their empirical frequency of play
converges to an extensive-form correlated equilibrium of the game.
4. Φ = constant transformations. In this case, we are only requiring that the player not regret substituting all of the strategies they played with one and the same strategy a. The corresponding notion of Φ-regret is known as external regret. While this seems like an extremely restricted notion of rationality, it turns out to already be extremely powerful. We will spend the rest of this class seeing why.
Fact 1.3. When all agents in a multiplayer general-sum game (normal-form or extensive-form) play
so that their external regret grows sublinearly, their empirical frequency of play converges to a
coarse correlated equilibrium of the game.
Fact 1.4. When all agents in a two-player zero-sum game (normal-form or extensive-form) play so
that their external regret grows sublinearly, their average strategies converge to a Nash equilibrium
of the game.
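For concreteness, here is a small Python sketch of the linear extension of the single-point deviations from item 2 to the probability simplex (the natural strategy set in normal-form games): all probability mass placed on action a is rerouted onto action b. The function name is illustrative.

    import numpy as np

    def single_point_deviation(n, a, b):
        """Matrix M of the internal-regret deviation phi_ab on the n-simplex:
        phi_ab(x) = M @ x reroutes the probability of action a onto action b."""
        M = np.eye(n)
        M[a, a] = 0.0
        M[b, a] = 1.0
        return M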
1.2 A very important special case: regret minimization
The special case where Φ is chosen to be the set of constant transformations only is so important that it
warrants its own special definition and notation.
Definition 1.2 (Regret minimizer). Let X be a set. An external regret minimizer for X—or simply “regret minimizer for X”—is a Φ_const-regret minimizer for the special set of constant transformations

    \Phi_{\text{const}} := \{\phi_{\hat{x}} : x \mapsto \hat{x}\}_{\hat{x} \in \mathcal{X}}.
Its corresponding Φconst-regret is called “external regret” or simply “regret”, and it is indicated with the
symbol
    R^T := \max_{\hat{x} \in \mathcal{X}} \left\{ \sum_{t=1}^{T} \left( \ell^t(\hat{x}) - \ell^t(x^t) \right) \right\}.    (2)
Once again, the goal for a regret minimizer is to have its cumulative regret R^T grow sublinearly in T.
An important result in the subfield of online linear optimization asserts the existence of algorithms that guarantee sublinear regret for any convex and compact domain X, typically of the order R^T = O(\sqrt{T}) asymptotically.
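One classical algorithm achieving this guarantee when X is a probability simplex is regret matching, which plays each action with probability proportional to the positive part of its cumulative regret. A minimal Python sketch (the class and method names mirror the hypothetical interface above, not any particular library):

    import numpy as np

    class RegretMatching:
        """Regret matching over the probability simplex with n actions (a sketch)."""
        def __init__(self, n):
            self.n = n
            self.cum_regret = np.zeros(n)          # cumulative regret vs. each pure action
            self.last = np.full(n, 1.0 / n)

        def next_strategy(self):
            pos = np.maximum(self.cum_regret, 0.0)
            self.last = pos / pos.sum() if pos.sum() > 0 else np.full(self.n, 1.0 / self.n)
            return self.last

        def observe_utility(self, u):
            # u[i] = l^t(e_i); by linearity, l^t(x) = <u, x>
            self.cum_regret += u - u @ self.last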
As it turns out, external regret minimization alone is enough to guarantee convergence to Nash equilibrium
in two-player zero-sum games, to coarse correlated equilibrium in multiplayer general-sum games, to best
responses to static stochastic opponents in multiplayer general-sum games, and much more. Before we delve
into those aspects, however, we first show another important property of regret minimization: general Φ-regret
minimization can be reduced to it, in a precise sense.
1.3 From regret minimization to Φ-regret minimization
As we have seen, regret minimization is a very narrow instantiation of Φ-regret minimization—perhaps
the smallest sensible instantiation. Then, clearly, the problem of coming up with a regret minimizer for
a set X cannot be harder than the problem of coming up with a Φ-regret minimizer for X for richer sets
of transformation functions Φ. It might then seem surprising that there exists a construction that reduces
Φ-regret minimization to regret minimization.
More precisely, a result by Gordon et al. [2008] gives a way to construct a Φ-regret minimizer for X starting
from any regret minimizer for the set of functions Φ. The result goes as follows.
Theorem 1.1 (Gordon et al. [2008]). Let R be a deterministic regret minimizer for the set of transformations Φ whose (external) cumulative regret R^T grows sublinearly in T, and assume that every φ ∈ Φ admits a fixed point φ(x) = x ∈ X. Then, a Φ-regret minimizer R_Φ can be constructed starting from R as follows:
• Each call to R_Φ.NextStrategy first calls NextStrategy on R to obtain the next transformation φ^t. Then, a fixed point x^t = φ^t(x^t) is computed and output.
• Each call to R_Φ.ObserveUtility(ℓ^t) with linear utility function ℓ^t constructs the linear utility function L^t : φ ↦ ℓ^t(φ(x^t)), where x^t is the last-output strategy, and passes it to R by calling R.ObserveUtility(L^t).(b)
Furthermore, the Φ-regret R^T_Φ cumulated up to time T by the R_Φ we have just defined is exactly equal to the (external) cumulative regret R^T cumulated by R:

    R^T_\Phi = R^T    for all T = 1, 2, . . . .

So, because the regret cumulated by R grows sublinearly by hypothesis, then so does the Φ-regret cumulated by R_Φ.
(b) On the surface, it might look like L^t is independent of the last output φ^t of the regret minimizer R, and thus that it trivially satisfies the requirements of Definition 1.2. However, that is not true: x^t is a fixed point of φ^t, and since x^t enters into the definition of L^t, if R picks φ^t randomly, it might very well be that L^t is not conditionally independent of φ^t. We sidestep this issue by requiring that R be deterministic (cf. Footnote a).
Proof. The proof of correctness of the above construction is deceptively simple. Since R outputs transformations φ^1, φ^2, · · · ∈ Φ and receives utilities φ ↦ ℓ^1(φ(x^1)), φ ↦ ℓ^2(φ(x^2)), . . . , its cumulative regret R^T is by definition

    R^T = \max_{\hat{\phi} \in \Phi} \left\{ \sum_{t=1}^{T} \left( \ell^t(\hat{\phi}(x^t)) - \ell^t(\phi^t(x^t)) \right) \right\}.

Now, since by construction x^t is a fixed point of φ^t, we have φ^t(x^t) = x^t, and therefore we can write

    R^T = \max_{\hat{\phi} \in \Phi} \left\{ \sum_{t=1}^{T} \left( \ell^t(\hat{\phi}(x^t)) - \ell^t(x^t) \right) \right\},    (3)

where the right-hand side is exactly the cumulative Φ-regret R^T_Φ incurred by R_Φ, as defined in (1).
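The construction in Theorem 1.1 translates almost line by line into code. The sketch below assumes we are given a regret minimizer over the set Φ and a routine fixed_point that returns a fixed point of a transformation; both are hypothetical placeholders, and utility feedback is represented here as a callable ℓ^t.

    class PhiRegretMinimizer:
        """Sketch of the reduction of Theorem 1.1 (Gordon et al. [2008])."""
        def __init__(self, R, fixed_point):
            self.R = R                       # deterministic regret minimizer over the set Phi
            self.fixed_point = fixed_point   # routine returning some x with phi(x) = x
            self.last_x = None

        def next_strategy(self):
            phi_t = self.R.next_strategy()          # next transformation phi^t in Phi
            self.last_x = self.fixed_point(phi_t)   # fixed point x^t = phi^t(x^t)
            return self.last_x

        def observe_utility(self, utility):
            # Build L^t : phi -> l^t(phi(x^t)) and forward it to R.
            x_t = self.last_x
            self.R.observe_utility(lambda phi: utility(phi(x_t)))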
2 Applications of regret minimization
In order to establish regret minimization as a meaningful abstraction for learning in games, we must check
that regret minimizing and Φ-regret minimizing dynamics indeed lead to “interesting” or expected behavior in
common situations.
2.1 Learning a best response against stochastic opponents
As a first smoke test, let’s verify that over time a regret minimizer would learn how to best respond to static,
stochastic opponents. Specifically, consider this scenario. We are playing a repeated n-player general-sum game
with multilinear utilities (this captures normal-form and extensive-form games alike), where Players i = 1, . . . , n − 1 play stochastically, that is, at each t they independently sample a strategy x^{(i),t} ∈ X^{(i)} from the same fixed distribution (which is unknown to any other player). Formally, this means that

    \mathbb{E}[x^{(i),t}] = \bar{x}^{(i)} \qquad \forall\, i = 1, \ldots, n-1,\ t = 1, 2, \ldots.
Player n, on the other hand, is learning in the game, picking strategies according to some algorithm that
guarantees sublinear external regret, where the feedback observed by Player n at each time t is their own
linear utility function:
    \ell^t := \mathcal{X}^{(n)} \ni x^{(n)} \mapsto u^{(n)}(x^{(1),t}, \ldots, x^{(n-1),t}, x^{(n)}).
Then, the average of the strategies played by Player n converges almost surely to a best response to x̄^{(1)}, . . . , x̄^{(n−1)}, that is,

    \frac{1}{T} \sum_{t=1}^{T} x^{(n),t} \;\xrightarrow{\text{a.s.}}\; \arg\max_{\hat{x}^{(n)} \in \mathcal{X}^{(n)}} \left\{ u^{(n)}(\bar{x}^{(1)}, \ldots, \bar{x}^{(n-1)}, \hat{x}^{(n)}) \right\}.
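As a quick numerical smoke test of this claim, one can reuse the RegretMatching sketch from above on a small random matrix game in which the opponent plays a fixed mixed strategy; the matrix U and all numbers below are illustrative.

    import numpy as np
    rng = np.random.default_rng(0)

    U = rng.uniform(-1, 1, size=(3, 4))    # learner's utility matrix (3 opponent actions, 4 own actions)
    opponent = np.array([0.5, 0.3, 0.2])   # fixed (stochastic) opponent strategy

    learner, avg, T = RegretMatching(4), np.zeros(4), 10_000
    for t in range(T):
        x = learner.next_strategy()
        i = rng.choice(3, p=opponent)      # opponent samples a pure strategy
        learner.observe_utility(U[i])      # feedback: the learner's linear utility at time t
        avg += x
    avg /= T
    # avg should place (almost) all of its mass on the argmax of (opponent @ U),
    # i.e., on a best response to the opponent's fixed strategy.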
2.2 Self-play convergence to bilinear saddle points (such as a Nash equilibrium in a
two-player zero-sum game)
It turns out that regret minimization can be used to converge to bilinear saddle points, that is, solutions to problems of the form

    \max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} \; x^\top A y,    (4)
where X and Y are convex compact sets and A is a matrix. These types of optimization problems are pervasive
in game theory. The canonical prototype of a bilinear saddle point problem is the computation of Nash equilibria
in two-player zero-sum games (either normal-form or extensive-form). There, a Nash equilibrium is the solution
to (4) where X and Y are the strategy spaces of Player 1 and Player 2 respectively (probability simplexes for
normal-form games or sequence-form polytopes for extensive-form games), and A is the payoff matrix for
Player 1. Other examples include social-welfare-maximizing correlated equilibria and optimal strategies in
two-team zero-sum adversarial team games.
The idea behind using regret minimization to converge to bilinear saddle points is to use self-play.
We instantiate two regret minimization algorithms, RX and RY , for the domains of the maximization and
minimization problem, respectively. At each time t the two regret minimizers output strategies xt and yt,
respectively. Then, they receive feedback ℓ^t_X, ℓ^t_Y defined as

    \ell^t_{\mathcal{X}} : x \mapsto (A y^t)^\top x, \qquad \ell^t_{\mathcal{Y}} : y \mapsto -(A^\top x^t)^\top y.
We summarize the process pictorially in Figure 1.
[Figure 1: The flow of strategies and utilities in regret minimization for games. At each time t, R_X and R_Y output the strategies x^t and y^t, from which the utility functions ℓ^t_X and ℓ^t_Y are constructed and fed back to the two regret minimizers.]
A well-known folk theorem establishes that the pair of average strategies produced by the regret minimizers up to any time T converges to a saddle point of (4), where convergence is measured via the saddle point gap

    0 \le \gamma(x, y) := \left( \max_{\hat{x} \in \mathcal{X}} \{\hat{x}^\top A y\} - x^\top A y \right) + \left( x^\top A y - \min_{\hat{y} \in \mathcal{Y}} \{x^\top A \hat{y}\} \right) = \max_{\hat{x} \in \mathcal{X}} \{\hat{x}^\top A y\} - \min_{\hat{y} \in \mathcal{Y}} \{x^\top A \hat{y}\}.
A point (x, y) ∈ X × Y has zero saddle point gap if and only if it is a solution to (4).
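When X and Y are probability simplexes (the normal-form case), the two inner optimizations above are attained at pure strategies, so the gap can be computed directly; a minimal sketch, with an illustrative function name:

    import numpy as np

    def saddle_point_gap(A, x, y):
        """Saddle-point gap gamma(x, y) of max_x min_y x^T A y over probability simplexes."""
        return float(np.max(A @ y) - np.min(x @ A))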
Theorem 2.1. Consider the self-play setup summarized in Figure 1, where R_X and R_Y are regret minimizers for the sets X and Y, respectively. Let R^T_X and R^T_Y be the (sublinear) regret cumulated by R_X and R_Y, respectively, up to time T, and let x̄^T and ȳ^T denote the average of the strategies produced up to time T. Then, the saddle point gap γ(x̄^T, ȳ^T) of (x̄^T, ȳ^T) satisfies

    \gamma(\bar{x}^T, \bar{y}^T) \le \frac{R^T_{\mathcal{X}} + R^T_{\mathcal{Y}}}{T} \to 0 \quad \text{as } T \to \infty.
Proof. By definition of regret,

    \frac{R^T_{\mathcal{X}} + R^T_{\mathcal{Y}}}{T}
      = \frac{1}{T} \max_{\hat{x} \in \mathcal{X}} \left\{ \sum_{t=1}^{T} \ell^t_{\mathcal{X}}(\hat{x}) \right\} - \frac{1}{T} \sum_{t=1}^{T} \ell^t_{\mathcal{X}}(x^t)
      + \frac{1}{T} \max_{\hat{y} \in \mathcal{Y}} \left\{ \sum_{t=1}^{T} \ell^t_{\mathcal{Y}}(\hat{y}) \right\} - \frac{1}{T} \sum_{t=1}^{T} \ell^t_{\mathcal{Y}}(y^t)

      = \frac{1}{T} \max_{\hat{x} \in \mathcal{X}} \left\{ \sum_{t=1}^{T} \ell^t_{\mathcal{X}}(\hat{x}) \right\} + \frac{1}{T} \max_{\hat{y} \in \mathcal{Y}} \left\{ \sum_{t=1}^{T} \ell^t_{\mathcal{Y}}(\hat{y}) \right\}    (since \ell^t_{\mathcal{X}}(x^t) + \ell^t_{\mathcal{Y}}(y^t) = 0)

      = \frac{1}{T} \max_{\hat{x} \in \mathcal{X}} \left\{ \sum_{t=1}^{T} \hat{x}^\top A y^t \right\} + \frac{1}{T} \max_{\hat{y} \in \mathcal{Y}} \left\{ \sum_{t=1}^{T} -(x^t)^\top A \hat{y} \right\}

      = \max_{\hat{x} \in \mathcal{X}} \{\hat{x}^\top A \bar{y}^T\} - \min_{\hat{y} \in \mathcal{Y}} \{(\bar{x}^T)^\top A \hat{y}\} = \gamma(\bar{x}^T, \bar{y}^T).
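Putting the pieces together, the following sketch runs the self-play process of Figure 1 on a random zero-sum matrix game, reusing the RegretMatching and saddle_point_gap sketches from above; by Theorem 2.1, the printed gap should shrink roughly as O(1/\sqrt{T}). All concrete numbers are illustrative.

    import numpy as np
    rng = np.random.default_rng(1)

    A = rng.uniform(-1, 1, size=(5, 5))           # payoff matrix for Player 1
    RX, RY = RegretMatching(5), RegretMatching(5)
    xbar, ybar, T = np.zeros(5), np.zeros(5), 10_000

    for t in range(T):
        x, y = RX.next_strategy(), RY.next_strategy()
        RX.observe_utility(A @ y)                 # l^t_X : x -> (A y^t)^T x
        RY.observe_utility(-(A.T @ x))            # l^t_Y : y -> -(A^T x^t)^T y
        xbar += x
        ybar += y

    xbar, ybar = xbar / T, ybar / T
    print(saddle_point_gap(A, xbar, ybar))        # approaches 0 as T grows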
References
Geoffrey J Gordon, Amy Greenwald, and Casey Marks. No-regret learning in convex games. In Proceedings of
the International Conference on Machine Learning (ICML), pages 360–367. ACM, 2008.
