MIT 6.S890 – Topics in Multiagent Learning    Tue, Oct 8th 2024
Lecture 10
Learning in extensive-form games
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)∗
1 Learning algorithms for extensive-form games
Several approaches for constructing no-regret algorithms for extensive-form games have been
proposed. For one, extensive-form games are a particular instance of combinatorial games for which
the multiplicative weights update algorithm can be implemented efficiently in the reduced normal
form of the game, despite the exponential size. We will see more details about this in a later class.
As explained in Lecture 9, the natural representation of strategies for defining learning in extensive-
form games is the sequence-form representation. Indeed, in that representation utility functions are
linear and the strategy set of each player is a convex polytope, aligning with the requirements of
the regret minimization framework. Thanks to the sequence-form representation of strategies, all the
results about external regret minimization we have seen so far apply to extensive-form games as well,
including for example the fact that a Nash equilibrium in a two-player zero-sum game can be found
by letting two regret minimizers play against each other by exchanging sequence-form strategies at
every iteration according to the canonical learning setup shown below.
[Figure: the canonical learning setup. At every iteration t, the regret minimizers ℛ_𝒳 and ℛ_𝒴 output sequence-form strategies x^(t) and y^(t), and then observe the utility vectors u_𝒳^(t) and u_𝒴^(t) induced by the opponent's strategy before producing x^(t+1) and y^(t+1).]
Another example is the computation of coarse correlated equilibria in any multiplayer extensive-form
game via external regret minimization, or computation of best responses against static opponents.
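As a concrete illustration, the following is a minimal Python sketch of this self-play setup for a two-player zero-sum game whose utility is given by a sequence-form payoff matrix A (the x-player maximizes xᵀAy). The regret minimizer objects and their next_strategy/observe_utility methods are illustrative assumptions for these notes, not a specific library API; when both regrets grow sublinearly, the average strategies approach a Nash equilibrium.

    import numpy as np

    def self_play(R_x, R_y, A, T):
        """Canonical learning setup: two regret minimizers exchange sequence-form
        strategies, and their running averages approach a Nash equilibrium.

        R_x, R_y: hypothetical regret minimizers over the two players' sequence-form
                  polytopes, exposing next_strategy() and observe_utility(gradient).
        A:        sequence-form payoff matrix (the x-player maximizes x^T A y).
        """
        avg_x, avg_y = None, None
        for t in range(1, T + 1):
            x = R_x.next_strategy()           # x^(t)
            y = R_y.next_strategy()           # y^(t)
            R_x.observe_utility(A @ y)        # u_X^(t): gradient of x -> x^T A y
            R_y.observe_utility(-A.T @ x)     # u_Y^(t): gradient of y -> -x^T A y
            avg_x = x if avg_x is None else avg_x + (x - avg_x) / t   # running averages
            avg_y = y if avg_y is None else avg_y + (y - avg_y) / t
        return avg_x, avg_y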
To construct an external regret minimizer that outputs sequence-form strategies, several approaches
can be followed. For one, we have seen that one can always use the online projected gradient ascent
algorithm, which is a particular instantiation of the online mirror descent (OMD) algorithm. The
drawback of such an approach is that it requires projecting onto the polytope of sequence-form strategies,
which might be laborious. Alternative regularizers (i.e., distance-generating functions) that render
the projection easier have been proposed. Today, however, we focus on a different approach, which
has been extremely popular in practice: the counterfactual regret minimization (CFR) algorithm.
2 The CFR algorithm
The idea of the CFR algorithm is simple: construct a regret minimizer for the whole tree-form
problem starting from local regret minimizers at each decision point, each learning what actions to
play at that decision point.
Example 2.1. As an example, consider the TFDP faced by Player 1 in the game of Kuhn
poker [Kuh50], which we already introduced in Lecture 9. The black nodes are the decision
points of the player, and the white nodes are the observation points.
Since the player has six decision points, denoted j_1, ..., j_6 in the figure, the CFR algorithm will
use six local regret minimizers, which we denote ℛ_1, ..., ℛ_6. Each regret minimizer ℛ_j will be
responsible for outputting a local strategy b_j ∈ Δ(A_j) for its decision point j.
The local distributions output by the different local regret minimizers are then combined to form a
sequence-form strategy that plays according to the local distributions at each decision point.
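For instance, in the TFDP of Example 2.1, suppose the local strategy output for j_1 plays check with probability 0.4, and the local strategy output for the decision point j_4 that becomes reachable after that check (see Example 2.2) plays one of its actions with probability 0.7. Then the combined sequence-form strategy assigns value 0.4 to the sequence (j_1, check) and value 0.4 · 0.7 = 0.28 to the corresponding sequence at j_4: each sequence receives the product of the local probabilities on the path from the root down to it.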
2.1 Where the magic happens: counterfactual utilities
What is the training signal that each local regret minimizer receives? In other words, what is the
utility vector that the regret minimizer at decision point j observes? The answer is the counterfactual utility.
Remember that in the sequence-form representation, the dimensionality of the strategy vectors
matches the total number of actions of the player, one per decision point-action pair (that is, per
sequence). Hence, the gradient vector received by the regret minimizer has one entry for each such
action, intuitively representing whether the "probability flow" passing through that action scores well
or poorly. The idea of counterfactual utilities is to use as training signal for every ℛ_j the vector of
expected utilities in the subtrees rooted at each of the actions a ∈ A_j.
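Concretely, using the notation recalled in the next subsection (and exactly as computed in Algorithm 1 below), the counterfactual utility observed by ℛ_j assigns to each action a ∈ A_j the value

    u_j^(t)[a] = u^(t)[ja] + V^(t)[ρ(j, a)],

where u^(t)[ja] is the entry of the utility vector corresponding to the sequence ja, and V^(t)[ρ(j, a)] is the expected utility, under the current local strategies, of the subtree that action a leads to (with the convention V^(t)[⊥] = 0).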
It can be shown that the regret cumulated by the CFR algorithm satisfies the following bound.

Theorem 2.1. Let Reg_j^(T) (j ∈ 𝒥) denote the regret cumulated up to time T by each of the regret
minimizers ℛ_j. Then, the regret Reg^(T) cumulated by Algorithm 1 up to time T satisfies

    Reg^(T) ≤ ∑_{j ∈ 𝒥} max{0, Reg_j^(T)}.

It is then immediate to see that if each Reg_j^(T) grows sublinearly in T, then so does Reg^(T).
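For instance, if every local ℛ_j is the regret matching algorithm from Lecture 5, whose regret after T iterations grows as O(√(|A_j| T)) (with a constant depending on the range of the counterfactual utilities observed at j), then the bound above gives

    Reg^(T) ≤ ∑_{j ∈ 𝒥} O(√(|A_j| T)) = O(√T),

which is sublinear in T, with a constant that scales with the number of decision points of the TFDP.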
In order to formally introduce counterfactual utility, we recall a bit of notation to deal with tree-
form decision processes.
Notation for tree-form decision processes. We recall the following notation for dealing with tree-
form decision processes (TFDPs), which we introduced in Lecture 9. The notation is also summarized
in Table 1.
• We denote the set of decision points in the TFDP as 𝒥, and the set of observation points as 𝒦.
At each decision point j ∈ 𝒥, the agent selects an action from the set A_j of available actions.
At each observation point k ∈ 𝒦, the agent observes a signal s_k from the environment out of
a set of possible signals S_k.
• We denote by ρ the transition function of the process. Picking action a ∈ A_j at decision point
j ∈ 𝒥 results in the process transitioning to ρ(j, a) ∈ 𝒥 ∪ 𝒦 ∪ {⊥}, where ⊥ denotes the end
of the decision process. Similarly, the process transitions to ρ(k, s) ∈ 𝒥 ∪ 𝒦 ∪ {⊥} after the
agent observes signal s ∈ S_k at observation point k ∈ 𝒦.
• A pair (j, a) where j ∈ 𝒥 and a ∈ A_j is called a sequence. The set of all sequences is denoted as
Σ ≔ {(j, a) : j ∈ 𝒥, a ∈ A_j}. For notational convenience, we will often denote an element (j, a)
in Σ as ja without using parentheses.
• Given a decision point j ∈ 𝒥, we denote by p_j its parent sequence, defined as the last sequence
(that is, decision point-action pair) encountered on the path from the root of the decision
process to j. If the agent does not act before j (that is, j is the root of the process or only
observation points are encountered on the path from the root to j), we let p_j = ∅.
Example 2.2. As an example, consider again the TFDP faced by Player 1 in the game of Kuhn
poker [Kuh50], which was also recalled above in Example 2.1. We have that 𝒥 = {j_1, ..., j_6} and
𝒦 = {k_1, ..., k_4}. We have:

    A_{j_1} = S_{k_4} = {check, raise},   A_{j_5} = {fold, call},   S_{k_1} = {jack, queen, king}
    p_{j_4} = (j_1, check),   p_{j_6} = (j_3, check),   p_{j_1} = p_{j_2} = p_{j_3} = ∅.

Furthermore,

    ρ(k_3, check) = ρ(j_2, raise) = ⊥,   ρ(k_1, king) = j_3,   ρ(j_2, check) = k_3.
Notation for the components of vectors. Any vector x ∈ ℝ^Σ has, by definition, as many components
as there are sequences in Σ. The component corresponding to a specific sequence ja ∈ Σ is denoted as x[ja].
Similarly, given any decision point j ∈ 𝒥, any vector x ∈ ℝ^{A_j} has as many components as the number
of actions at j. The component corresponding to a specific action a ∈ A_j is denoted x[a].
Symbol   Description
𝒥        Set of decision points
A_j      Set of legal actions at decision point j ∈ 𝒥
𝒦        Set of observation points
S_k      Set of possible signals at observation point k ∈ 𝒦
ρ        Transition function:
         • given j ∈ 𝒥 and a ∈ A_j, ρ(j, a) returns the next decision or observation point v ∈
           𝒥 ∪ 𝒦 in the decision tree that is reached after selecting legal action a at j, or ⊥
           if the decision process ends;
         • given k ∈ 𝒦 and s ∈ S_k, ρ(k, s) returns the next decision or observation point v ∈
           𝒥 ∪ 𝒦 in the decision tree that is reached after observing signal s at k, or ⊥ if the
           decision process ends
Σ        Set of sequences, defined as Σ ≔ {(j, a) : j ∈ 𝒥, a ∈ A_j}
p_j      Parent sequence of decision point j ∈ 𝒥, defined as the last sequence (decision point-
         action pair) on the path from the root of the TFDP to decision point j; if the agent
         does not act before j, p_j = ∅
Table 1: Summary of notation for tree-form decision processes.
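To make the notation of Table 1 concrete, the following minimal Python sketch shows one possible way to encode a small, made-up TFDP with plain dictionaries. The field names and the toy decision and observation points are illustrative assumptions only; the same layout is reused in the CFR sketch after Algorithm 1.

    # One possible plain-dictionary encoding of a tiny, hypothetical TFDP:
    # the agent first acts at decision point "j1"; action "r" leads to observation
    # point "k1", whose two signals lead to decision points "j2" and "j3".
    TERMINAL = None  # plays the role of the symbol ⊥

    tfdp = {
        "decision_points": {                  # 𝒥: maps j -> list of actions A_j
            "j1": ["l", "r"],
            "j2": ["a", "b"],
            "j3": ["a", "b"],
        },
        "observation_points": {               # 𝒦: maps k -> list of signals S_k
            "k1": ["s1", "s2"],
        },
        "transition": {                       # ρ: (node, action or signal) -> node or ⊥
            ("j1", "l"): TERMINAL, ("j1", "r"): "k1",
            ("k1", "s1"): "j2",    ("k1", "s2"): "j3",
            ("j2", "a"): TERMINAL, ("j2", "b"): TERMINAL,
            ("j3", "a"): TERMINAL, ("j3", "b"): TERMINAL,
        },
        "parent_sequence": {                  # p_j: decision point -> (j', a'), or None for ∅
            "j1": None,
            "j2": ("j1", "r"),
            "j3": ("j1", "r"),
        },
    }

    # The set of sequences Σ then consists of all pairs (j, a) with j a decision point:
    sequences = [(j, a) for j, actions in tfdp["decision_points"].items() for a in actions]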
2.2 Pseudocode for CFR
Pseudocode for CFR is given in Algorithm 1. Note that the implementation is parametric in the
regret minimization algorithms ℛ_j run locally at each decision point. Any regret minimizer ℛ_j for
simplex domains can be used to solve the local regret minimization problems. Popular options are
the regret matching algorithm and the regret matching plus algorithm (Lecture 5).
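Since Algorithm 1 is parametric in the local regret minimizers, it may help to keep a concrete instantiation in mind. The following is a minimal Python sketch of regret matching over a simplex Δ(A); the class name and the next_strategy/observe_utility interface are conventions adopted for these notes, not a library API.

    class RegretMatching:
        """Regret matching over Δ(A) (Lecture 5): play proportionally to the positive
        part of the cumulative regrets, and uniformly if all of them are nonpositive."""

        def __init__(self, actions):
            self.actions = list(actions)
            self.cum_regret = {a: 0.0 for a in self.actions}
            self.last = None                       # strategy output at this iteration

        def next_strategy(self):
            positive = {a: max(0.0, r) for a, r in self.cum_regret.items()}
            total = sum(positive.values())
            if total > 0:
                self.last = {a: positive[a] / total for a in self.actions}
            else:
                self.last = {a: 1.0 / len(self.actions) for a in self.actions}
            return self.last

        def observe_utility(self, u):
            # u: dict mapping each action to its observed utility
            expected = sum(self.last[a] * u[a] for a in self.actions)
            for a in self.actions:
                self.cum_regret[a] += u[a] - expected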
Algorithm 1: CFR regret minimizer
Data: a regret minimizer ℛ_j for Δ(A_j), one for each decision point j ∈ 𝒥 of the TFDP.

1  function NextStrategy()
     [▹ Step 1: we ask each of the ℛ_j for their next local strategy at each decision point]
2    for each decision point j ∈ 𝒥
3      b_j^(t) ∈ Δ(A_j) ← ℛ_j.NextStrategy()
     [▹ Step 2: we construct the sequence-form representation of the strategy that plays
        according to the distribution b_j^(t) at each decision point j ∈ 𝒥]
4    x^(t) ← 0 ∈ ℝ^Σ
5    for each decision point j ∈ 𝒥 in top-down traversal order in the TFDP
6      for each action a ∈ A_j
7        if p_j = ∅
8          x^(t)[ja] ← b_j^(t)[a]
9        else
10         x^(t)[ja] ← x^(t)[p_j] · b_j^(t)[a]
     [▹ You should convince yourself that the vector x^(t) we just filled in above is a valid
        sequence-form strategy, that is, it satisfies the required consistency constraints we saw
        in Lecture 9. In symbols, x^(t) ∈ Q]
11   return x^(t)

12 function ObserveUtility(u^(t) ∈ ℝ^Σ)
     [▹ Step 1: we compute the expected utility of the subtree rooted at each node v ∈ 𝒥 ∪ 𝒦]
13   V^(t) ← empty dictionary  [▹ eventually, it will map keys in 𝒥 ∪ 𝒦 ∪ {⊥} to real numbers]
14   V^(t)[⊥] ← 0
15   for each node v ∈ 𝒥 ∪ 𝒦 in bottom-up traversal order in the TFDP
16     if v ∈ 𝒥
17       let j ≔ v
18       V^(t)[j] ← ∑_{a ∈ A_j} b_j^(t)[a] · (u^(t)[ja] + V^(t)[ρ(j, a)])
19     else
20       let k ≔ v
21       V^(t)[k] ← ∑_{s ∈ S_k} V^(t)[ρ(k, s)]
     [▹ Step 2: at each decision point j ∈ 𝒥, we now construct a local utility vector u_j^(t),
        called the counterfactual utility]
22   for each decision point j ∈ 𝒥
23     u_j^(t) ← 0 ∈ ℝ^{A_j}
24     for each action a ∈ A_j
25       u_j^(t)[a] ← u^(t)[ja] + V^(t)[ρ(j, a)]
26     ℛ_j.ObserveUtility(u_j^(t))
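Putting the pieces together, here is a minimal Python sketch of Algorithm 1. It assumes the plain-dictionary TFDP encoding sketched after Table 1 and the RegretMatching class sketched above (both hypothetical conventions for these notes); utility vectors u^(t) are represented as dictionaries keyed by sequences (j, a).

    class CFR:
        """Sketch of Algorithm 1: one local regret minimizer per decision point."""

        def __init__(self, tfdp, local_rm=RegretMatching):
            self.tfdp = tfdp
            self.local = {j: local_rm(actions)                    # ℛ_j for Δ(A_j)
                          for j, actions in tfdp["decision_points"].items()}
            self.b = {}                                           # local strategies b_j^(t)

        def next_strategy(self):
            # Step 1: ask each ℛ_j for its next local strategy.
            self.b = {j: rm.next_strategy() for j, rm in self.local.items()}
            # Step 2: combine them into a sequence-form strategy x^(t), top-down.
            x = {}
            for v in self._top_down():
                if v not in self.tfdp["decision_points"]:
                    continue
                p = self.tfdp["parent_sequence"][v]
                for a, prob in self.b[v].items():
                    x[(v, a)] = prob if p is None else x[p] * prob
            return x

        def observe_utility(self, u):
            # Step 1: expected utility V[v] of the subtree rooted at each node v, bottom-up.
            V = {None: 0.0}                                       # None stands for ⊥
            rho = self.tfdp["transition"]
            for v in reversed(self._top_down()):
                if v in self.tfdp["decision_points"]:
                    V[v] = sum(self.b[v][a] * (u[(v, a)] + V[rho[(v, a)]])
                               for a in self.tfdp["decision_points"][v])
                else:
                    V[v] = sum(V[rho[(v, s)]] for s in self.tfdp["observation_points"][v])
            # Step 2: hand each ℛ_j its counterfactual utility vector.
            for j, actions in self.tfdp["decision_points"].items():
                u_j = {a: u[(j, a)] + V[rho[(j, a)]] for a in actions}
                self.local[j].observe_utility(u_j)

        def _top_down(self):
            """All decision and observation points, parents before children."""
            rho = self.tfdp["transition"]
            nodes = set(self.tfdp["decision_points"]) | set(self.tfdp["observation_points"])
            children = {w for w in rho.values() if w is not None}
            order = [v for v in nodes if v not in children]       # root(s)
            frontier = list(order)
            while frontier:
                v = frontier.pop()
                kids = [w for (p, _), w in rho.items() if p == v and w is not None]
                order.extend(kids)
                frontier.extend(kids)
            return order

For example, with the toy tfdp dictionary above, CFR(tfdp).next_strategy() returns a dictionary mapping each sequence (j, a) to its sequence-form value, and observe_utility expects the utility vector u^(t) for the current iteration keyed in the same way.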
Bibliography
[Kuh50] H. W. Kuhn, "A Simplified Two-Person Poker," in Contributions to the Theory of Games,
vol. 1, Annals of Mathematics Studies 24. Princeton, NJ: Princeton University Press, 1950,
pp. 97–103.
∗ These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.