MIT 6.S890 – Topics in Multiagent Learning    Tue, Oct 8th 2024
Lecture 10
Learning in extensive-form games
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)∗
1 Learning algorithms for extensive-form games
Several approaches for constructing no-regret algorithms for extensive-form games have been
proposed. For one, extensive-form games are a particular instance of combinatorial games for which
the multiplicative weights update algorithm can be implemented efficiently in the reduced normal
form of the game, despite the exponential size. We will see more details about this in a later class.
As explained in Lecture 9, the natural representation of strategies for defining learning in extensive-
form games is the sequence-form representation. Indeed, in that representation utility functions are
linear and the strategy set of each player is a convex polytope, aligning with the requirements of
the regret minimization framework. Thanks to the sequence-form representation of strategies, all the
results about external regret minimization we have seen so far apply to extensive-form games as well,
including for example the fact that a Nash equilibrium in a two-player zero-sum game can be found
by letting two regret minimizers play against each other by exchanging sequence-form strategies at
every iteration according to the canonical learning setup shown below.
[Figure: the canonical learning setup. At every iteration t, the regret minimizers ℛ_𝒳 and ℛ_𝒴 output sequence-form strategies x^(t) and y^(t), and then observe the utility vectors u_𝒳^(t) and u_𝒴^(t) induced by the opponent's strategy before producing x^(t+1) and y^(t+1).]
Another example is the computation of coarse correlated equilibria in any multiplayer extensive-form
game via external regret minimization, or computation of best responses against static opponents.
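As a concrete illustration, the following is a minimal Python sketch of this self-play setup for a two-player zero-sum game whose utility is given by a sequence-form payoff matrix A (the x-player maximizes xᵀAy). The regret minimizer objects and their next_strategy/observe_utility methods are illustrative assumptions for these notes, not a specific library API; when both regrets grow sublinearly, the average strategies approach a Nash equilibrium.

    import numpy as np

    def self_play(R_x, R_y, A, T):
        """Canonical learning setup: two regret minimizers exchange sequence-form
        strategies, and their running averages approach a Nash equilibrium.

        R_x, R_y: hypothetical regret minimizers over the two players' sequence-form
                  polytopes, exposing next_strategy() and observe_utility(gradient).
        A:        sequence-form payoff matrix (the x-player maximizes x^T A y).
        """
        avg_x, avg_y = None, None
        for t in range(1, T + 1):
            x = R_x.next_strategy()           # x^(t)
            y = R_y.next_strategy()           # y^(t)
            R_x.observe_utility(A @ y)        # u_X^(t): gradient of x -> x^T A y
            R_y.observe_utility(-A.T @ x)     # u_Y^(t): gradient of y -> -x^T A y
            avg_x = x if avg_x is None else avg_x + (x - avg_x) / t   # running averages
            avg_y = y if avg_y is None else avg_y + (y - avg_y) / t
        return avg_x, avg_y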
To construct an external regret minimizer that outputs sequence-form strategies, several approaches
can be followed. For one, we have seen that one can always use the online projected gradient ascent
algorithm, which is a particular instantiation of the online mirror descent (OMD) algorithm. The
drawback of such an approach is that it requires projecting onto the polytope of sequence-form strategies,
which might be laborious. Alternative regularizers (i.e., distance-generating functions) that render
the projection easier have been proposed. Today, however, we focus on a different approach, which
has been extremely popular in practice: the counterfactual regret minimization (CFR) algorithm.
2 The CFR algorithm
The idea of the CFR algorithm is simple: construct a regret minimizer for the whole tree-form
problem starting from local regret minimizers at each decision point, each learning what actions to
play at that decision point.
Example 2.1. As an example, consider the TFDP faced by Player 1 in the game of Kuhn
poker [Kuh50], which we already introduced in Lecture 9. The black nodes are the decision
points of the player, and the white nodes are the observation points.
Since the player has six decision points, denoted j_1, ..., j_6 in the figure, the CFR algorithm will
use six local regret minimizers, which we denote ℛ_1, ..., ℛ_6. Each regret minimizer ℛ_j will be
responsible for outputting a local strategy b_j ∈ Δ(A_j) for its decision point j.
The local distributions output by the different local regret minimizers are then combined to form a
sequence-form strategy that plays according to the local distributions at each decision point.
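For instance, in the TFDP of Example 2.1, suppose the local strategy output for j_1 plays check with probability 0.4, and the local strategy output for the decision point j_4 that becomes reachable after that check (see Example 2.2) plays one of its actions with probability 0.7. Then the combined sequence-form strategy assigns value 0.4 to the sequence (j_1, check) and value 0.4 · 0.7 = 0.28 to the corresponding sequence at j_4: each sequence receives the product of the local probabilities on the path from the root down to it.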
2.1 Where the magic happens: counterfactual utilities
What is the training signal that each local regret minimizer receives? In other words, what is the
utility vector that the regret minimizer at decision point j observes? The answer is the counterfactual utility.
Remember that in the sequence-form representation, the dimensionality of the strategy vectors
matches the total number of actions of the player, one per decision point-action pair (that is, per
sequence). Hence, the gradient vector received by the regret minimizer has one entry for each such
action, intuitively representing whether the "probability flow" passing through that action scores well
or poorly. The idea of counterfactual utilities is to use as training signal for every ℛ_j the vector of
expected utilities in the subtrees rooted at each of the actions a ∈ A_j.
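Concretely, using the notation recalled in the next subsection (and exactly as computed in Algorithm 1 below), the counterfactual utility observed by ℛ_j assigns to each action a ∈ A_j the value

    u_j^(t)[a] = u^(t)[ja] + V^(t)[ρ(j, a)],

where u^(t)[ja] is the entry of the utility vector corresponding to the sequence ja, and V^(t)[ρ(j, a)] is the expected utility, under the current local strategies, of the subtree that action a leads to (with the convention V^(t)[⊥] = 0).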
It can be shown that the regret cumulated by the CFR algorithm satisfies the following bound.

Theorem 2.1. Let Reg_j^(T) (j ∈ 𝒥) denote the regret cumulated up to time T by each of the regret
minimizers ℛ_j. Then, the regret Reg^(T) cumulated by Algorithm 1 up to time T satisfies

    Reg^(T) ≤ ∑_{j ∈ 𝒥} max{0, Reg_j^(T)}.

It is then immediate to see that if each Reg_j^(T) grows sublinearly in T, then so does Reg^(T).
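For instance, if every local ℛ_j is the regret matching algorithm from Lecture 5, whose regret after T iterations grows as O(√(|A_j| T)) (with a constant depending on the range of the counterfactual utilities observed at j), then the bound above gives

    Reg^(T) ≤ ∑_{j ∈ 𝒥} O(√(|A_j| T)) = O(√T),

which is sublinear in T, with a constant that scales with the number of decision points of the TFDP.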
In order to formally introduce counterfactual utility, we recall a bit of notation to deal with tree-
form decision processes.
Notation for tree-form decision processes. We recall the following notation for dealing with tree-
form decision processes (TFDPs), which we introduced in Lecture 9. The notation is also summarized
in Table 1.
• We denote the set of decision points in the TFDP as 𝒥, and the set of observation points as 𝒦.
At each decision point j ∈ 𝒥, the agent selects an action from the set A_j of available actions.
At each observation point k ∈ 𝒦, the agent observes a signal s_k from the environment out of
a set of possible signals S_k.
• We denote by ρ the transition function of the process. Picking action a ∈ A_j at decision point
j ∈ 𝒥 results in the process transitioning to ρ(j, a) ∈ 𝒥 ∪ 𝒦 ∪ {⊥}, where ⊥ denotes the end
of the decision process. Similarly, the process transitions to ρ(k, s) ∈ 𝒥 ∪ 𝒦 ∪ {⊥} after the
agent observes signal s ∈ S_k at observation point k ∈ 𝒦.
• A pair (j, a) where j ∈ 𝒥 and a ∈ A_j is called a sequence. The set of all sequences is denoted as
Σ ≔ {(j, a) : j ∈ 𝒥, a ∈ A_j}. For notational convenience, we will often denote an element (j, a)
in Σ as ja without using parentheses.
• Given a decision point j ∈ 𝒥, we denote by p_j its parent sequence, defined as the last sequence
(that is, decision point-action pair) encountered on the path from the root of the decision
process to j. If the agent does not act before j (that is, j is the root of the process or only
observation points are encountered on the path from the root to j), we let p_j = ∅.
Example 2.2. As an example, consider again the TFDP faced by Player 1 in the game of Kuhn
poker [Kuh50], which was also recalled above in Example 2.1. We have that 𝒥 = {j_1, ..., j_6} and
𝒦 = {k_1, ..., k_4}. We have:

    A_{j_1} = S_{k_4} = {check, raise},   A_{j_5} = {fold, call},   S_{k_1} = {jack, queen, king}
    p_{j_4} = (j_1, check),   p_{j_6} = (j_3, check),   p_{j_1} = p_{j_2} = p_{j_3} = ∅.

Furthermore,

    ρ(k_3, check) = ρ(j_2, raise) = ⊥,   ρ(k_1, king) = j_3,   ρ(j_2, check) = k_3.
Notation for the components of vectors. Any vector x ∈ ℝ^Σ has, by definition, as many components
as there are sequences in Σ. The component corresponding to a specific sequence ja ∈ Σ is denoted as x[ja].
Similarly, given any decision point j ∈ 𝒥, any vector x ∈ ℝ^{A_j} has as many components as the number
of actions at j. The component corresponding to a specific action a ∈ A_j is denoted x[a].
Symbol   Description
𝒥        Set of decision points
A_j      Set of legal actions at decision point j ∈ 𝒥
𝒦        Set of observation points
S_k      Set of possible signals at observation point k ∈ 𝒦
ρ        Transition function:
         • given j ∈ 𝒥 and a ∈ A_j, ρ(j, a) returns the next decision or observation point v ∈
           𝒥 ∪ 𝒦 in the decision tree that is reached after selecting legal action a at j, or ⊥
           if the decision process ends;
         • given k ∈ 𝒦 and s ∈ S_k, ρ(k, s) returns the next decision or observation point v ∈
           𝒥 ∪ 𝒦 in the decision tree that is reached after observing signal s at k, or ⊥ if the
           decision process ends
Σ        Set of sequences, defined as Σ ≔ {(j, a) : j ∈ 𝒥, a ∈ A_j}
p_j      Parent sequence of decision point j ∈ 𝒥, defined as the last sequence (decision point-
         action pair) on the path from the root of the TFDP to decision point j; if the agent
         does not act before j, p_j = ∅
Table 1: Summary of notation for tree-form decision processes.
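To make the notation of Table 1 concrete, the following minimal Python sketch shows one possible way to encode a small, made-up TFDP with plain dictionaries. The field names and the toy decision and observation points are illustrative assumptions only; the same layout is reused in the CFR sketch after Algorithm 1.

    # One possible plain-dictionary encoding of a tiny, hypothetical TFDP:
    # the agent first acts at decision point "j1"; action "r" leads to observation
    # point "k1", whose two signals lead to decision points "j2" and "j3".
    TERMINAL = None  # plays the role of the symbol ⊥

    tfdp = {
        "decision_points": {                  # 𝒥: maps j -> list of actions A_j
            "j1": ["l", "r"],
            "j2": ["a", "b"],
            "j3": ["a", "b"],
        },
        "observation_points": {               # 𝒦: maps k -> list of signals S_k
            "k1": ["s1", "s2"],
        },
        "transition": {                       # ρ: (node, action or signal) -> node or ⊥
            ("j1", "l"): TERMINAL, ("j1", "r"): "k1",
            ("k1", "s1"): "j2",    ("k1", "s2"): "j3",
            ("j2", "a"): TERMINAL, ("j2", "b"): TERMINAL,
            ("j3", "a"): TERMINAL, ("j3", "b"): TERMINAL,
        },
        "parent_sequence": {                  # p_j: decision point -> (j', a'), or None for ∅
            "j1": None,
            "j2": ("j1", "r"),
            "j3": ("j1", "r"),
        },
    }

    # The set of sequences Σ then consists of all pairs (j, a) with j a decision point:
    sequences = [(j, a) for j, actions in tfdp["decision_points"].items() for a in actions]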
2.2 Pseudocode for CFR
Pseudocode for CFR is given in Algorithm 1. Note that the implementation is parametric in the
regret minimization algorithms ℛ_j run locally at each decision point. Any regret minimizer ℛ_j for
simplex domains can be used to solve the local regret minimization problems. Popular options are
the regret matching algorithm and the regret matching plus algorithm (Lecture 5).
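Since Algorithm 1 is parametric in the local regret minimizers, it may help to keep a concrete instantiation in mind. The following is a minimal Python sketch of regret matching over a simplex Δ(A); the class name and the next_strategy/observe_utility interface are conventions adopted for these notes, not a library API.

    class RegretMatching:
        """Regret matching over Δ(A) (Lecture 5): play proportionally to the positive
        part of the cumulative regrets, and uniformly if all of them are nonpositive."""

        def __init__(self, actions):
            self.actions = list(actions)
            self.cum_regret = {a: 0.0 for a in self.actions}
            self.last = None                       # strategy output at this iteration

        def next_strategy(self):
            positive = {a: max(0.0, r) for a, r in self.cum_regret.items()}
            total = sum(positive.values())
            if total > 0:
                self.last = {a: positive[a] / total for a in self.actions}
            else:
                self.last = {a: 1.0 / len(self.actions) for a in self.actions}
            return self.last

        def observe_utility(self, u):
            # u: dict mapping each action to its observed utility
            expected = sum(self.last[a] * u[a] for a in self.actions)
            for a in self.actions:
                self.cum_regret[a] += u[a] - expected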
Algorithm 1: CFR regret minimizer
Data: a regret minimizer ℛ_j for Δ(A_j), one for each decision point j ∈ 𝒥 of the TFDP.

1  function NextStrategy()
     [▹ Step 1: we ask each of the ℛ_j for their next local strategy at each decision point]
2    for each decision point j ∈ 𝒥
3      b_j^(t) ∈ Δ(A_j) ← ℛ_j.NextStrategy()
     [▹ Step 2: we construct the sequence-form representation of the strategy that plays
        according to the distribution b_j^(t) at each decision point j ∈ 𝒥]
4    x^(t) ← 0 ∈ ℝ^Σ
5    for each decision point j ∈ 𝒥 in top-down traversal order in the TFDP
6      for each action a ∈ A_j
7        if p_j = ∅
8          x^(t)[ja] ← b_j^(t)[a]
9        else
10         x^(t)[ja] ← x^(t)[p_j] · b_j^(t)[a]
     [▹ You should convince yourself that the vector x^(t) we just filled in above is a valid
        sequence-form strategy, that is, it satisfies the required consistency constraints we saw
        in Lecture 9. In symbols, x^(t) ∈ Q]
11   return x^(t)

12 function ObserveUtility(u^(t) ∈ ℝ^Σ)
     [▹ Step 1: we compute the expected utility of the subtree rooted at each node v ∈ 𝒥 ∪ 𝒦]
13   V^(t) ← empty dictionary  [▹ eventually, it will map keys in 𝒥 ∪ 𝒦 ∪ {⊥} to real numbers]
14   V^(t)[⊥] ← 0
15   for each node v ∈ 𝒥 ∪ 𝒦 in bottom-up traversal order in the TFDP
16     if v ∈ 𝒥
17       let j ≔ v
18       V^(t)[j] ← ∑_{a ∈ A_j} b_j^(t)[a] · (u^(t)[ja] + V^(t)[ρ(j, a)])
19     else
20       let k ≔ v
21       V^(t)[k] ← ∑_{s ∈ S_k} V^(t)[ρ(k, s)]
     [▹ Step 2: at each decision point j ∈ 𝒥, we now construct a local utility vector u_j^(t),
        called the counterfactual utility]
22   for each decision point j ∈ 𝒥
23     u_j^(t) ← 0 ∈ ℝ^{A_j}
24     for each action a ∈ A_j
25       u_j^(t)[a] ← u^(t)[ja] + V^(t)[ρ(j, a)]
26     ℛ_j.ObserveUtility(u_j^(t))
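Putting the pieces together, here is a minimal Python sketch of Algorithm 1. It assumes the plain-dictionary TFDP encoding sketched after Table 1 and the RegretMatching class sketched above (both hypothetical conventions for these notes); utility vectors u^(t) are represented as dictionaries keyed by sequences (j, a).

    class CFR:
        """Sketch of Algorithm 1: one local regret minimizer per decision point."""

        def __init__(self, tfdp, local_rm=RegretMatching):
            self.tfdp = tfdp
            self.local = {j: local_rm(actions)                    # ℛ_j for Δ(A_j)
                          for j, actions in tfdp["decision_points"].items()}
            self.b = {}                                           # local strategies b_j^(t)

        def next_strategy(self):
            # Step 1: ask each ℛ_j for its next local strategy.
            self.b = {j: rm.next_strategy() for j, rm in self.local.items()}
            # Step 2: combine them into a sequence-form strategy x^(t), top-down.
            x = {}
            for v in self._top_down():
                if v not in self.tfdp["decision_points"]:
                    continue
                p = self.tfdp["parent_sequence"][v]
                for a, prob in self.b[v].items():
                    x[(v, a)] = prob if p is None else x[p] * prob
            return x

        def observe_utility(self, u):
            # Step 1: expected utility V[v] of the subtree rooted at each node v, bottom-up.
            V = {None: 0.0}                                       # None stands for ⊥
            rho = self.tfdp["transition"]
            for v in reversed(self._top_down()):
                if v in self.tfdp["decision_points"]:
                    V[v] = sum(self.b[v][a] * (u[(v, a)] + V[rho[(v, a)]])
                               for a in self.tfdp["decision_points"][v])
                else:
                    V[v] = sum(V[rho[(v, s)]] for s in self.tfdp["observation_points"][v])
            # Step 2: hand each ℛ_j its counterfactual utility vector.
            for j, actions in self.tfdp["decision_points"].items():
                u_j = {a: u[(j, a)] + V[rho[(j, a)]] for a in actions}
                self.local[j].observe_utility(u_j)

        def _top_down(self):
            """All decision and observation points, parents before children."""
            rho = self.tfdp["transition"]
            nodes = set(self.tfdp["decision_points"]) | set(self.tfdp["observation_points"])
            children = {w for w in rho.values() if w is not None}
            order = [v for v in nodes if v not in children]       # root(s)
            frontier = list(order)
            while frontier:
                v = frontier.pop()
                kids = [w for (p, _), w in rho.items() if p == v and w is not None]
                order.extend(kids)
                frontier.extend(kids)
            return order

For example, with the toy tfdp dictionary above, CFR(tfdp).next_strategy() returns a dictionary mapping each sequence (j, a) to its sequence-form value, and observe_utility expects the utility vector u^(t) for the current iteration keyed in the same way.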
Bibliography
[Kuh50] H. W. Kuhn, "A Simplified Two-Person Poker," in Contributions to the Theory of Games,
vol. 1, Annals of Mathematics Studies 24. Princeton, NJ: Princeton University Press, 1950,
pp. 97–103.
∗ These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.