􀂽􀈚􀈚
􀂽􀈚􀈘􀈪􀈛􀈫􀈚􀈪􀈚􀈪􀈚

􀂽􀂽􀂽􀂽􀂽􀈚􀈘􀈚􀈘􀈚􀈘
􀈚􀃄􀃄􀈘􀈘􀈘􀈘


MIT Theory Reading Group Fri, Apr 19th 2024
Near-optimal learning in
imperfect-information sequential games
(and other combinatorial convex settings)
Gabriele Farina ( gfarina@mit.edu)
I am grateful for any reports of typos.
1 Learning dynamics in normal-form games
The theory of learning in games fundamentally tackles the following natural question: Can agents
discover strong strategies in a game by repeating the game many times, over time improving their
strategies by moving in the direction of the gradient of their utility?
In particular, the theory of no-regret dynamics gives a quantitative meaning to the notion of “learn-
ing” by demanding that each player keep their regret as small as possible. To fix ideas, let’s consider
the usual model of normal-form games. These are multiplayer games in which each player has a finite
set of strategies 𝒜𝑖, and the utility of each player is a function of the strategies chosen by all players.
At each repetition t = 1, 2, ... of the game, each player i independently selects a distribution λ_i^(t) ∈ Δ(𝒜_i) over their strategies. Since the players' distributions are independent, the expected utility of
each player is a multilinear function
u_i(λ_1^(t), ..., λ_n^(t)) = U_i · λ_1^(t) · ... · λ_n^(t).
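As a concrete illustration (an addition to these notes, with hypothetical names), the multilinear form above is just a tensor contraction, which could be evaluated as in the following minimal NumPy sketch:

import numpy as np

def expected_utility(U, lambdas):
    """Contract the payoff tensor U (one axis per player) against the players'
    mixed strategies, one axis at a time."""
    out = np.asarray(U, dtype=float)
    for lam in reversed(lambdas):        # contract the last remaining axis each time
        out = out @ np.asarray(lam, dtype=float)
    return float(out)

# Example: a 2x3 two-player game with uniform mixed strategies.
# U = np.arange(6).reshape(2, 3); expected_utility(U, [np.ones(2) / 2, np.ones(3) / 3])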
After T repetitions of play, the regret of each player i is the difference between the utility they would have obtained by playing the best single, fixed distribution in hindsight, and the utility they actually obtained by playing their distributions over strategies, that is,

Reg_i^(T) ≔ max_{λ̂_i ∈ Δ(𝒜_i)} { Σ_{t=1}^T u_i(λ̂_i, λ_{-i}^(t)) − u_i(λ_i^(t), λ_{-i}^(t)) }.
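As a small illustration (an addition to these notes, with hypothetical array names), the regret of a player can be computed from the history of play once one records, at each round t, the vector of utilities g^(t)[a] = u_i(a, λ_{-i}^(t)) of each pure action a:

import numpy as np

def external_regret(action_utils, strategies):
    """Reg_i^(T) after T rounds of play.

    action_utils[t][a] -- u_i(a, lambda_{-i}^(t)): utility of pure action a against
                          the other players' round-t distributions
    strategies[t][a]   -- the distribution lambda_i^(t) actually played
    """
    G = np.asarray(action_utils)     # shape (T, |A_i|)
    L = np.asarray(strategies)       # shape (T, |A_i|)
    realized = float(np.sum(G * L))  # sum_t u_i(lambda_i^(t), lambda_{-i}^(t))
    # The max over distributions is attained at a pure action,
    # since the objective is linear in the candidate distribution.
    best_in_hindsight = float(np.max(G.sum(axis=0)))
    return best_in_hindsight - realized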
The attractiveness of regret as a natural performance metric for learning in games stems from the
following connections with equilibrium computation.
Theorem 1.1. In a two-player zero-sum normal-form game, the product of the average distributions of play is an ε-approximate Nash equilibrium, where

ε ≤ (1/T) · Σ_{i∈{1,2}} Reg_i^(T).
This connection was pivotal in the development of superhuman AI in games, since learning dynamics
are typically first-order iterative methods that can be implemented and scaled much more efficiently
than linear programming.
However, the connection between regret and equilibrium is not limited to zero-sum games. In fact,
the connection extends to general-sum games as well, albeit with a slightly different guarantee.
Theorem 1.2. In an n-player general-sum normal-form game, the average product distribution of play is an ε-approximate coarse-correlated equilibrium, where

ε ≤ (1/T) · max_{i∈[n]} Reg_i^(T).
(Note the inversion between “average” and “product” relative to Theorem 1.1, as well as the switch from the sum of the regrets to their maximum.)
1.1 Multiplicative weights update and optimism
Arguably, the best-studied algorithm for learning in games is the multiplicative weights update
(MWU) algorithm. MWU prescribes that each player updates their distribution over strategies by
multiplying it by a factor that is proportional to the exponential of the utility gradient. The algo-
rithm is simple and has been shown to have strong guarantees in the context of no-regret learning
in games. It is optimal assuming that the other players might be playing adversarially, that is, with
the sole intent of increasing player 𝑖’s regret.
However, learning in games is usually a more benign setting, despite its obvious nonstationarity due
to the concurrent learning of all agents. In particular, when all players are trying to learn, one would
expect that their strategies are somewhat predictable, and that faster rates might be developed. This
intuition started a long line of research, which today is best represented by the theory of optimism
in learning dynamics. (More bibliographic remarks are available at the end.)
To fast forward to the current state of the art, the optimistic multiplicative weights update (OMWU)
algorithm is a simple extension of MWU, with superior regret guarantees. The difference with MWU
simply resides in the addition of a momentum term to the gradient. We present it next.
OMWU(player i):
1  λ_i^(1) ← uniform distribution over the player's strategy set 𝒜_i
2  g_i^(0) ≔ 0 ∈ ℝ^{𝒜_i}
3  for t = 1, 2, ... do
4      play strategy λ_i^(t)
5      compute gradient g_i^(t) ← ∇_{λ_i} u_i(λ_1^(t), ..., λ_n^(t))
6      compute optimistic gradient g̃_i^(t) ← 2·g_i^(t) − g_i^(t−1)    // Non-optimistic is g̃_i^(t) ← g_i^(t)
7      update λ_i^(t+1)[a] ∝ λ_i^(t)[a] · exp{η · g̃_i^(t)[a]}   ∀a ∈ 𝒜_i
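To make lines 5–7 concrete, here is a minimal NumPy sketch of a single OMWU step (an addition to these notes; the function name and arguments are hypothetical):

import numpy as np

def omwu_update(lam, grad, prev_grad, eta):
    """One OMWU step for a single player (lines 6-7 of the pseudocode above).

    lam       -- current distribution lambda_i^(t) over the action set
    grad      -- current gradient g_i^(t)
    prev_grad -- previous gradient g_i^(t-1) (all zeros at t = 1)
    eta       -- learning rate
    """
    g_tilde = 2.0 * grad - prev_grad      # optimistic gradient
    logits = np.log(lam) + eta * g_tilde  # multiplicative update, in log space
    logits -= logits.max()                # numerical stability before exponentiating
    new_lam = np.exp(logits)
    return new_lam / new_lam.sum()        # normalization implements the proportionality of line 7

Working in log space is only a numerical convenience; mathematically the update is exactly the multiplicative rule of line 7.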
1.2 Guarantees and near-optimal regret rates
When each player 𝑖 ∈ [𝑛] learns using OMWU with the same learning rate 𝜂 > 0, the following strong
properties are known to hold. For simplicity, we assume without loss of generality that all utilities
in the game are in the range [0, 1] (if not, rescaling all utilities to satisfy the condition leaves the
equilibria unchanged).
Theorem 1.3 (O(T^{1/4}) per-player regret; [Syr+15]). For all T, if η = T^{−1/4}/√(n−1), the regret of each player i ∈ [n] is bounded as

Reg_i^(T) ≤ (4 + log|𝒜_i|) · √(n−1) · T^{1/4}.

By Theorem 1.2, this implies convergence to the set of coarse-correlated equilibria at a rate of O_T(T^{−3/4}).
Theorem 1.4 (Near-optimal per-player regret; [DFG21]). There exist universal constants C, C′ > 1 so that, for all T, if η ≤ 1/(C · n · log^4 T), the regret of each player i ∈ [n] is bounded as

Reg_i^(T) ≤ log|𝒜_i|/η + C′ log T.

By Theorem 1.2, this implies convergence to the set of coarse-correlated equilibria at a rate of O_T(log^4(T)/T).
Theorem 1.5 (Optimal regret sum; [Syr+15]). If η ≤ 1/√(8(n−1)), at all times T the sum of the players' regrets satisfies

Σ_{i∈[n]} Reg_i^(T) ≤ (n/η) · max_{i∈[n]} log|𝒜_i|.

By Theorem 1.1, this implies convergence to the set of Nash equilibria in two-player zero-sum games at a rate of O_T(1/T).
1.3 The question for this talk
The analysis of the OMWU algorithm, especially when it comes to the more advanced results such
as the one by Daskalakis, C., Fishelson, M., & Golowich, N. [DFG21] mentioned in Theorem 1.4, is
tightly tied to the structure of normal-form games. What can be said about learning dynamics
for more expressive classes of games? Is it possible to extend the guarantees of OMWU to more
general settings?
When I set out to answer these questions, I was particularly interested in extending the guarantees
of OMWU to the setting of imperfect-information sequential games. These games are significantly
more complex than normal-form games, in that they also capture turns and imperfect information.
Equilibria in these games are much more difficult to compute and require misdirecting behavior
(such as bluffing) at equilibrium. Can we extend the guarantees of OMWU to the setting of imperfect-
information sequential games and other combinatorially structured games?
2 Extensive-form games (EFGs)
The standard representation of an imperfect-information extensive-form game is through its game
tree, which formalizes the interaction of the players as a directed tree. In the game tree, each non-
terminal node belongs to exactly one player, who acts at the node by picking one of the outgoing
edges (each labeled with an action name). Imperfect information is captured in this representation
by partitioning the nodes of each player into sets (called information sets) of nodes that are indis-
tinguishable to that player given his or her observations.
2.1 The model
We now provide more details and notation regarding extensive-form games.
Histories, actions, and payoffs. The game tree represents the strategic interaction of players as a
finite directed tree. Each node that is not a leaf of the game tree is associated with a unique acting
player. In an 𝑛-player game, the set of valid players is the set [𝑛] = {1, ..., 𝑛}. One of these players
might serve the role of a chance (or nature) player—a fictitious player that selects actions according
to a known, fixed probability distribution and models exogenous stochasticity of the environment
(for example, a roll of the dice, or drawing a card from a deck). Such a chance player does not
participate in the learning and has no incentives (that is, it is excluded from consideration when
defining equilibria).
The players keep acting until a leaf of the game tree—called a terminal node—is reached. Terminal
nodes are not associated with any acting player; the set of terminal nodes is denoted 𝒵. When the
game transitions to a terminal node z ∈ 𝒵, each player i ∈ [n] receives a payoff according to some payoff function u_i : 𝒵 → ℝ.
Imperfect information and information sets. To model imperfect information, the nodes of each
player i ∈ [n] are partitioned into a collection ℐ_i of so-called information sets. Each information set I ∈ ℐ_i groups together nodes that Player i cannot distinguish between when he or she acts there.
In the limit case in which all information sets are a singleton, the player never has any uncertainty
about which node they are acting at, and the game is said to have perfect information.
Since a player always knows what actions are available at a decision node, any two nodes ℎ, ℎ′ be-
longing to the same information set 𝐼 must have the same set of available actions. Correspondingly,
we can write 𝒜𝐼 to denote the set of actions available at any node that belongs to information set 𝐼.
Example 2.1. For example, consider the following two-player game, where black nodes belong
to Player 1 and white nodes belong to Player 2.
[Figure: game tree. The root A belongs to Player 1 and has actions 1 and 2. Action 1 leads to Player 2's node P, below which Player 1 acts at information sets B (actions 3, 4) and C (actions 5, 6). Action 2 leads to Player 2's node Q, whose two actions lead to the two indistinguishable nodes forming Player 1's information set D (actions 7, 8, 9).]
The information set D for Player 1 encodes the fact that Player 1 does not observe Player 2's action at Q.
Since nodes in the same information set are indistinguishable to the player, every strategy for the
player must select the same action at all nodes in the same information set.
Perfect recall. As is standard in the literature, we assume that the game has perfect recall, that is, the information sets are such that no player ever forgets their own past actions, nor any information once acquired.¹
¹The condition of perfect recall is formalized as follows. A player i ∈ [n] is said to have perfect recall if, for any information set I ∈ ℐ_i and any two nodes h, h′ ∈ I, the sequence of Player i's actions encountered along the path from the root to h and along the path from the root to h′ coincide (otherwise, Player i would be able to distinguish between the nodes, since the player remembers all of the actions they played in the past). The game has perfect recall if all players have perfect recall.
2.2 Reducing an extensive-form game to a normal-form game
What does it mean to have a mixed strategy for an extensive-form game? One classical answer is
the following. Consider a player, and imagine enumerating all their deterministic strategies for the
tree. A mixed strategy is then a probability distribution over these deterministic strategies.
Example 2.2. In the small game of Example 2.1, a mixed strategy for Player 1 is a probability
distribution over the following 7 strategies 𝑣1, ..., 𝑣7.
[Figure: for each of the strategies v_1, ..., v_7, the game tree of Example 2.1 with the selected actions highlighted.] Indexing the coordinates by the nine actions of Player 1, the seven strategies are:

       1  2  3  4  5  6  7  8  9
v_1 = (1, 0, 1, 0, 1, 0, 0, 0, 0)
v_2 = (1, 0, 1, 0, 0, 1, 0, 0, 0)
v_3 = (1, 0, 0, 1, 1, 0, 0, 0, 0)
v_4 = (1, 0, 0, 1, 0, 1, 0, 0, 0)
v_5 = (0, 1, 0, 0, 0, 0, 1, 0, 0)
v_6 = (0, 1, 0, 0, 0, 0, 0, 1, 0)
v_7 = (0, 1, 0, 0, 0, 0, 0, 0, 1)
These strategies are called the reduced² normal-form plans of the player. We denote the set of
all deterministic strategies of Player 𝑖 as 𝒱𝑖.
²The term “reduced” refers to the fact that no actions are specified for parts of the game that cannot be reached due to the player's own choices higher up in the game tree.
By considering the normal-form game in which each player’s strategy space is the set of all deter-
ministic strategies in the tree, we have converted the extensive-form game into its normal-form
equivalent.
Of course, the most glaring issue with this representation is that the number of strategies in the
normal-form game is exponential in the size of the game tree. For this reason, it was long believed
that operating with this normal-form representation was computationally infeasible. Historically,
this led to the development of specialized algorithms for extensive-form games, such as CFR and its
variants. As we show next, this belief was unfounded. In fact, it is possible to simulate the OMWU
dynamics in the normal-form equivalent of an extensive-form game exactly in polynomial time.
3 Learning in the normal form: Kernelized multiplicative weights for
0/1-polyhedral games
Contrary to the common belief that learning in the normal-form equivalent of an extensive-form
game is computationally infeasible, we show that the OMWU algorithm can be implemented in
polynomial time in the normal-form equivalent of an extensive-form game.
3.1 The convex game structure of the normal-form equivalent
The key insight is that the payoff matrix of the normal-form equivalent is heavily structured. In
particular, we show that the specific representation of strategies used in Example 2.2 allows us to
write the expected utility of each player as a multilinear function of the expected strategies of all
players. (We call a game with this property a convex game.)
Theorem 3.1. Given any choice of deterministic strategies 𝑣1, ..., 𝑣𝑛 for each player, the utility
of each player in the game is a multilinear function
𝑢𝑖(𝑣1, ..., 𝑣𝑛) = 𝑈𝑖 ⋅ 𝑣1 ⋅ ... ⋅ 𝑣𝑛
of the strategies, where the tensor 𝑈𝑖 has a number of nonzeros upper bounded by the number
of leaves of the game tree.
Proof. For any strategy v ∈ 𝒱_i of the player, we have that v[k] = 1 if and only if v selects all of Player i's actions on the path from the root of the game down to, and including, edge k. Hence, the indicator
of whether a terminal node 𝑧 ∈ 𝒵 is reached is the product of one entry of 𝑣𝑗 for all 𝑗 ∈ [𝑛].
Considering that the utility of the player is defined as the sum over all terminal nodes of the
indicator of whether the terminal node is reached times the player’s utility at the terminal node,
we obtain the result.
Let now λ_i ∈ Δ(𝒱_i) be independent mixed strategies for each player i. Then, the expected utility of each player i is the multilinear function of the expected normal-form strategies

𝔼_{v_j∼λ_j}[u_i(v_1, ..., v_n)] = U_i · 𝔼_{v_1∼λ_1}[v_1] · ... · 𝔼_{v_n∼λ_n}[v_n]
                              = U_i · (Σ_{v_1∈𝒱_1} λ_1[v_1] v_1) · ... · (Σ_{v_n∈𝒱_n} λ_n[v_n] v_n).    (1)
So, in particular, the gradient of the expected utility of each player i with respect to their mixed strategy λ_i is

∇_{λ_i} 𝔼_{v_j∼λ_j}[u_i(v_1, ..., v_n)] = (⟨g_i, v⟩)_{v∈𝒱_i},

where

ℝ^{d_i} ∋ g_i ≔ U_i · v̄_1 · ... · v̄_{i−1} · v̄_{i+1} · ... · v̄_n,   where   v̄_j ≔ Σ_{v_j∈𝒱_j} λ_j[v_j] v_j.
With this in mind, the OMWU algorithm over the normal-form strategies of the extensive-form game can be written in the following form.
EFG-OMWU(player i):
1  λ_i^(1) ← uniform distribution over 𝒱_i
2  g_i^(0) ≔ 0 ∈ ℝ^{d_i}
3  for t = 1, 2, ... do
4      play strategy λ_i^(t) ∈ Δ(𝒱_i)
5      compute expectations v̄_j^(t) ← Σ_{v_j∈𝒱_j} (λ_j^(t)[v_j] · v_j)   ∀j ∈ [n]   // ???
6      compute gradient g_i^(t) ← U_i · v̄_1^(t) · ... · v̄_{i−1}^(t) · v̄_{i+1}^(t) · ... · v̄_n^(t)   // Poly time in tree size
7      compute optimistic gradient g̃_i^(t) ← 2·g_i^(t) − g_i^(t−1)   // Poly time in tree size
8      update λ_i^(t+1)[v] ∝ λ_i^(t)[v] · exp{η · ⟨g̃_i^(t), v⟩}   ∀v ∈ 𝒱_i   // ???
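For intuition, here is a brute-force sketch (an addition to these notes, with hypothetical names) of lines 5 and 8 with the vertex set 𝒱_i enumerated explicitly as a 0/1 matrix. It is exponential in the size of the game tree, which is exactly the obstacle that the kernelized implementation developed in the rest of this section removes.

import numpy as np

def efg_omwu_step_bruteforce(lam, V, g_tilde, eta):
    """Brute-force version of lines 5 and 8 of EFG-OMWU.

    lam     -- current distribution over the vertices, shape (|V|,)
    V       -- matrix whose rows are the reduced normal-form plans, shape (|V|, d)
    g_tilde -- optimistic gradient in R^d
    """
    expected = lam @ V                           # line 5: sum_v lam[v] * v
    new_lam = lam * np.exp(eta * (V @ g_tilde))  # line 8: weight each vertex v by exp(eta * <g_tilde, v>)
    new_lam /= new_lam.sum()
    return expected, new_lam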
3.2 Main result
For ease of notation, from now on we will focus on a generic player i ∈ [n], and drop all subscripts i
from the notation. The main result that we seek to prove is the following.
Theorem 3.2 (Main theorem). There exists a kernel function K_𝒱 : ℝ^d × ℝ^d → ℝ, which depends on the combinatorial structure of the reduced normal-form strategies 𝒱 of the player, such that EFG-OMWU can be implemented using d + 1 evaluations of K_𝒱 at each iteration.
Note that Theorem 3.2 is crucially independent of the number of strategies |𝒱|, and depends only on the number d of the player's actions (which is polynomial in the size of the input game tree)!
Hence, as long as the kernel function can be evaluated efficiently, EFG-OMWU can be simulated efficiently too.
3.3 The 0/1-polyhedral kernel
In order to introduce the notion of 0/1-polyhedral kernel, we need to first introduce the notion of
0/1-polyhedral feature map.
Definition 3.1 (0/1-polyhedral feature map). The 0/1-polyhedral feature map 𝜙_𝒱 is defined as

𝜙_𝒱 : ℝ^d → ℝ^𝒱,   𝜙_𝒱(x)[v] ≔ Π_{k : v[k]=1} x[k]   ∀v ∈ 𝒱.
As is common with kernels, we define the 0/1-polyhedral kernel as the inner product of the feature
maps.
Definition 3.2 (0/1-polyhedral kernel).

K_𝒱 : ℝ^d × ℝ^d → ℝ,   K_𝒱(x, y) ≔ ⟨𝜙_𝒱(x), 𝜙_𝒱(y)⟩ = Σ_{v∈𝒱} Π_{k : v[k]=1} x[k] y[k].
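For small strategy sets the two definitions can be checked directly by enumeration. The brute-force sketch below (an addition to these notes, with hypothetical names) will also serve as a reference against the efficient recursive kernel of Section 3.6.

import numpy as np

def feature_map(x, V):
    """Brute-force phi_V(x) (Definition 3.1): one entry per vertex v, equal to the
    product of x[k] over the coordinates k where v[k] = 1."""
    # prod_{k : v[k]=1} x[k]  ==  prod_k x[k]**v[k]  for 0/1 vertices
    return np.prod(np.where(np.asarray(V) == 1, np.asarray(x, dtype=float), 1.0), axis=1)

def kernel_bruteforce(x, y, V):
    """Brute-force K_V(x, y) (Definition 3.2)."""
    return float(feature_map(x, V) @ feature_map(y, V))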
3.4 Keeping track of the distribution over vertices
We start by showing how the 0/1-polyhedral kernel can help implement Line 8 of EFG-OMWU
efficiently. The key insight is that the strategies 𝜆(𝑡) computed by EFG-OMWU are fully captured
by the feature map of a low-dimensional vector at all iterations.
Theorem 3.3. At all times t, the distribution λ^(t) over strategies 𝒱 computed by EFG-OMWU is proportional to the feature map of the vector

b^(t) ≔ exp{η · Σ_{τ=1}^{t−1} g̃^(τ)}.
Proof. We prove the result by induction over the time t.
• Base case. At time t = 1, we have b^(1) = 1, and therefore 𝜙_𝒱(b^(1)) = 1 ∝ (1/|𝒱|) · 1 = λ^(1).
• Inductive step. At time t + 1, the probability of the strategy v ∈ 𝒱 computed by EFG-OMWU is

λ^(t+1)[v] ∝ λ^(t)[v] · exp{η · ⟨g̃^(t), v⟩}.

The key insight now is that since v ∈ {0, 1}^d, then

exp{η · ⟨g̃^(t), v⟩} = exp{η · Σ_{k : v[k]=1} g̃^(t)[k]} = Π_{k : v[k]=1} exp{η · g̃^(t)[k]}.

Substituting the inductive hypothesis,

λ^(t+1)[v] ∝ 𝜙_𝒱(b^(t))[v] · Π_{k : v[k]=1} exp{η · g̃^(t)[k]}
          = (Π_{k : v[k]=1} b^(t)[k]) · (Π_{k : v[k]=1} exp{η · g̃^(t)[k]})
          = Π_{k : v[k]=1} b^(t+1)[k]
          = 𝜙_𝒱(b^(t+1))[v],

completing the proof.
In fact, we can even slightly refine the previous result by quantifying exactly the proportionality
constant. We do so in the next corollary.
Corollary 3.1. At all times t, one has

λ^(t) = 𝜙_𝒱(b^(t)) / K_𝒱(b^(t), 1).

Proof. Since we know from Theorem 3.3 that λ^(t)[v] ∝ 𝜙_𝒱(b^(t))[v], and Σ_{v∈𝒱} λ^(t)[v] = 1, the proportionality constant must be the inverse of

Σ_{v∈𝒱} 𝜙_𝒱(b^(t))[v] = Σ_{v∈𝒱} 𝜙_𝒱(b^(t))[v] · 1 = Σ_{v∈𝒱} 𝜙_𝒱(b^(t))[v] · 𝜙_𝒱(1)[v] = K_𝒱(b^(t), 1),

completing the proof.
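In code, Theorem 3.3 and Corollary 3.1 say that the exponentially large distribution λ^(t) never needs to be stored: a d-dimensional state vector suffices. A hedged sketch (an addition to these notes; the brute-force feature map sketched in Section 3.3 is used only to sanity-check the claim):

import numpy as np

def komwu_state_update(b, g_tilde, eta):
    """Theorem 3.3: after observing g_tilde at time t, the new state is
    b^(t+1) = b^(t) * exp(eta * g_tilde), entrywise."""
    return b * np.exp(eta * g_tilde)

def implicit_distribution(b, V):
    """Corollary 3.1, brute force (for checking only): lambda^(t) equals
    phi_V(b^(t)) / K_V(b^(t), 1)."""
    phi = np.prod(np.where(np.asarray(V) == 1, b, 1.0), axis=1)  # phi_V(b)
    return phi / phi.sum()  # the denominator equals K_V(b, 1), since phi_V(1) = 1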
3.5 Reconstructing the expectation
We now show how one can reconstruct the expectation

Σ_{v∈𝒱} λ^(t)[v] · v,

which is needed to compute the utility gradient in the EFG-OMWU algorithm (Line 5). The key is provided in the next theorem, which extends a nice insight by Takimoto, E., & Warmuth, M. K. [TW03].
Theorem 3.4. At all times t, the expectation Σ_{v∈𝒱} λ^(t)[v] · v can be computed via d + 1 kernel computations according to the formula

Σ_{v∈𝒱} λ^(t)[v] · v = (1 − K_𝒱(b^(t), ē_1)/K_𝒱(b^(t), 1), ..., 1 − K_𝒱(b^(t), ē_d)/K_𝒱(b^(t), 1)),

where

ē_k ≔ (1, ..., 1, 0, 1, ..., 1) = 1 − e_k ∈ ℝ^d

is the vector that is equal to 1 in all coordinates except the k-th, where it is zero.
Proof. The key insight is that, for all k ∈ [d],

𝜙_𝒱(ē_k)[v] = Π_{j : v[j]=1} ē_k[j] = {0 if v[k] = 1; 1 otherwise} = 1 − v[k].

So,

(𝜙_𝒱(1) − 𝜙_𝒱(ē_k))[v] = 𝜙_𝒱(1)[v] − 𝜙_𝒱(ē_k)[v] = 1 − (1 − v[k]) = v[k],

and we can write, for all k ∈ [d],

(Σ_{v∈𝒱} λ^(t)[v] · v)[k] = Σ_{v∈𝒱} λ^(t)[v] · (𝜙_𝒱(1) − 𝜙_𝒱(ē_k))[v]
  = (1/K_𝒱(b^(t), 1)) · Σ_{v∈𝒱} 𝜙_𝒱(b^(t))[v] · (𝜙_𝒱(1)[v] − 𝜙_𝒱(ē_k)[v])
  = (1/K_𝒱(b^(t), 1)) · (K_𝒱(b^(t), 1) − K_𝒱(b^(t), ē_k)) = 1 − K_𝒱(b^(t), ē_k)/K_𝒱(b^(t), 1),

which is the statement.
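A hedged sketch of Theorem 3.4 in code (an addition to these notes); the argument kernel can be any implementation of K_𝒱, for example the brute-force one above or the recursive one of the next subsection.

import numpy as np

def expected_strategy(b, kernel):
    """Theorem 3.4: recover sum_v lambda^(t)[v] * v from d + 1 kernel evaluations.

    b      -- the current state vector b^(t), as a NumPy array in R^d
    kernel -- a callable kernel(x, y) evaluating K_V(x, y)
    """
    d = b.shape[0]
    ones = np.ones(d)
    denom = kernel(b, ones)      # K_V(b, 1)
    out = np.empty(d)
    for k in range(d):
        e_bar = ones.copy()
        e_bar[k] = 0.0           # the vector 1 - e_k
        out[k] = 1.0 - kernel(b, e_bar) / denom
    return out

Together with the state update of Theorem 3.3, this accounts for the d + 1 kernel evaluations per iteration promised by Theorem 3.2.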
3.6 Kernel computation in EFGs
We now show that, as long as the kernel can be computed efficiently, the expectation can be computed efficiently as well. In particular, as we will see in Theorem 3.5, KOMWU can be implemented with linear-time iterations in the number of sequences d_i.
Worst-case linear complexity for a single evaluation. We start by verifying that the sequence-form kernel can be evaluated in linear time for any pair of points x, y ∈ ℝ^{d_i}. To do so, we introduce a partial kernel function for every information set I ∈ ℐ,

K_I : ℝ^{d_i} × ℝ^{d_i} → ℝ,   K_I(x, y) ≔ Σ_{v∈𝒱_{≽I}} Π_{k : v[k]=1} x[k] y[k],

where 𝒱_{≽I} denotes the projection of the strategy set 𝒱 onto only those actions at I or below (removing duplicates). We have the following.
Theorem 3.5. For any vectors x, y ∈ ℝ^{d_i}, the following two recursive relationships hold:

K_𝒱(x, y) = Π_{I∈ℐ_top} K_I(x, y),    (2)

where ℐ_top denotes the “top-level” information sets (those information sets that contain nodes where the player is acting for the first time), and, for all information sets I ∈ ℐ,

K_I(x, y) = Σ_{a∈𝒜_I} ( x[Ia] · y[Ia] · Π_{I′∈𝒞(Ia)} K_{I′}(x, y) ),    (3)

where 𝒞(Ia) denotes those information sets that are immediate successors of I when the player chooses action a at I. In particular, (2) and (3) give a recursive algorithm to evaluate the polyhedral kernel K_𝒱 associated with the strategy space of any player i in an imperfect-information extensive-form game in linear time in d_i.
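A hedged sketch of the recursion (2)–(3) in code (an addition to these notes). The tree data structure used here (an information set storing, for each action a, the coordinate index of the sequence Ia and the list of information sets immediately below it) is an assumption made purely for this illustration.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Infoset:
    # For each action a at this information set: (index of the sequence Ia in R^{d_i},
    # list of information sets immediately reachable after playing a).
    actions: List[Tuple[int, List["Infoset"]]] = field(default_factory=list)

def sequence_form_kernel(x, y, top_infosets):
    """Evaluate K_V(x, y) via the recursion of Theorem 3.5."""
    def K_I(infoset):
        total = 0.0
        for coord, children in infoset.actions:
            term = x[coord] * y[coord]
            for child in children:        # product over C(Ia), Eq. (3)
                term *= K_I(child)
            total += term
        return total

    result = 1.0
    for I in top_infosets:                # product over top-level information sets, Eq. (2)
        result *= K_I(I)
    return result

# Example 2.1 (0-based coordinates 0..8 for the nine actions 1..9):
# B = Infoset([(2, []), (3, [])]); C = Infoset([(4, []), (5, [])])
# D = Infoset([(6, []), (7, []), (8, [])])
# A = Infoset([(0, [B, C]), (1, [D])])
# sequence_form_kernel([1.0] * 9, [1.0] * 9, [A])  # == 7.0 = |V|, matching Example 2.2

Since each sequence Ia is visited exactly once, the recursion runs in linear time in d_i, as claimed.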
Corollary 3.2. For each player 𝑖, EFG-OMWU can be implemented in polynomial time in the
size of the game tree. As a consequence, all the regret guarantees of OMWU can be extended
to the setting of imperfect-information extensive-form games as a black box.
We also make the following tangential remark.
Remark 3.1. Surprisingly, even in its non-optimistic version, the kernelized MWU algorithm achieves better regret bounds than all prior algorithms for learning in extensive-form games.
Amortized constant-time kernel computation. We can refine the result above and obtain an implementation of EFG-OMWU with linear-time per-iteration complexity in the size of the game tree, by exploiting the structure of the particular set of d + 1 kernel evaluations needed at every iteration and amortizing their computation.
3.7 Applicability beyond extensive-form games
The kernelized approach we have introduced is not limited to extensive-form games. In fact, it can
be applied to any 0/1-polyhedral game, that is, a game where
• Each player i has a strategy set 𝒱_i ⊆ {0, 1}^{d_i}; and
• The expected utility of each player is multilinear in the expectations of the strategies of all players.
Kernelized OMWU guarantees that, for each player, d_i + 1 kernel computations are sufficient at each iteration to recover the expected strategy and simulate OMWU in the normal-form game defined by the 𝒱_i. In the original paper, we show that the kernel can be computed efficiently for a number of combinatorial domains.
Binary strings. When 𝒱 = {0, 1}^d, then

K_𝒱(x, y) = (1 + x[1]·y[1]) · (1 + x[2]·y[2]) · ... · (1 + x[d]·y[d])

can be evaluated in linear time in d.
N-sets. When 𝒱 = {𝑥 ∈ {0, 1}𝑑 : ‖𝑥‖1 = 𝑛}, the kernel can be computed efficiently via dynamic
programming.
Set of flows in a DAG. In this case, the kernel can be computed in linear time in the number of DAG edges by performing dynamic programming along a topological order of the DAG. This case was already covered by Takimoto, E., & Warmuth, M. K. [TW03], though we remark that in kernelized OMWU the concept of weight pushing is not needed, so the final algorithms are slightly different.
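As a hedged illustration of the first two cases above (an addition to these notes), the binary-string kernel is a single product, while the N-set kernel is the n-th elementary symmetric polynomial of the products x[k]·y[k], which a standard dynamic program computes:

import numpy as np

def binary_string_kernel(x, y):
    """K_V for V = {0,1}^d: the product of (1 + x[k]*y[k]) over all coordinates."""
    return float(np.prod(1.0 + np.asarray(x) * np.asarray(y)))

def nset_kernel(x, y, n):
    """K_V for V = {v in {0,1}^d : ||v||_1 = n}: the n-th elementary symmetric
    polynomial of z[k] = x[k]*y[k], computed by dynamic programming in O(n*d) time."""
    z = np.asarray(x) * np.asarray(y)
    e = np.zeros(n + 1)
    e[0] = 1.0                     # empty product
    for zk in z:
        for j in range(n, 0, -1):  # reverse order so each coordinate is used at most once
            e[j] += zk * e[j - 1]
    return float(e[n])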
4 Further material and bibliography
The utilities u^(t) observed while learning in games are structured and often have nice properties (for example, they change slowly over time). It is then natural to wonder what improved guarantees
can be achieved in games specifically, and consequently how fast convergence to equilibrium can
be guaranteed. This fundamental question was first formulated and addressed by Daskalakis, C.,
Deckelbaum, A., & Kim, A. [DDK11] within the context of zero-sum games. Since then, there has
been considerable interest in extending their guarantee to more general settings (Chen, X., & Peng,
B. [CP20]; Daskalakis, C., & Golowich, N. [DG22]; Foster, D. J., Li, Z., Lykouris, T., Sridharan, K.,
& Tardos, É. [Fos+16]; Piliouras, G., Sim, R., & Skoulakis, S. [PSS22]; Rakhlin, S., & Sridharan, K.
[RS13]; Syrgkanis, V., Agarwal, A., Luo, H., & Schapire, R. E. [Syr+15]). As mentioned, Daskalakis,
C., Fishelson, M., & Golowich, N. [DFG21] established that when all players in a general normal-
form game employ OMWU, the regret of each player grows nearly optimally as O(log^4 T) after T
repetitions of the game, leading to an exponential improvement over the guarantees obtained using
traditional techniques within the no-regret framework.
The kernelized OMWU algorithm presented in this talk was introduced in Farina, G., Lee, C.-W., Luo, H., & Kroer, C. [Far+22].
Bibliography
[Syr+15] V. Syrgkanis, A. Agarwal, H. Luo, and R. E. Schapire, “Fast convergence of regularized
learning in games,” in Advances in Neural Information Processing Systems, 2015.
[DFG21] C. Daskalakis, M. Fishelson, and N. Golowich, “Near-Optimal No-Regret Learning in
General Games,” CoRR, 2021.
[TW03] E. Takimoto and M. K. Warmuth, “Path kernels and multiplicative updates,” Journal of
Machine Learning Research, vol. 4, pp. 773–818, 2003.
[DDK11] C. Daskalakis, A. Deckelbaum, and A. Kim, “Near-optimal no-regret algorithms for zero-
sum games,” in Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA), 2011.
[CP20] X. Chen and B. Peng, “Hedging in games: Faster convergence of external and swap re-
grets,” in Proceedings of the Annual Conference on Neural Information Processing Systems
(NeurIPS), 2020.
[DG22] C. Daskalakis and N. Golowich, “Fast rates for nonparametric online learning: from real-
izability to learning in games,” in Proceedings of the Annual Symposium on Theory of
Computing (STOC), 2022.
[Fos+16] D. J. Foster, Z. Li, T. Lykouris, K. Sridharan, and É. Tardos, “Learning in Games: Ro-
bustness of Fast Convergence,” in Proceedings of the Annual Conference on Neural Infor-
mation Processing Systems (NIPS), 2016.
[PSS22] G. Piliouras, R. Sim, and S. Skoulakis, “Beyond Time-Average Convergence: Near-Opti-
mal Uncoupled Online Learning via Clairvoyant Multiplicative Weights Update,” in Pro-
ceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS),
2022.
[RS13] S. Rakhlin and K. Sridharan, “Optimization, learning, and games with predictable se-
quences,” in Proceedings of the Annual Conference on Neural Information Processing
Systems (NIPS), 2013.
[Far+22] G. Farina, C.-W. Lee, H. Luo, and C. Kroer, “Kernelized Multiplicative Weights for
0/1-Polyhedral Games: Bridging the Gap Between Learning in Extensive-Form and Nor-
mal-Form Games,” in Proceedings of the International Conference on Machine Learn-
ing (ICML), 2022. [Online]. Available: https://proceedings.mlr.press/v162/farina22a/farina22a.pdf
