




MIT 6.S890 — Topics in Multiagent Learning Thu, Sep 18th 2024
Lecture 5
Learning in games: Algorithms (Part I)
Instructor: Prof. Gabriele Farina ( gfarina@mit.edu)
In Lecture 4 we saw how no-external-regret dynamics recover several solution concepts of interest, including Nash equilibria in two-player zero-sum games, normal-form coarse-correlated equilibria in multiplayer general-sum games, and, more generally, solutions to convex-concave saddle-point problems. In this lecture, we begin exploring how no-external-regret dynamics can be constructed, starting from normal-form games.
1 No-regret algorithms for normal-form games
For a player in a normal-form game, the strategy space is the probability simplex Δ(𝐴) of all distributions over the set 𝐴 of actions available to the player. As a reminder, constructing a regret minimizer for Δ(𝐴) means that we need to build a mathematical object that supports two operations:
• NextStrategy() has the effect that the regret minimizer will output an element 𝑥(𝑡) ∈ Δ(𝐴);
• ObserveUtility(𝑢(𝑡)) provides the environment's feedback to the regret minimizer, in the form of a utility function 𝑢(𝑡). Today, we will focus on a case of particular relevance for normal-form games: the case of linear utilities (as such are the utilities in a normal-form game). In particular, to fix notation, we will let
    𝑢(𝑡) : 𝑥 ↦ ⟨𝑔(𝑡), 𝑥⟩,
  where 𝑔(𝑡) ∈ ℝ^|𝐴| is the vector that defines the linear function (called with the letter 𝑔 to suggest the idea of it being the "gradient vector") and ⟨⋅, ⋅⟩ denotes the standard dot product. Since the utility in the game cannot be unbounded, we assume that the 𝑔(𝑡) can be arbitrary but bounded in norm.
In this notation, the (external) regret is defined as the quantity
    \operatorname{Reg}^{(T)} \coloneqq \max_{\hat{x} \in \Delta(A)} \left\{ \sum_{t=1}^{T} \langle g^{(t)}, \hat{x} \rangle - \langle g^{(t)}, x^{(t)} \rangle \right\}.
Our goal is to make sure that the regret grows sublinearly in 𝑇 no matter the utility vectors 𝑔(𝑡)
chosen by the environment.
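To fix ideas, here is a small Python sketch (illustrative, not part of the original notes) that computes the external regret of a sequence of strategies under linear utilities; the function name and array conventions are our own.

    import numpy as np

    def external_regret(gradients, strategies):
        """Compute Reg^(T) = max over fixed strategies of (their total utility) minus (realized utility).

        gradients:  array of shape (T, |A|), the vectors g^(t)
        strategies: array of shape (T, |A|), the points x^(t) in the simplex
        """
        G = np.asarray(gradients, dtype=float)
        X = np.asarray(strategies, dtype=float)
        # Utility collected by the learner over the T rounds.
        realized = np.sum(G * X)
        # Best fixed strategy in hindsight: a linear function over the simplex
        # is maximized at a vertex, i.e., at a single action.
        best_fixed = np.max(G.sum(axis=0))
        return best_fixed - realized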
General principle. Most algorithms known today for the task operate by prioritizing actions based
on how much regret they have incurred. In particular, a key quantity to define modern algorithms
is the vector of cumulated action regrets
    r^{(t)} \coloneqq \sum_{\tau=1}^{t} \left( g^{(\tau)} - \langle g^{(\tau)}, x^{(\tau)} \rangle \mathbf{1} \right).
(Note that \operatorname{Reg}^{(t)} = \max_{a \in A} r^{(t)}_a.) The natural question is: how to prioritize?
1.1 Follow-the-leader
Perhaps the most natural idea would be to always play the action with the highest cumulated regret (breaking ties, say, lexicographically). After all, this is the action we most wish we had played in the past. This algorithm is called follow-the-leader (FTL). Unfortunately, this idea is known not to work. To see that, consider the following sequence of gradient vectors:
    g^{(1)} = \begin{pmatrix} 0 \\ 1/2 \end{pmatrix}, \quad g^{(2)} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad g^{(3)} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad g^{(4)} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad g^{(5)} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad \text{etc.}
In this case, at time 1, we pick the first action and score a utility of 0. We then compute 𝑟(1) = (0, 1/2) and at time 𝑡 = 2 pick the second action, since it has the highest regret. This results in a score of 0 and an updated regret vector 𝑟(2) = (1, 1/2). So, at time 𝑡 = 3, we will pick the first action, resulting again in a score of 0 and an updated regret vector 𝑟(3) = (1, 3/2), and so on. Overall, it is easy to see that in all this jumping around, the regrets grow linearly.
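A quick way to convince oneself of this is to simulate FTL on the sequence above. The sketch below is illustrative (not part of the original notes); ties are broken toward the first action via argmax, matching the lexicographic tie-breaking described above.

    import numpy as np

    def ftl_on_alternating_sequence(T):
        """Run follow-the-leader on g^(1)=(0,1/2), then alternating (1,0),(0,1)."""
        r = np.zeros(2)                  # cumulated action regrets
        for t in range(1, T + 1):
            a = int(np.argmax(r))        # FTL: action with highest cumulated regret
            if t == 1:
                g = np.array([0.0, 0.5])
            elif t % 2 == 0:
                g = np.array([1.0, 0.0])
            else:
                g = np.array([0.0, 1.0])
            x = np.eye(2)[a]             # deterministic strategy on action a
            r += g - np.dot(g, x)        # regret update
        return r.max()                   # Reg^(T)

    # The regret grows roughly like T/2 rather than sublinearly:
    # for instance, ftl_on_alternating_sequence(1000) evaluates to about 500.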
Where to go from here? A few ideas seem natural:
• We can replace picking the action with the highest regret with picking actions proportionally
to their regret; this leads to the algorithm called regret matching, which we will discuss in
Section 1.2.
• We can smooth out the maximum operator by using the softmax function. This leads to the
multiplicative weights update algorithm, which we will discuss in Section 1.4.
• We can regularize the maximum operator by adding a term that penalizes large jumps in
the strategy space. This leads to a very flexible algorithm called follow-the-regularized-leader
algorithm, which we will discuss in Section 2 as well as in the next lecture.
All these ideas work. Before we move on, though, it is worth knowing that—while flawed—the follow-
the-leader algorithm is not completely hopeless.
Remark 1.1 (Fictitious play). In the canonical learning setup (see Lecture 4), the use of follow-
the-leader by all players goes under the name of fictitious play [Bro49; Bro51]. In certain classes of
games, including two-player zero-sum games, fictitious play is able to recover a Nash equilibrium,
albeit with a potentially exponentially slow convergence rate [Rob51].
1.2 The Regret Matching (RM) algorithm
The regret matching algorithm [HM00] picks probabilities proportional to the "ReLU" (that is, the componentwise positive part [⋅]+) of the cumulated regrets, that is,
    𝑥(𝑡+1) ∝ [𝑟(𝑡)]+    (1)
whenever [𝑟(𝑡)]+ ≠ 𝟎, and an arbitrary point otherwise. The algorithm is presented in pseudocode in Algorithm 1.
Algorithm 1: Regret matching (RM)
1   𝑟(0) ← 0 ∈ ℝ^𝐴,  𝑥(0) ← 𝟏/|𝐴| ∈ Δ(𝐴)
2   function NextStrategy()
3       if [𝑟(𝑡−1)]+ ≠ 𝟎
4           return 𝑥(𝑡) ← [𝑟(𝑡−1)]+ / ‖[𝑟(𝑡−1)]+‖₁
5       else
6           return 𝑥(𝑡) ← any strategy in Δ(𝐴)
7   function ObserveUtility(𝑔(𝑡))
8       𝑟(𝑡) ← 𝑟(𝑡−1) + 𝑔(𝑡) − ⟨𝑔(𝑡), 𝑥(𝑡)⟩ 𝟏
Algorithm 2: Regret matching+ (RM+)
1   𝑟(0) ← 0 ∈ ℝ^𝐴,  𝑥(0) ← 𝟏/|𝐴| ∈ Δ(𝐴)
2   function NextStrategy()
3       if [𝑟(𝑡−1)]+ ≠ 𝟎
4           return 𝑥(𝑡) ← [𝑟(𝑡−1)]+ / ‖[𝑟(𝑡−1)]+‖₁
5       else
6           return 𝑥(𝑡) ← any strategy in Δ(𝐴)
7   function ObserveUtility(𝑔(𝑡))
8       𝑟(𝑡) ← [𝑟(𝑡−1) + 𝑔(𝑡) − ⟨𝑔(𝑡), 𝑥(𝑡)⟩ 𝟏]+
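For concreteness, here is a minimal Python sketch of Algorithms 1 and 2 (class and method names are our own); the plus flag toggles the extra thresholding on the last line of Algorithm 2 that distinguishes RM+ from RM.

    import numpy as np

    class RegretMatching:
        """Regret matching (RM) and regret matching+ (RM+), as in Algorithms 1-2."""

        def __init__(self, num_actions, plus=False):
            self.r = np.zeros(num_actions)              # cumulated action regrets r^(t)
            self.x = np.full(num_actions, 1.0 / num_actions)
            self.plus = plus                            # True -> RM+, False -> RM

        def next_strategy(self):
            positive = np.maximum(self.r, 0.0)          # [r^(t-1)]^+
            if positive.sum() > 0:
                self.x = positive / positive.sum()
            else:
                # Arbitrary point in the simplex; the uniform strategy is a valid choice.
                self.x = np.full_like(self.r, 1.0 / self.r.size)
            return self.x

        def observe_utility(self, g):
            self.r += g - np.dot(g, self.x)             # r + g^(t) - <g^(t), x^(t)> 1
            if self.plus:
                self.r = np.maximum(self.r, 0.0)        # extra thresholding of RM+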
Theorem 1.1 (Regret bound for RM). The regret matching algorithm (Algorithm 1) is an external regret minimizer, and satisfies the regret bound Reg(𝑇) ≤ Ω√𝑇, where Ω is the maximum value of ‖𝑔(𝑡) − ⟨𝑔(𝑡), 𝑥(𝑡)⟩𝟏‖₂ up to time 𝑇.
In particular, if all the gradient vectors satisfy ‖𝑔(𝑡)‖∞ ≤ 1 at all times 𝑡, then the regret satisfies Reg(𝑇) ≤ √(𝑇 ⋅ |𝐴|).
Proof. We start by observing that, at all times 𝑡,
    \big(g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)} \rangle \mathbf{1}\big)^\top x^{(t+1)} = 0
(this is always true, not just for regret matching: since the entries of 𝑥(𝑡+1) sum to 1, both terms equal ⟨𝑔(𝑡+1), 𝑥(𝑡+1)⟩). Plugging in the definition (1) of how 𝑥(𝑡+1) is constructed, we therefore conclude that
    \big(g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)} \rangle \mathbf{1}\big)^\top [r^{(t)}]^+ = 0.    (2)
(Note that the above equation holds trivially when [𝑟(𝑡)]+ = 𝟎 and therefore 𝑥(𝑡+1) is picked arbitrarily.) Now, we use the inequality
    \|[a + b]^+\|_2^2 \le \|[a]^+ + b\|_2^2,
applied to 𝑎 = 𝑟(𝑡) and 𝑏 = 𝑔(𝑡+1) − ⟨𝑔(𝑡+1), 𝑥(𝑡+1)⟩𝟏. In particular, since by definition 𝑎 + 𝑏 = 𝑟(𝑡+1), we have
    \|[r^{(t+1)}]^+\|_2^2 \le \|[r^{(t)}]^+ + \big(g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)} \rangle \mathbf{1}\big)\|_2^2
     = \|[r^{(t)}]^+\|_2^2 + \|g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)} \rangle \mathbf{1}\|_2^2 + 2\,\big(g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)} \rangle \mathbf{1}\big)^\top [r^{(t)}]^+
     = \|[r^{(t)}]^+\|_2^2 + \|g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)} \rangle \mathbf{1}\|_2^2    (from (2))
     \le \|[r^{(t)}]^+\|_2^2 + \Omega^2.
Hence, by induction, we have
    \|[r^{(T)}]^+\|_2^2 \le T\,\Omega^2, \qquad \text{and therefore} \qquad \operatorname{Reg}^{(T)} = \max_{a \in A} r^{(T)}_a \le \max_{a \in A} \big[r^{(T)}_a\big]^+ \le \|[r^{(T)}]^+\|_2 \le \Omega\sqrt{T}.
The proof of the first part is then complete. The second part then just follows from using the inequality
    \Omega \le \|g^{(t)} - \langle g^{(t)}, x^{(t)} \rangle \mathbf{1}\|_2 \le \|g^{(t)}\|_2 \le \sqrt{|A|} \cdot \|g^{(t)}\|_\infty.
Our interest in the regret bound under the specific condition ‖𝑔(𝑡)‖∞ ≤ 1 is explained by the following remark.
Remark 1.2. Within the canonical learning setup (see Lecture 4), the entries of 𝑔(𝑡) are the expected payoffs of all actions of the players. Thus, the condition ‖𝑔(𝑡)‖∞ ≤ 1 corresponds to the condition that the game's payoffs lie in the interval [−1, 1].
As of today, regret matching and its variants are still often some of the most practical algorithms
for learning in games.
Remark 1.3. One very appealing property of the regret matching algorithm is its lack of
hyperparameters. It just works “out of the box”.
1.3 The regret matching+ (RM+) algorithm
The regret matching+ algorithm [Tam14; Tam+15] is given in Algorithm 2. It differs from RM only
on the last line, where a further thresholding is added. That small change has the effect that actions
with negative cumulated regret (that is, “bad” actions) are treated as actions with 0 regret. Hence,
intuitively, if a bad action were to become good over time, it would take less time for RM+ to notice
and act on that change. Because of that, regret matching+ has stronger practical performance and
is often preferred over regret matching in the game solving literature.
With a simple modification to the analysis of RM, the same bound as RM can be proven.
Theorem 1.2 (Regret bound for RM+). The RM+ algorithm (Algorithm 2) is an external regret minimizer, and satisfies the regret bound Reg(𝑇) ≤ Ω√𝑇, where Ω is the maximum value of ‖𝑔(𝑡) − ⟨𝑔(𝑡), 𝑥(𝑡)⟩𝟏‖₂ up to time 𝑇.
So again, if all the gradient vectors satisfy ‖𝑔(𝑡)‖∞ ≤ 1 at all times 𝑡, then the regret satisfies Reg(𝑇) ≤ √(𝑇 ⋅ |𝐴|).
1.4 Multiplicative weights update (MWU)
If we replace the "hard" maximum of follow-the-leader with the "soft" maximum given by
    x^{(t)}_a = \operatorname{softmax}_a\big(\eta\, r^{(t)}\big) \coloneqq \frac{\exp\big(\eta\, r^{(t)}_a\big)}{\sum_{a' \in A} \exp\big(\eta\, r^{(t)}_{a'}\big)},
where 𝜂 > 0 is an inverse temperature parameter, then we obtain the multiplicative weights update algorithm [FS97]. This algorithm is presented in Algorithm 3.
Algorithm 3: Multiplicative weights update (MWU)
1   𝑟(0) ← 0 ∈ ℝ^𝐴,  𝑥(0) ← 𝟏/|𝐴| ∈ Δ(𝐴)
2   function NextStrategy()
3       return 𝑥(𝑡) ← softmax(𝜂 𝑟(𝑡−1))
4   function ObserveUtility(𝑔(𝑡))
5       𝑟(𝑡) ← 𝑟(𝑡−1) + 𝑔(𝑡) − ⟨𝑔(𝑡), 𝑥(𝑡)⟩ 𝟏
Compared to RM, MWU has a different flavor: it uses softmax instead of ReLU. This change in prioritization function has a pretty significant impact on the regret bound that multiplicative weights guarantees. In particular, the following can be shown:
Theorem 1.3 (Regret bound for MWU). The regret cumulated by the MWU algorithm can be upper bounded as
    \operatorname{Reg}^{(T)} \le \frac{\log|A|}{\eta} + \eta \sum_{t=1}^{T} \|g^{(t)}\|_\infty^2 - \frac{1}{8\eta} \sum_{t=2}^{T} \|x^{(t)} - x^{(t-1)}\|_1^2,
no matter the sequence of gradient vectors 𝑔(𝑡) chosen by the environment. In particular, if all the gradient vectors satisfy ‖𝑔(𝑡)‖∞ ≤ 1 at all times 𝑡 and 𝜂 = √(log|𝐴| / 𝑇), the regret satisfies Reg(𝑇) ≤ √(𝑇 log|𝐴|).
We will see the proof of Theorem 1.3 in the next lecture, as a reflection of a substantially more
general framework. For now, we remark on a crucial aspect of MWU. Compared with the regret bounds of RM and RM+, the regret bound of MWU has only a logarithmic dependence on the number of actions |𝐴|. Although in practice RM/RM+ tend to outperform MWU (all while requiring no hyperparameter tuning), this property has profound theoretical implications. We will discuss this
in more depth when discussing combinatorial games in Lecture 19. The gist of it is that several
important classes of games can be converted into exponentially large normal-form games (this is the
case of sequential games, for example). Since MWU only has logarithmic dependence on the number
of actions of the resulting normal-form games, this shows that—at least ignoring computation—
external regret minimization is possible with polynomial dependence on the game size even in these
classes of complex, structured games.
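Before moving on, here is a minimal Python sketch of MWU as a companion to Algorithm 3 (illustrative naming; the softmax is numerically stabilized by subtracting the maximum, which does not change the resulting strategy).

    import numpy as np

    class MultiplicativeWeights:
        """Multiplicative weights update (MWU), as in Algorithm 3."""

        def __init__(self, num_actions, eta):
            self.eta = eta                       # inverse temperature / learning rate
            self.r = np.zeros(num_actions)       # cumulated action regrets r^(t)
            self.x = np.full(num_actions, 1.0 / num_actions)

        def next_strategy(self):
            z = self.eta * self.r
            z -= z.max()                         # stabilize exp(); softmax is shift-invariant
            w = np.exp(z)
            self.x = w / w.sum()                 # x^(t) = softmax(eta * r^(t-1))
            return self.x

        def observe_utility(self, g):
            self.r += g - np.dot(g, self.x)      # same regret update as RM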
2 More general approaches: FTRL and OMD
Finally, we turn our attention to the third way of obtaining no-regret algorithms, that is, by
considering a regularized (i.e., smoothed) version of the follow-the-leader algorithm discussed above.
The idea is that, instead of playing by always putting 100% of the probability mass on the action
with highest cumulated regret, we look for the distribution that maximizes the expected cumulated
regret, minus some regularization term that prevents us from putting all the mass on a single action.
Definition 2.1 (Distance-generating function for a set). A differentiable function 𝜓 : 𝒳 → ℝ is a distance-generating function for the set 𝒳 if it is 1-strongly convex on 𝒳 with respect to some norm ‖·‖, that is,
    \big(\nabla\psi(x) - \nabla\psi(x')\big)^\top (x - x') \ge \|x - x'\|^2 \qquad \forall\, x, x' \in \mathcal{X}.
For the purposes of this lecture, we will also assume that the function attains its minimum in the relative interior of Δ(𝐴).
2.1 FTRL in the normal-form case
Definition 2.2 (FTRL, simplex case). Let 𝜓 be a distance-generating function for the strategy
set Δ(𝐴). The follow-the-regularized-leader algorithm (FTRL; sometimes also called “regularized
follow-the-leader”) defines the choice of strategy¹
    x^{(t)} \coloneqq \arg\max_{\hat{x} \in \Delta(A)} \left\{ \langle r^{(t-1)}, \hat{x} \rangle - \frac{1}{\eta}\, \psi(\hat{x}) \right\}.
The regularization term −𝜓(x̂) limits the amount of variation between consecutive strategies. This makes intuitive sense: if 𝜂 → 0, for example, 𝑥(𝑡) is constant (and equal to arg min_{x̂∈Δ(𝐴)} 𝜓(x̂)). When 𝜂 = ∞, we recover the follow-the-leader algorithm, where strategies can jump arbitrarily. For intermediate 𝜂, as you might expect, the amount of variation between strategies at consecutive times is bounded above by a quantity proportional to 𝜂:
    \|x^{(t+1)} - x^{(t)}\| \le \eta\, \|g^{(t)}\|_*,
where ‖·‖* is the dual norm of ‖·‖ (if ‖·‖ is the Euclidean norm, its dual is also the Euclidean norm; if ‖·‖ is the ℓ1 norm, then its dual is the ℓ∞ norm). We will see the regret bound for FTRL in the general case in Section 2.3.
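To make Definition 2.2 concrete: when 𝜓 is the negative entropy (the regularizer of Section 2.4), the arg max above has a closed-form solution, namely the softmax of 𝜂 𝑟(𝑡−1), recovering MWU. The sketch below is illustrative (the function name is our own).

    import numpy as np

    def ftrl_entropy_strategy(r_prev, eta):
        """FTRL step on the simplex with the negative-entropy regularizer.

        Solves  argmax_{x in simplex}  <r_prev, x> - (1/eta) * sum_a x_a log x_a,
        whose solution is x_a proportional to exp(eta * r_prev[a]), i.e., the MWU strategy.
        """
        z = eta * np.asarray(r_prev, dtype=float)
        z -= z.max()              # numerical stabilization; softmax is shift-invariant
        w = np.exp(z)
        return w / w.sum()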
2.2 OMD in the normal-form case
The last general method that we mention today is the online mirror descent (OMD) algorithm.
This is the online generalization of the mirror descent optimization method, one of the workhorses
of modern optimization theory. A good way to think about OMD is as a generalization of gradient
ascent.
Definition 2.3 (OMD, simplex case). Let 𝜓 : Δ(𝐴) → ℝ be a distance-generating function. The
online mirror descent algorithm (OMD) defines the choice of strategy
    x^{(t)} \coloneqq \arg\max_{\hat{x} \in \Delta(A)} \left\{ \langle g^{(t-1)}, \hat{x} \rangle - \frac{1}{\eta}\, D_\psi(\hat{x} \,\|\, x^{(t-1)}) \right\}, \quad \text{where} \quad D_\psi(\hat{x} \,\|\, x) \coloneqq \psi(\hat{x}) - \psi(x) - \langle \nabla\psi(x), \hat{x} - x \rangle.
Two choices of regularizer are standard for the probability simplex:
• The negative entropy function 𝐻(𝑥) ≔ ∑_{𝑎∈𝐴} 𝑥_𝑎 log 𝑥_𝑎. This is 1-strongly convex with respect to the ℓ1 norm ‖·‖₁ (and therefore also with respect to ‖·‖₂). OMD instantiated with this regularizer leads to the multiplicative weights update algorithm seen in Section 1.4; see also Section 2.4. (A closed-form sketch of this update is given right after this list.)
• The squared Euclidean norm 𝜓(𝑥) ≔ ½‖𝑥‖₂², which is 1-strongly convex with respect to the ℓ2 norm ‖·‖₂. OMD instantiated with this regularizer leads to the online projected gradient descent algorithm, which we will discuss in Section 2.5.
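For the entropic regularizer, the OMD step also has a closed form: the Bregman divergence is the KL divergence, and the arg max is a multiplicative update. A minimal sketch (illustrative, not part of the original notes; it assumes the previous strategy has full support, as is the case when starting from the uniform strategy):

    import numpy as np

    def omd_entropy_step(x_prev, g_prev, eta):
        """One OMD step on the simplex with the negative-entropy regularizer.

        Solves  argmax_{x in simplex}  <g_prev, x> - (1/eta) * KL(x || x_prev),
        whose solution is x_a proportional to x_prev[a] * exp(eta * g_prev[a]).
        """
        x_prev = np.asarray(x_prev, dtype=float)
        g_prev = np.asarray(g_prev, dtype=float)
        w = x_prev * np.exp(eta * g_prev)
        return w / w.sum()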
2.3 The general case
FTRL and OMD apply well beyond the case of probability simplices. In fact, they can be applied
to any convex and compact domain 𝒳. In this case, the vector of regrets 𝑟(𝑡) must be replaced with the cumulative gradient vector ∑_{𝜏=1}^{𝑡} 𝑔(𝜏), as follows:
Definition 2.4 (FTRL, general version). For a generic convex and compact domain 𝒳, the FTRL
algorithm produces strategies 𝑥(𝑡) by solving the optimization problem
    x^{(t)} \coloneqq \arg\max_{\hat{x} \in \mathcal{X}} \left\{ \Big\langle \sum_{\tau=1}^{t-1} g^{(\tau)}, \hat{x} \Big\rangle - \frac{1}{\eta}\, \psi(\hat{x}) \right\}.
Definition 2.5 (OMD, general version). The OMD algorithm produces strategies 𝑥(𝑡) by solving
the optimization problem
    x^{(t)} \coloneqq \arg\max_{\hat{x} \in \mathcal{X}} \left\{ \langle g^{(t-1)}, \hat{x} \rangle - \frac{1}{\eta}\, D_\psi(\hat{x} \,\|\, x^{(t-1)}) \right\}.
We also note the following connection between the two algorithms.
Remark 2.1. When 𝜓 is a Legendre regularizer (that is, the gradients of 𝜓 go to infinity at the
boundary of 𝒳), the OMD algorithm is equivalent to the FTRL algorithm.
We state the following regret bound for the general case; we will see an even more powerful result next time.
Theorem 2.1 (Regret bound for FTRL and OMD). The regret cumulated by the FTRL and OMD algorithms is upper bounded by
    \operatorname{Reg}^{(T)} \le \frac{\max_{x, x' \in \mathcal{X}} \{\psi(x) - \psi(x')\}}{\eta} + \eta \sum_{t=1}^{T} \|g^{(t)}\|_*^2 - \frac{1}{8\eta} \sum_{t=2}^{T} \|x^{(t)} - x^{(t-1)}\|^2,
where ‖·‖* is the dual norm of ‖·‖. In particular, if the norm of the gradient vectors is bounded, then by picking learning rate 𝜂 = 1/√𝑇, we obtain that Reg(𝑇) is bounded as roughly √𝑇 (a sublinear function!) at all times 𝑇.
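To see where the roughly-√𝑇 bound comes from, suppose ‖𝑔(𝑡)‖* ≤ 𝐿 for all 𝑡 and write 𝐷 for the maximum of 𝜓(𝑥) − 𝜓(𝑥′) over 𝒳 (here 𝐿 and 𝐷 are just shorthands we introduce for this quick check). Dropping the negative term and plugging 𝜂 = 1/√𝑇 into the bound gives
    \operatorname{Reg}^{(T)} \;\le\; \frac{D}{\eta} + \eta \sum_{t=1}^{T} \|g^{(t)}\|_*^2 \;\le\; D\sqrt{T} + \frac{L^2\, T}{\sqrt{T}} \;=\; (D + L^2)\,\sqrt{T}.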
2.4 Multiplicative weights update (MWU) as a special case
Multiplicative weights update is the special case of both FTRL and OMD in which the regularizer 𝜓 is set to the negative entropy function
    H(x) \coloneqq \sum_{a \in A} x_a \log x_a.
Example 2.1. The plot below displays the negative entropy function in the case of |𝐴| = 2 actions.
[Figure: plot of 𝐻(𝑥, 1 − 𝑥) for 𝑥 ∈ [0, 1].]
The negative entropy function has the following properties:
• it is 1-strongly convex with respect to the ℓ1 norm ‖·‖₁;
• the maximum of the function is 0, which is attained at any deterministic strategy;
• its minimum (that is, maximum entropy) is attained at the uniformly random strategy 𝑥 = (1/|𝐴|, ..., 1/|𝐴|), at which 𝐻(𝑥) = |𝐴| ⋅ (1/|𝐴|) log(1/|𝐴|) = −log|𝐴|.
Plugging the bound above into the general analysis of FTRL and OMD algorithms (Theorem 2.1)
yields Theorem 1.3.
2.5 Online Projected Gradient Ascent
In the special case in which 𝜓(𝑥) ≔ ½‖𝑥‖₂², the OMD algorithm reduces to online projected gradient ascent.
Definition 2.6 (Online projected gradient ascent). The online projected gradient ascent algorithm is the special case of OMD in which the regularizer is (half) the squared Euclidean norm, that is, 𝜓(𝑥) = ½‖𝑥‖₂². The choice of strategy is given by
    x^{(t)} = \arg\max_{\hat{x} \in \mathcal{X}} \left\{ \langle g^{(t-1)}, \hat{x} \rangle - \frac{1}{2\eta} \|\hat{x} - x^{(t-1)}\|_2^2 \right\} = \Pi_{\mathcal{X}}\big(x^{(t-1)} + \eta\, g^{(t-1)}\big),
where Π_𝒳 denotes the Euclidean projection onto 𝒳.
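When 𝒳 = Δ(𝐴), the projection Π_𝒳 can be computed exactly by a standard sorting-based routine. The sketch below is illustrative (not taken from the notes) and implements one OGD step on the simplex under that standard projection algorithm.

    import numpy as np

    def project_to_simplex(v):
        """Euclidean projection of v onto the probability simplex."""
        n = v.size
        u = np.sort(v)[::-1]                  # sort entries in decreasing order
        css = np.cumsum(u)
        # Largest index (0-based) whose entry remains positive after shifting.
        rho = np.nonzero(u + (1.0 - css) / np.arange(1, n + 1) > 0)[0][-1]
        theta = (1.0 - css[rho]) / (rho + 1)
        return np.maximum(v + theta, 0.0)

    def ogd_step(x_prev, g_prev, eta):
        """One online projected gradient ascent step on the simplex."""
        return project_to_simplex(np.asarray(x_prev, dtype=float) + eta * np.asarray(g_prev, dtype=float))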
Since the squared Euclidean norm is 1-strongly convex with respect to the ℓ2 norm ‖·‖₂, the regret bound
for OGD can be derived from the general bound for OMD (Theorem 2.1) and is as follows.
Theorem 2.2 (Regret bound for OGD). The regret cumulated by the OGD algorithm can be upper bounded as
    \operatorname{Reg}^{(T)} \le \frac{1}{\eta} + \eta \sum_{t=1}^{T} \|g^{(t)}\|_2^2 - \frac{1}{8\eta} \sum_{t=2}^{T} \|x^{(t)} - x^{(t-1)}\|_2^2,
no matter the sequence of gradient vectors 𝑔(𝑡) chosen by the environment. In particular, if all the gradient vectors satisfy ‖𝑔(𝑡)‖∞ ≤ 1 at all times 𝑡 and 𝜂 = 1/√(𝑇 ⋅ |𝐴|), the regret satisfies Reg(𝑇) ≤ √(𝑇 ⋅ |𝐴|).
The following plots illustrate the behavior of OGD and MWU in a simple 2 × 2 game.
Example 2.2. The plots below show the behavior of the online projected gradient ascent (OGD) and multiplicative weights update (MWU) algorithms in the 2 × 2 game given by
    U_1 = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}.
The unique Nash equilibrium of the game is at 𝑥 = (2/3, 1/3), 𝑦 = (1/3, 2/3). The purple dot indicates the starting strategy. The gray dotted line tracks the profile of average strategies, which converges to an approximate Nash equilibrium as proved in Lecture 4.
[Figure: trajectories of the second coordinates (𝑥(𝑡)₂, 𝑦(𝑡)₂) under OGD (𝜂 = 0.1) and MWU (𝜂 = 0.25); both axes range from 0 to 1.]
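The qualitative behavior of Example 2.2 can be reproduced with a few lines of Python. The sketch below is illustrative (it is not the code used to generate the figure); it assumes the game is zero-sum, so that player 2's utility matrix is −U₁, and runs MWU in self-play for both players.

    import numpy as np

    def softmax(z):
        z = z - z.max()
        w = np.exp(z)
        return w / w.sum()

    def mwu_self_play(U1, T=10_000, eta=0.25):
        """Self-play with MWU in a two-player zero-sum game with payoff matrix U1."""
        n, m = U1.shape
        rx, ry = np.zeros(n), np.zeros(m)         # cumulated regrets of the two players
        avg_x, avg_y = np.zeros(n), np.zeros(m)   # running sums of the strategies
        for t in range(1, T + 1):
            x, y = softmax(eta * rx), softmax(eta * ry)
            gx = U1 @ y                           # gradient for the maximizing player
            gy = -U1.T @ x                        # gradient for the minimizing player
            rx += gx - gx @ x
            ry += gy - gy @ y
            avg_x += x
            avg_y += y
        return avg_x / T, avg_y / T               # average strategies

    # As in the figure above, the average strategies approach the Nash equilibrium
    # x = (2/3, 1/3), y = (1/3, 2/3):
    x_bar, y_bar = mwu_self_play(np.array([[2.0, 1.0], [0.0, 2.0]]))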
Bibliography
[Bro49] G. W. Brown, Some notes on computation of games solutions. Rand Corporation, 1949.
[Bro51] G. W. Brown, "Iterative solution of games by fictitious play," Activity Analysis of Production and Allocation, vol. 13, no. 1, p. 374, 1951.
[Rob51] J. Robinson, “An iterative method of solving a game,” Annals of Mathematics, vol. 54,
no. 2, pp. 296–301, 1951, [Online]. Available: http://www.jstor.org/stable/1969530
[HM00] S. Hart and A. Mas-Colell, “A Simple Adaptive Procedure Leading to Correlated Equi-
librium,” Econometrica, vol. 68, no. 5, pp. 1127–1150, 2000, [Online]. Available: http://
www.jstor.org/stable/2999445
[Tam14] O. Tammelin, “Solving large imperfect information games using CFR+,” arXiv, 2014,
[Online]. Available: https://arxiv.org/abs/1407.5042
[Tam+15] O. Tammelin, N. Burch, M. Johanson, and M. Bowling, “Solving Heads-up Limit Texas
Hold'em,” in International Joint Conference on Artificial Intelligence (IJCAI), 2015.
[FS97] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning
and an application to boosting,” Journal of computer and system sciences, vol. 55, no.
1, pp. 119–139, 1997.
Changelog
• Sep 24: expanded discussion on online projected gradient ascent and added Remark 2.1.
• Sep 25: fixed a typo in Theorem 2.2.
• Sep 27: fixed a typo in the proof of Theorem 1.1.
These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
¹In particular, 𝑥(1) ≔ arg min_{x̂∈Δ(𝐴)} 𝜓(x̂), since the regrets of all actions are 0 at the beginning.
