MIT 6.S890 — Topics in Multiagent Learning Tue, Sep 17th 2024
Lecture 4
Learning in games: Foundations
Instructor: Prof. Gabriele Farina ( gfarina@mit.edu)★
With this class we begin to explore what it means to “learn” in a game, and how that “learning”,
which is intrinsically a dynamic and local (per-player) concept, relates to the much more static and
global concept of game-theoretic equilibrium.
1 Hindsight rationality and Φ-regret
What does it mean to “learn” in games? Multiple answers are possible. Today, however, we focus on
one powerful answer, based on the concept of hindsight rationality.
Take the point of view of one player in a game, and let 𝒳 denote their set of available strategies.
In normal-form games, we have seen that a strategy is just a distribution over the set of available
actions 𝐴 for the player, so 𝒳 = Δ(𝐴). At each time 𝑡 = 1, 2, ..., the player will play some strategy
𝑥(𝑡) ∈ 𝒳, receive some form of feedback, and will incorporate that feedback to formulate a “better”
strategy 𝑥(𝑡+1) ∈ 𝒳 for the next repetition of the game. A typical (and natural) choice of “feedback”
is just the utility of the player, given what all the other agents played. However, for the purposes of
the abstract model we are building today, let’s not make any assumptions about how the feedback
is assigned; we will strive to build algorithms that perform competitively under any feedback, even
adversarial feedback.
Now suppose that the game is played infinitely many times, and that looking back at what was played
by the player we realize that every single time the player played a certain strategy 𝑥, they would have
been strictly better off consistently playing a different strategy 𝑥′ instead. Can we really say that the player
has “learnt” how to play? Perhaps not. This concept goes under the name of hindsight rationality:
Definition 1.1 (Hindsight rationality, informal). The player has “learnt” to play the game if
looking back at the history of play, they cannot think of any transformation 𝜙 : 𝒳 → 𝒳 of their
strategies that, when applied to the whole history of play, would have given strictly better utility
to the player.
We have thus arrived at the following formalization.
Definition 1.2 (Φ-regret minimizer). Given the convex and compact strategy set 𝒳 and a set
Φ of linear transformations 𝜙 : 𝒳 → 𝒳, a Φ-regret minimizer for the set 𝒳 is a model for a
decision maker that repeatedly interacts with a black-box environment. At each time 𝑡, the
regret minimizer interacts with the environment through two operations:
• NextStrategy() takes no input, and has the effect that the regret minimizer will output
an element 𝑥(𝑡) ∈ 𝒳.
• ObserveUtility(𝑢(𝑡)) provides the environment’s feedback to the regret minimizer, in the
form of a linear utility function 𝑢(𝑡) : 𝒳 → ℝ that evaluates how good the last-output point
𝑥(𝑡) was. The utility function can depend adversarially on the outputs 𝑥(1), ..., 𝑥(𝑡).
Its quality metric is its cumulative Φ-regret, defined as the quantity
\[
\Phi\text{-}\mathrm{Reg}^{(T)} \coloneqq \max_{\hat{\phi} \in \Phi} \left\{ \sum_{t=1}^{T} \Big( u^{(t)}\big(\hat{\phi}(x^{(t)})\big) - u^{(t)}\big(x^{(t)}\big) \Big) \right\}.
\]
The goal for a Φ-regret minimizer is to guarantee that its Φ-regret grows asymptotically sublinearly as time 𝑇 increases, no matter the sequence of utility functions 𝑢(𝑡).
Calls to NextStrategy and ObserveUtility alternate: first, the regret minimizer
will output a point 𝑥(1), then it will receive feedback 𝑢(1) from the environment, then it will output
a new point 𝑥(2), and so on. The decision making encoded by the regret minimizer is online, in
the sense that at each time 𝑡, the output of the regret minimizer can depend on the prior outputs
𝑥(1), ..., 𝑥(𝑡−1) and corresponding observed utility functions 𝑢(1), ..., 𝑢(𝑡−1), but no information about
future utilities is available.
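To make this interaction protocol concrete, here is a minimal Python sketch of the interface such a device could expose. The names RegretMinimizer, next_strategy, and observe_utility are our own illustrative choices (the lecture only specifies the abstract operations NextStrategy and ObserveUtility), and linear utility functions are represented by their gradient vectors.

```python
from abc import ABC, abstractmethod
import numpy as np


class RegretMinimizer(ABC):
    """Illustrative interface mirroring NextStrategy() / ObserveUtility(u) of Definition 1.2.

    Strategies are points of a convex, compact set X (for normal-form games, the
    simplex Delta(A), stored as numpy arrays). Since the feedback u^(t) is linear,
    we represent it by a vector g^(t) with u^(t)(x) = <g^(t), x>.
    """

    @abstractmethod
    def next_strategy(self) -> np.ndarray:
        """Return the next strategy x^(t) in X."""

    @abstractmethod
    def observe_utility(self, utility_vector: np.ndarray) -> None:
        """Incorporate the (possibly adversarial) linear feedback for the last output."""
```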
1.1 Some notable choices for the set of transformations Φ considered
The size of the set of transformations Φ considered by the player defines a natural notion of how
“rational” the agent is. There are several choices of interest for Φ for a normal-form strategy space
𝒳 = Δ(𝐴).
• Φ = set of all stochastic matrices mapping Δ(𝐴) → Δ(𝐴). This notion of Φ-regret is known
under the name of swap regret, and it is related to convergence to the set of correlated
equilibria.
• Φ = set of all “probability mass transports” on 𝒳, defined as
\[
\Phi = \{\phi_{a \to b}\}_{a,b \in A}, \qquad \text{where} \quad \big(\phi_{a \to b}(x)\big)_s \coloneqq
\begin{cases}
0 & \text{if } s = a \text{ (remove mass from } a\text{...)}\\
x_b + x_a & \text{if } s = b \text{ (... and give it to } b\text{)}\\
x_s & \text{otherwise.}
\end{cases}
\]
This is known as internal regret. (A small code sketch of these transformation families appears after this list.)
Theorem 1.1 (Informal; formal version in Theorem 2.3). When all agents in a multiplayer
general-sum normal-form game play so that their internal or swap regret grows sublinearly,
their average correlated distribution of play converges to the set of correlated equilibria of
the game.
In sequential games, the above concept extends to Φ = a particular set of linear transformations
called trigger deviation functions. It is known that in this case the Φ-regret can be efficiently
bounded with a polynomial dependence on the size of the game tree. The reason why this
choice of deviation functions is important is given by the following fact.
Theorem 1.2 (Informal). When all agents in a multiplayer general-sum extensive-form
game play so that their Φ-regret relative to trigger deviation functions grows sublinearly,
their average correlated distribution of play converges to the set of extensive-form correlated
equilibria of the game.
• Φ = constant transformations. In this case, we are only requiring that the player not regret
substituting all of the strategies they played with the same strategy $\hat{x} \in \Delta(A)$. Φ-regret
according to this set of transformations Φ is usually called external regret, or more simply just
regret. While this seems like an extremely restricted notion of rationality, it actually turns out
to be already extremely powerful. We will spend the rest of this class seeing why.
Theorem 1.3 (Informal; formal version in Theorem 2.3). When all agents in a multiplayer
general-sum normal-form game play so that their external regret grows sublinearly, their
average correlated distribution of play converges to the set of coarse correlated equilibria
of the game.
Corollary 1.1 (Informal). When all agents in a two-player zero-sum normal-form game
play so that their external regret grows sublinearly, their average strategies converge to
the set of Nash equilibria of the game.
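Referring back to the three families of deviations above, the following minimal sketch shows how each acts on a mixed strategy in Δ(𝐴). This is only an illustration; the helper names are ours and numpy is assumed.

```python
import numpy as np


def apply_stochastic_matrix(M: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Swap deviations: any column-stochastic matrix M (nonnegative entries,
    # each column summing to 1) maps Delta(A) into Delta(A).
    return M @ x


def mass_transport(a: int, b: int, x: np.ndarray) -> np.ndarray:
    # Internal deviations phi_{a->b}: move all probability mass from action a onto b.
    if a == b:
        return x.copy()
    y = x.copy()
    y[b] += y[a]
    y[a] = 0.0
    return y


def constant_map(x_hat: np.ndarray, x: np.ndarray) -> np.ndarray:
    # External (constant) deviations: ignore the played strategy and output x_hat.
    return x_hat


x = np.array([0.5, 0.3, 0.2])
print(mass_transport(0, 2, x))  # [0.  0.3 0.7]
```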
1.2 An important special case: regret minimization
The special case where Φ is chosen to be the set of constant transformations is so important that it
warrants its own special definition and notation.
Definition 1.3 (Regret minimizer). Let 𝒳 be a set. An external regret minimizer for 𝒳—or
simply “regret minimizer for 𝒳”—is a Φconst-regret minimizer for the special set of constant
transformations
\[
\Phi_{\text{const}} \coloneqq \{\phi_{\hat{x}} : x \mapsto \hat{x}\}_{\hat{x} \in \mathcal{X}}.
\]
Its corresponding Φconst-regret is called “external regret” or simply “regret”, and it is indicated
with
\[
\mathrm{Reg}^{(T)} \coloneqq \max_{\hat{x} \in \mathcal{X}} \left\{ \sum_{t=1}^{T} \Big( u^{(t)}(\hat{x}) - u^{(t)}\big(x^{(t)}\big) \Big) \right\}.
\]
Again, the goal for a regret minimizer is to ensure its cumulative regret Reg(𝑇 ) grows sublinearly
in 𝑇 . An important result asserts the existence of algorithms that guarantee sublinear regret for
any convex and compact domain 𝒳, typically of the order Reg(𝑇 ) = 𝑂(√𝑇 ) asymptotically. As we
will show below, external regret minimization alone is enough to guarantee convergence to Nash
equilibrium in two-player zero-sum games, to coarse correlated equilibrium in multiplayer general-
sum games, to best responses to static stochastic opponents in multiplayer general-sum games, and
much more.
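The lecture has not yet committed to any specific algorithm. As a concrete placeholder, below is a minimal sketch of one classical regret minimizer for the simplex Δ(𝐴), multiplicative weights (also known as Hedge), which attains O(√(T log |A|)) external regret for bounded utilities. The class name, the fixed step size, and the regret helper are illustrative choices, not the lecture's notation.

```python
import numpy as np


class Hedge:
    """Multiplicative weights (Hedge) on the simplex Delta(A).

    Matches the next_strategy / observe_utility interface sketched earlier.
    This is one standard regret minimizer; with bounded utilities and a
    suitably tuned step size eta it guarantees O(sqrt(T log |A|)) regret.
    """

    def __init__(self, n_actions: int, eta: float = 0.1):
        self.eta = eta
        self.weights = np.ones(n_actions)

    def next_strategy(self) -> np.ndarray:
        return self.weights / self.weights.sum()

    def observe_utility(self, utility_vector: np.ndarray) -> None:
        # Exponential update: actions with higher observed utility gain weight.
        self.weights *= np.exp(self.eta * utility_vector)


def external_regret(utilities: list[np.ndarray], strategies: list[np.ndarray]) -> float:
    # Reg^(T) = max_a sum_t u^(t)[a]  -  sum_t <u^(t), x^(t)>.
    cumulative = np.sum(utilities, axis=0)
    realized = sum(float(u @ x) for u, x in zip(utilities, strategies))
    return float(cumulative.max() - realized)
```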
■ Teaser: From regret minimization to Φ-regret minimization. As discussed, regret minimization is
one instantiation of Φ-regret minimization—and perhaps the smallest sensible instantiation. Then,
clearly, coming up with a regret minimizer for a set 𝒳 cannot be harder than the problem of coming
up with a Φ-regret minimizer for 𝒳 for richer sets of transformation functions Φ. It might then seem
surprising that there exists a construction that reduces Φ-regret minimization to regret minimization.
We will discuss this further in Lecture 8.
2 Applications of regret minimization
To establish regret minimization as a meaningful abstraction for learning in games, we check that
regret minimizing and Φ-regret minimizing dynamics indeed lead to the expected behavior in
common scenarios.
Definition 2.1 (Canonical learning setup). In the cases that we will mention, a recurring idea
will be to consider the setup in which all players 𝑖 ∈ [𝑛] play according to the outputs $x_i^{(t)}$ of a
Φ-regret minimizer. At each iteration, the utility function that each player 𝑖 observes from the
environment is the utility function 𝑢𝑖 of that player, evaluated in the strategies played by all
players, that is,
\[
u^{(t)} : \Delta(A_i) \to \mathbb{R}, \qquad u^{(t)}(x_i) \coloneqq u_i\big(x_i,\, x_{-i}^{(t)}\big).
\]
Given its importance, we give to this natural setup the name of canonical learning setup.
2.1 Learning a best response against stochastic opponents
As a first smoke test, let’s verify that over time a regret minimizer would learn how to best respond to
static, stochastic opponents. Specifically, consider this scenario. We are playing a repeated 𝑛-player
general-sum game with multilinear utilities (this captures normal-form games and extensive-form
games alike), where Players 𝑖 = 1, ..., 𝑛 − 1 play stochastically, that is, at each 𝑡 they independently
sample a strategy $x_i^{(t)} \in \mathcal{X}_i$ from a fixed distribution (the same at every 𝑡, and unknown to the other players).
Formally, this means that
\[
\mathbb{E}\big[x_i^{(t)}\big] = x_i \qquad \forall\, i = 1, \dots, n-1,\ \ t = 1, 2, \dots
\]
Player 𝑛, on the other hand, is learning in the game, picking strategies according to some algorithm
that guarantees sublinear external regret, where the feedback observed by Player 𝑛 at each time 𝑡
is their own linear utility function:
\[
u^{(t)} : \mathcal{X}_n \ni x_n \mapsto u_n\big(x_1^{(t)}, \dots, x_{n-1}^{(t)}, x_n\big).
\]
Then, the average of the strategies played by Player 𝑛 converges almost surely to a best response to
𝑥1, ..., 𝑥𝑛−1, that is,
\[
\frac{1}{T} \sum_{t=1}^{T} x_n^{(t)} \;\xrightarrow{\ \mathrm{a.s.}\ }\; \arg\max_{\hat{x}_n \in \mathcal{X}_n} \big\{ u_n(x_1, \dots, x_{n-1}, \hat{x}_n) \big\}.
\]
(You should try to prove this!)
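This claim is easy to sanity-check numerically (a simulation, of course, is not a proof). The sketch below assumes the Hedge class from the earlier sketch is in scope, and uses an arbitrary random 3×4 payoff matrix and a fixed opponent mix, both purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.uniform(size=(3, 4))                    # learner's payoffs: rows = learner's actions
opponent_mix = np.array([0.2, 0.5, 0.2, 0.1])   # fixed, unknown distribution of the opponent

learner = Hedge(n_actions=3, eta=0.05)          # Hedge as sketched above (assumption)
avg = np.zeros(3)
T = 20_000
for t in range(1, T + 1):
    x = learner.next_strategy()
    a_opp = rng.choice(4, p=opponent_mix)       # opponent samples i.i.d. from its mix
    learner.observe_utility(U[:, a_opp])        # linear feedback: u^(t)(x) = <U e_{a_opp}, x>
    avg += (x - avg) / t                        # running average of the learner's strategies

best_response_action = int(np.argmax(U @ opponent_mix))
print(avg.round(3), best_response_action)       # avg should concentrate on the best response
```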
2.2 Convergence to the set of Nash equilibria in two-player zero-sum games
It turns out that regret minimization can be used to converge to bilinear saddle points, that is,
solutions to problems of the form
\[
\max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} x^\top U y, \tag{1}
\]
where 𝒳 and 𝒴 are convex compact sets and 𝑈 is a matrix. These types of optimization problems
are pervasive in game theory. The canonical prototype of a bilinear saddle point problem is the
computation of Nash equilibria in two-player zero-sum games (either normal-form or extensive-form).
There, a Nash equilibrium is the solution to (1) where 𝒳 and 𝒴 are the strategy spaces of Player 1
and Player 2 respectively (probability simplexes for normal-form games or sequence-form polytopes
for extensive-form games), and 𝑈 is the payoff matrix for Player 1. Other examples include social-
welfare-maximizing correlated equilibria and optimal strategies in two-team zero-sum adversarial
team games.
The idea behind using regret minimization to converge to solutions of bilinear saddle-point problems is to
use self-play. We instantiate two regret minimization algorithms, ℛ𝒳 and ℛ𝒴, for the domains of the
maximization and minimization problem, respectively. At each time 𝑡 the two regret minimizers
output strategies 𝑥(𝑡) and 𝑦(𝑡), respectively. Then, they receive feedback $u_{\mathcal{X}}^{(t)}$ and $u_{\mathcal{Y}}^{(t)}$, defined as
\[
u_{\mathcal{X}}^{(t)} : x \mapsto \big(U y^{(t)}\big)^\top x, \qquad u_{\mathcal{Y}}^{(t)} : y \mapsto -\big(U^\top x^{(t)}\big)^\top y.
\]
We can summarize the process pictorially as follows.
[Figure: the self-play loop. At each time 𝑡, the regret minimizers ℛ𝒳 and ℛ𝒴 observe the previous feedback and output strategies 𝑥(𝑡) and 𝑦(𝑡); these strategies are used to construct the new feedback functions, which are passed back to ℛ𝒳 and ℛ𝒴 before they produce 𝑥(𝑡+1) and 𝑦(𝑡+1).]
A well-known folk theorem establishes that the pair of average strategies produced by the regret
minimizers up to any time 𝑇 converges to a saddle point of (1), where convergence is measured via
the saddle point gap
\[
0 \le \gamma(x, y) \coloneqq \Big(\max_{\hat{x} \in \mathcal{X}} \{\hat{x}^\top U y\} - x^\top U y\Big) + \Big(x^\top U y - \min_{\hat{y} \in \mathcal{Y}} \{x^\top U \hat{y}\}\Big) = \max_{\hat{x} \in \mathcal{X}} \{\hat{x}^\top U y\} - \min_{\hat{y} \in \mathcal{Y}} \{x^\top U \hat{y}\}.
\]
A point (𝑥, 𝑦) ∈ 𝒳 × 𝒴 has zero saddle point gap if and only if it is a solution to (1).
Theorem 2.1. Consider the self-play setup summarized in the figure above, where ℛ𝒳 and ℛ𝒴
are regret minimizers for the sets 𝒳 and 𝒴, respectively. Let $\mathrm{Reg}_{\mathcal{X}}^{(T)}$ and $\mathrm{Reg}_{\mathcal{Y}}^{(T)}$ be the (sublinear)
regret cumulated by ℛ𝒳 and ℛ𝒴, respectively, up to time 𝑇, and let $\bar{x}^{(T)}$ and $\bar{y}^{(T)}$ denote the
averages of the strategies produced up to time 𝑇. Then, the saddle point gap of $(\bar{x}^{(T)}, \bar{y}^{(T)})$ satisfies
\[
\gamma\big(\bar{x}^{(T)}, \bar{y}^{(T)}\big) = \frac{\mathrm{Reg}_{\mathcal{X}}^{(T)} + \mathrm{Reg}_{\mathcal{Y}}^{(T)}}{T} \;\to\; 0 \quad \text{as } T \to \infty.
\]
Proof. By definition of regret,
\[
\begin{aligned}
\frac{\mathrm{Reg}_{\mathcal{X}}^{(T)} + \mathrm{Reg}_{\mathcal{Y}}^{(T)}}{T}
&= \frac{1}{T}\max_{\hat{x} \in \mathcal{X}} \left\{\sum_{t=1}^{T} u_{\mathcal{X}}^{(t)}(\hat{x})\right\} - \frac{1}{T}\sum_{t=1}^{T} u_{\mathcal{X}}^{(t)}\big(x^{(t)}\big)
+ \frac{1}{T}\max_{\hat{y} \in \mathcal{Y}} \left\{\sum_{t=1}^{T} u_{\mathcal{Y}}^{(t)}(\hat{y})\right\} - \frac{1}{T}\sum_{t=1}^{T} u_{\mathcal{Y}}^{(t)}\big(y^{(t)}\big) \\
&= \frac{1}{T}\max_{\hat{x} \in \mathcal{X}} \left\{\sum_{t=1}^{T} u_{\mathcal{X}}^{(t)}(\hat{x})\right\} + \frac{1}{T}\max_{\hat{y} \in \mathcal{Y}} \left\{\sum_{t=1}^{T} u_{\mathcal{Y}}^{(t)}(\hat{y})\right\}
&& \Big(\text{since } u_{\mathcal{X}}^{(t)}\big(x^{(t)}\big) + u_{\mathcal{Y}}^{(t)}\big(y^{(t)}\big) = 0\Big) \\
&= \max_{\hat{x} \in \mathcal{X}} \big\{\hat{x}^\top U \bar{y}^{(T)}\big\} - \min_{\hat{y} \in \mathcal{Y}} \big\{\big(\bar{x}^{(T)}\big)^\top U \hat{y}\big\} \\
&= \gamma\big(\bar{x}^{(T)}, \bar{y}^{(T)}\big).
\end{aligned}
\]
Letting 𝑇 → ∞ and using the sublinearity of regret, we obtain the statement. □
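As a numerical illustration of Theorem 2.1 (again assuming the Hedge sketch from earlier is in scope), the following runs the self-play loop on matching pennies and reports the saddle point gap of the average strategies; the game and the parameters are illustrative choices.

```python
import numpy as np

U = np.array([[1.0, -1.0],
              [-1.0, 1.0]])                     # matching pennies, payoffs to the x-player

rx, ry = Hedge(2, eta=0.05), Hedge(2, eta=0.05)
xbar, ybar = np.zeros(2), np.zeros(2)
T = 50_000
for t in range(1, T + 1):
    x, y = rx.next_strategy(), ry.next_strategy()
    rx.observe_utility(U @ y)                   # u_X^(t): x -> x^T U y^(t)
    ry.observe_utility(-(U.T @ x))              # u_Y^(t): y -> -(x^(t))^T U y
    xbar += (x - xbar) / t                      # running averages of the strategies
    ybar += (y - ybar) / t

gap = (U @ ybar).max() - (U.T @ xbar).min()     # gamma(xbar, ybar) over the two simplices
print(xbar.round(3), ybar.round(3), round(float(gap), 4))  # averages near uniform, gap near 0
```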
2.3 Proof of the minimax theorem
The very existence of regret minimizers is a powerful enough fact to imply the minimax theorem!
Theorem 2.2 (Minimax theorem). Let 𝒳 and 𝒴 be convex compact sets, and let 𝑈 be a matrix.
Suppose that a regret minimizer ℛ𝒳 for set 𝒳 guaranteeing sublinear regret no matter the
sequence of utilities can be constructed. Then,
\[
\max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} x^\top U y = \min_{y \in \mathcal{Y}} \max_{x \in \mathcal{X}} x^\top U y.
\]
Proof. One direction of the equality, specifically
\[
\max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} x^\top U y \;\le\; \min_{y \in \mathcal{Y}} \max_{x \in \mathcal{X}} x^\top U y,
\]
follows from definition (this is often called weak duality).
To show the reverse inequality, we will interpret the bilinear saddle point min𝑦∈𝒴 max𝑥∈𝒳 𝑥⊤𝑈 𝑦 as
a repeated game. At each time 𝑡, we will let a regret minimizer ℛ𝒳 pick actions 𝑥(𝑡) ∈ 𝒳, whereas
we will always assume that 𝑦(𝑡) ∈ 𝒴 is chosen by the environment to best respond to 𝑥(𝑡), that is,
\[
y^{(t)} \in \arg\min_{y \in \mathcal{Y}}\; \big(x^{(t)}\big)^\top U y.
\]
The utility function observed by ℛ𝒳 at each time 𝑡 is set to the linear function
\[
u_{\mathcal{X}}^{(t)}(x) = x^\top U y^{(t)}.
\]
Letting $\bar{x}^{(T)} \in \mathcal{X}$ and $\bar{y}^{(T)} \in \mathcal{Y}$ be the average strategies output up to time 𝑇, that is,
\[
\bar{x}^{(T)} \coloneqq \frac{1}{T}\sum_{t=1}^{T} x^{(t)}, \qquad \bar{y}^{(T)} \coloneqq \frac{1}{T}\sum_{t=1}^{T} y^{(t)},
\]
then we have
\[
\max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} x^\top U y \;\ge\; \min_{y \in \mathcal{Y}} \big\{\big(\bar{x}^{(T)}\big)^\top U y\big\} = \frac{1}{T}\min_{y \in \mathcal{Y}} \sum_{t=1}^{T} \big(x^{(t)}\big)^\top U y \;\ge\; \frac{1}{T}\sum_{t=1}^{T} \big(x^{(t)}\big)^\top U y^{(t)}.
\]
The important insight is that the right-hand side can be related to the regret incurred on 𝒳: by
definition,
\[
\begin{aligned}
\frac{1}{T}\sum_{t=1}^{T} \big(x^{(t)}\big)^\top U y^{(t)}
&= -\frac{\mathrm{Reg}_{\mathcal{X}}^{(T)}}{T} + \frac{1}{T}\max_{x \in \mathcal{X}} \left\{\sum_{t=1}^{T} x^\top U y^{(t)}\right\} \\
&= -\frac{\mathrm{Reg}_{\mathcal{X}}^{(T)}}{T} + \max_{x \in \mathcal{X}}\, x^\top U \bar{y}^{(T)} \\
&\ge -\frac{\mathrm{Reg}_{\mathcal{X}}^{(T)}}{T} + \min_{y \in \mathcal{Y}} \max_{x \in \mathcal{X}}\, x^\top U y.
\end{aligned}
\]
Combining the expressions, we obtain
\[
\max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} x^\top U y \;\ge\; \min_{y \in \mathcal{Y}} \max_{x \in \mathcal{X}} x^\top U y - \frac{\mathrm{Reg}_{\mathcal{X}}^{(T)}}{T}.
\]
Letting 𝑇 → ∞ proves the result. □
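The argument above is constructive in spirit: running a regret minimizer for 𝒳 against an environment that always best responds yields a lower bound $\min_{y} (\bar{x}^{(T)})^\top U y$ that approaches the value of the game. Below is a minimal sketch under the same assumptions as the earlier examples (the Hedge class in scope, an arbitrary illustrative payoff matrix).

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(-1.0, 1.0, size=(3, 3))   # arbitrary illustrative payoff matrix

rx = Hedge(3, eta=0.05)
xbar = np.zeros(3)
T = 50_000
for t in range(1, T + 1):
    x = rx.next_strategy()
    y = np.zeros(3)
    y[int(np.argmin(U.T @ x))] = 1.0      # environment best responds: y^(t) in argmin_y x^T U y
    rx.observe_utility(U @ y)             # feedback u_X^(t)(x) = x^T U y^(t)
    xbar += (x - xbar) / t                # running average of the learner's strategies

lower_bound = float((U.T @ xbar).min())   # min_y xbar^T U y, a lower bound on the game value
print(round(lower_bound, 4))
```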
2.4 Convergence to the set of correlated and coarse-correlated equilibria
The previous result is in fact a direct corollary of the more general connection between Φ-regret
minimization and the set of coarse-correlated equilibria in multiplayer general-sum games. We present
a general form of this connection in the next theorem.
Theorem 2.3 (Formal version of Theorems 1.1 and 1.3). Let $x_1^{(t)}, \dots, x_n^{(t)}$ be the strategies played
by the players at any time 𝑡, and let $\Phi\text{-}\mathrm{Reg}_i^{(t)}$ denote the Φ-regret incurred by Player 𝑖
up to time 𝑡. Consider now the average correlated distribution of play up to any time 𝑇, that
is, the distribution $\mu^{(T)}$ that selects a time 𝑡 uniformly at random from the set {1, ..., 𝑇}, and
then selects actions $(a_1, \dots, a_n)$ independently according to the $x_i^{(t)}$, that is,
\[
\mu^{(T)} \coloneqq \frac{1}{T}\sum_{t=1}^{T} x_1^{(t)} \otimes \cdots \otimes x_n^{(t)}.
\]
This distribution satisfies the inequality
\[
\max_{\phi \in \Phi}\; \mathbb{E}_{a \sim \mu^{(T)}}\big[u_i(\phi(a_i), a_{-i}) - u_i(a_i, a_{-i})\big] \;\le\; \frac{\Phi\text{-}\mathrm{Reg}_i^{(T)}}{T}.
\]
Proof. Pick an arbitrary 𝜙 ∈ Φ. With the usual slight abuse of notation, we denote by 𝜙(𝑎),
where 𝑎 is an action, the strategy returned by 𝜙 when evaluated at the deterministic
strategy that places all the mass on 𝑎. Expanding the specific structure of $\mu^{(T)}$, we can decompose
the expectation
\[
\mathbb{E}_{a \sim \mu^{(T)}}\big[u_i(\phi(a_i), a_{-i}) - u_i(a_i, a_{-i})\big]
\]
as
\[
\begin{aligned}
\mathbb{E}_{a \sim \mu^{(T)}}\big[u_i(\phi(a_i), a_{-i}) - u_i(a_i, a_{-i})\big]
&= \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}_{a \sim x_1^{(t)} \otimes \cdots \otimes x_n^{(t)}}\big[u_i(\phi(a_i), a_{-i}) - u_i(a_i, a_{-i})\big] \\
&= \frac{1}{T}\sum_{t=1}^{T} \Big( u_i\big(\mathbb{E}_{a_i \sim x_i^{(t)}}[\phi(a_i)],\, \mathbb{E}_{a_{-i} \sim \otimes x_{-i}^{(t)}}[a_{-i}]\big) - u_i\big(\mathbb{E}_{a_i \sim x_i^{(t)}}[a_i],\, \mathbb{E}_{a_{-i} \sim \otimes x_{-i}^{(t)}}[a_{-i}]\big) \Big) \\
&= \frac{1}{T}\sum_{t=1}^{T} \Big( u_i\big(\phi\big(\mathbb{E}_{a_i \sim x_i^{(t)}}[a_i]\big),\, \mathbb{E}_{a_{-i} \sim \otimes x_{-i}^{(t)}}[a_{-i}]\big) - u_i\big(\mathbb{E}_{a_i \sim x_i^{(t)}}[a_i],\, \mathbb{E}_{a_{-i} \sim \otimes x_{-i}^{(t)}}[a_{-i}]\big) \Big) \\
&= \frac{1}{T}\sum_{t=1}^{T} \Big( u_i\big(\phi\big(x_i^{(t)}\big),\, x_{-i}^{(t)}\big) - u_i\big(x_i^{(t)},\, x_{-i}^{(t)}\big) \Big),
\end{aligned}
\]
where the second equality follows by multilinearity of $u_i$ and the third by linearity of 𝜙. Taking now a maximum over 𝜙 ∈ Φ,
and recognizing the definition of Φ-regret on the right-hand side, we obtain the desired inequality.
□
Note that Theorem 2.3 holds for any set Φ. The approximate equilibria found this way are sometimes
called approximate Φ-equilibria. In the special case of Φ = all constant transformations, it is clear
that the previous result implies convergence to the set of coarse correlated equilibria. For correlated
equilibria, we need to convince ourselves that any arbitrary mapping 𝐴 → 𝐴 can be represented via
a stochastic matrix. This is indeed the case: construct the matrix whose columns indicate
what action is assigned to each action in 𝐴 by the mapping, as the sketch below illustrates. (You should convince yourself!) Finally,
for the case of Φ = all probability mass transports, it is enough to note that the Φ-regret of
any stochastic matrix transformation is at most |𝐴| times larger than the worst possible regret of
a probability mass transport between two actions.
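To make the stochastic-matrix representation concrete, here is a small sketch (helper name ours, numpy assumed) that builds the column-stochastic matrix associated with a mapping 𝜎 : 𝐴 → 𝐴: column 𝑎 is the indicator vector of 𝜎(𝑎), so applying the matrix to 𝑥 ∈ Δ(𝐴) moves the mass of each action 𝑎 onto 𝜎(𝑎).

```python
import numpy as np


def mapping_to_stochastic_matrix(sigma: list[int], n_actions: int) -> np.ndarray:
    # Column a is the one-hot vector of sigma(a), so (M x)_s = sum_{a : sigma(a) = s} x_a.
    M = np.zeros((n_actions, n_actions))
    for a, s in enumerate(sigma):
        M[s, a] = 1.0
    return M


sigma = [2, 0, 2]                        # the mapping 0 -> 2, 1 -> 0, 2 -> 2
M = mapping_to_stochastic_matrix(sigma, 3)
x = np.array([0.5, 0.3, 0.2])
print(M @ x)                             # [0.3 0.  0.7]: output is still a point of Delta(A)
```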
Changelog
• Sep 18: various typos fixed.
• Sep 18: tweaked description of regret minimizers.
• Sep 19: added Definition 2.1 (Canonical learning setup).
• Oct 04: fixed typo (thanks Kat).
★These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.