



MIT 6.S890 — Topics in Multiagent Learning (F23) Thu, September 21st 2023
Lecture 5
Learning in Games: Algorithms
Instructor: Gabriele Farina
In Lecture 4 we saw how no-external-regret dynamics recover several solution concepts of
interest, including Nash equilibria in two-player zero-sum games, normal-form coarse-correlated equilibria in
multiplayer general-sum games, and more generally convex-concave saddle point problems. Furthermore, we
have seen that no-external-regret dynamics are a fundamental building block for constructing no-Φ-regret
dynamics for general sets Φ of deviation functions.
In this lecture, we start exploring how no-external-regret dynamics for players in normal-form games can
be constructed. As a reminder, a player with m actions has a strategy space that corresponds to the probability
simplex of all distributions over those actions,
$$\Delta^m := \{(x_1, \dots, x_m) \in \mathbb{R}^m_{\ge 0} : x_1 + \cdots + x_m = 1\}.$$
Furthermore, remember that constructing a regret minimizer for ∆^m means that we need to build a
mathematical object that supports two operations:
• NextStrategy() has the effect that the regret minimizer will output an element x^(t) ∈ ∆^m;
• ObserveUtility(u^(t)) provides the environment's feedback to the regret minimizer, in the form of a utility
  function u^(t). For today, we will focus on a case of particular relevance for normal-form games: the case
  of linear utilities (as such are the utilities in a normal-form game). In particular, to fix notation, we will let
  $$u^{(t)} : x \mapsto (g^{(t)})^{\top} x,$$
  where g^(t) ∈ ℝ^m is the vector that defines the linear function (called with the letter g to suggest the idea
  of it being the "gradient vector"). Since the utility in the game cannot be unbounded, we assume that
  the g^(t) can be arbitrary but bounded in norm.
In this notation, the (external) regret is defined as the quantity
$$\mathrm{Reg}^{(T)} := \max_{\hat{x} \in \Delta^m} \left\{ \sum_{t=1}^{T} \Big( \langle g^{(t)}, \hat{x}\rangle - \langle g^{(t)}, x^{(t)}\rangle \Big) \right\}.$$
Our goal is to make sure that the regret grows sublinearly in T no matter the utility vectors g(t) chosen by
the environment.
Most algorithms known today for the task operate by prioritizing actions based on how much regret they
have incurred. In particular, a key quantity in defining modern algorithms is the vector of cumulated action
regrets
$$r^{(t)} := \sum_{\tau=1}^{t} \Big( g^{(\tau)} - \langle g^{(\tau)}, x^{(\tau)}\rangle \mathbf{1} \Big).$$
(Note that Reg^(t) = max_{i∈{1,...,m}} r^(t)[i].) The natural question is: how to prioritize?
1 An algorithm that does NOT work: Follow-the-Leader
Perhaps the most natural idea would be to always play the action with the highest cumulated regret (breaking
ties, say, lexicographically). After all, this is the action we most wish we had played in the past.
Unfortunately, this idea is known not to work. To see that, consider the following sequence of gradient vectors:
$$g^{(1)} = \begin{pmatrix} 0 \\ 1/2 \end{pmatrix}, \quad g^{(2)} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad g^{(3)} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad g^{(4)} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad g^{(5)} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad \text{etc.}$$
In this case, at time 1, we pick the first action and score a utility of 0. We then compute
$$r^{(1)} = \begin{pmatrix} 0 \\ 1/2 \end{pmatrix},$$
and at time t = 2 pick the second action since it has the highest regret. This results in a score of 0 and an
updated regret vector
$$r^{(2)} = \begin{pmatrix} 1 \\ 1/2 \end{pmatrix}.$$
So, at time t = 3, we will pick the first action, resulting again in a score of 0 and an updated regret
$$r^{(3)} = \begin{pmatrix} 1 \\ 3/2 \end{pmatrix},$$
and so on. Overall, it is easy to see that in all this jumping around, the regrets grow linearly.
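To make the failure concrete, here is a minimal numpy simulation (our own sketch; the function name and horizon are arbitrary) that replays this adversarial sequence against follow-the-leader and tracks the cumulated regret, which grows roughly like T/2:

```python
import numpy as np

def ftl_on_counterexample(T=1000):
    """Follow-the-leader against the alternating gradient sequence above."""
    r = np.zeros(2)                       # cumulated action regrets r^(t)
    max_regret = []
    for t in range(1, T + 1):
        x = np.zeros(2)
        x[np.argmax(r)] = 1.0             # all mass on the highest-regret action (ties -> first)
        if t == 1:
            g = np.array([0.0, 0.5])
        elif t % 2 == 0:
            g = np.array([1.0, 0.0])
        else:
            g = np.array([0.0, 1.0])
        r += g - np.dot(g, x)             # the scored utility <g, x> is 0 at every step
        max_regret.append(r.max())
    return max_regret

reg = ftl_on_counterexample()
print(reg[99], reg[499], reg[999])        # 50.0, 250.0, 500.0: linear growth
```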
2 The Regret Matching (RM) algorithm
The regret matching algorithm picks probabilities proportional to the "ReLU" of the cumulated regrets, that
is,
$$x^{(t+1)} \propto [r^{(t)}]^+ \tag{1}$$
whenever [r^(t)]^+ ≠ 0, and an arbitrary point otherwise.
The algorithm is presented in pseudocode in Algorithm 1.
Theorem 2.1. The regret matching algorithm (Algorithm 1) is an external regret minimizer, and satisfies
the regret bound Reg^(T) ≤ Ω√T, where Ω is the maximum norm ∥g^(t) − ⟨g^(t), x^(t)⟩1∥₂ up to time T.
Proof. We start by observing that, at all times t,
$$\Big\langle g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)}\rangle \mathbf{1},\; x^{(t+1)} \Big\rangle = 0$$
(this is always true for any point of the simplex, not just for regret matching). Plugging in the definition (1)
of how x^(t+1) is constructed, we therefore conclude that
$$\Big\langle g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)}\rangle \mathbf{1},\; [r^{(t)}]^+ \Big\rangle = 0. \tag{2}$$
(Note that the above equation holds trivially when [r^(t)]^+ = 0 and therefore x^(t+1) is picked arbitrarily.)
Now, we use the following inequality:
$$\big\|[a + b]^+\big\|_2^2 \le \big\|[a]^+ + b\big\|_2^2,$$
applied to a = r^(t) and b = g^(t+1) − ⟨g^(t+1), x^(t+1)⟩1. In particular, since by definition a + b = r^(t+1), we
have
$$\begin{aligned}
\big\|[r^{(t+1)}]^+\big\|_2^2 &\le \big\|[r^{(t)}]^+ + \big(g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)}\rangle \mathbf{1}\big)\big\|_2^2 \\
&= \big\|[r^{(t)}]^+\big\|_2^2 + \big\|g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)}\rangle \mathbf{1}\big\|_2^2 + 2\Big\langle g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)}\rangle \mathbf{1},\; [r^{(t)}]^+ \Big\rangle \\
&= \big\|[r^{(t)}]^+\big\|_2^2 + \big\|g^{(t+1)} - \langle g^{(t+1)}, x^{(t+1)}\rangle \mathbf{1}\big\|_2^2 \qquad \text{(from (2))} \\
&\le \big\|[r^{(t)}]^+\big\|_2^2 + \Omega^2.
\end{aligned}$$
Hence, by induction, ∥[r^(T)]^+∥²₂ ≤ TΩ², and therefore
$$\mathrm{Reg}^{(T)} = \max_{i\in\{1,\dots,m\}} r^{(T)}[i] \le \max_{i\in\{1,\dots,m\}} \big[r^{(T)}[i]\big]^+ \le \big\|[r^{(T)}]^+\big\|_2 \le \Omega\sqrt{T}.$$
The proof is then complete.
Algorithm 1: Regret matching
1  r^(0) ← 0 ∈ ℝ^m, x^(0) ← (1/m)1 ∈ ∆^m
2  function NextStrategy()
3      if [r^(t−1)]^+ ≠ 0, return x^(t) ← [r^(t−1)]^+ / ∥[r^(t−1)]^+∥₁
4      else return x^(t) ← any point in ∆^m
5  function ObserveUtility(g^(t))
6      r^(t) ← r^(t−1) + g^(t) − ⟨g^(t), x^(t)⟩1
Algorithm 2: Regret matching+
1  z^(0) ← 0 ∈ ℝ^m, x^(0) ← (1/m)1 ∈ ∆^m
2  function NextStrategy()
3      if z^(t−1) ≠ 0, return x^(t) ← z^(t−1) / ∥z^(t−1)∥₁
4      else return x^(t) ← any point in ∆^m
5  function ObserveUtility(g^(t))
6      z^(t) ← [z^(t−1) + g^(t) − ⟨g^(t), x^(t)⟩1]^+
As of today, regret matching and its variants remain among the most practical algorithms for learning
in games.
Remark 2.1. One very appealing property of the regret matching algorithm is its lack of hyperparameters.
It just works “out of the box”.
2.1 The regret matching+ (RM+) algorithm
The regret matching+ algorithm was introduced by Tammelin [2014], Tammelin et al. [2015], and is given
in Algorithm 2. It differs from RM only on the last line, where a further thresholding is added. That small
change has the effect that actions with negative cumulated regret (that is, “bad” actions) are treated as actions
with 0 regret. Hence, intuitively, if a bad action were to become good over time, it would take less time for
RM+ to notice and act on that change. Because of that, regret matching+ has stronger practical performance
and is often preferred over regret matching in the game solving literature.
With a simple modification to the analysis of RM, the same bound as RM can be proven.
Theorem 2.2. The RM+ algorithm (Algorithm 2) is an external regret minimizer, and satisfies the regret
bound Reg^(T) ≤ Ω√T, where Ω is the maximum norm ∥g^(t) − ⟨g^(t), x^(t)⟩1∥₂ up to time T.
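To make Algorithms 1 and 2 concrete, here is a minimal numpy sketch of both regret minimizers (class and method names are our own, not part of the notes); it follows the pseudocode above, falling back to the uniform strategy when the thresholded regrets are all zero. The intended usage is to call next_strategy() and then observe_utility(g) once per iteration, as in the protocol from Lecture 4.

```python
import numpy as np

class RegretMatching:
    """Regret matching (Algorithm 1) over the m-dimensional simplex."""
    def __init__(self, m):
        self.m = m
        self.r = np.zeros(m)                    # cumulated action regrets r^(t)
        self.x = np.full(m, 1.0 / m)

    def next_strategy(self):
        pos = np.maximum(self.r, 0.0)           # [r^(t-1)]^+
        s = pos.sum()
        self.x = pos / s if s > 0 else np.full(self.m, 1.0 / self.m)
        return self.x

    def observe_utility(self, g):
        # r^(t) = r^(t-1) + g^(t) - <g^(t), x^(t)> 1
        self.r += g - np.dot(g, self.x)

class RegretMatchingPlus(RegretMatching):
    """Regret matching+ (Algorithm 2): threshold the cumulated regrets after every update."""
    def observe_utility(self, g):
        self.r = np.maximum(self.r + g - np.dot(g, self.x), 0.0)
```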
3 Follow-the-Regularized-Leader (FTRL)
Another way of obtaining no-regret algorithms is by considering a regularized (i.e., smoothed) version of the
follow-the-leader algorithm discussed above. The idea is that, instead of playing by always putting 100% of
the probability mass on the action with highest cumulated regret, we look for the distribution that maximizes
the expected cumulated regret, minus some regularization term that prevents us from putting all the mass on
a single action.
More specifically, let φ : ∆^m → ℝ be a 1-strongly convex¹ function (with respect to some norm ∥·∥, for
example, the Euclidean norm ∥·∥₂) with bounded range, and consider the choice of strategy
$$x^{(t)} := \arg\max_{\hat{x} \in \Delta^m} \left\{ (r^{(t-1)})^{\top} \hat{x} - \frac{1}{\eta}\,\varphi(\hat{x}) \right\}$$
(so, in particular, x^(1) := arg min_{x̂∈∆^m} φ(x̂), since the regrets of all actions are 0 at the beginning). This
algorithm is called the follow-the-regularized-leader (FTRL) algorithm. The following two properties are
known about FTRL:
• Regularization limits the amount of variation between consecutive strategies. This makes intuitive sense:
  if η → 0, for example, x^(t) is constant (and equal to arg min_{x̂∈∆^m} φ(x̂)). When η = ∞, we recover the
  follow-the-leader algorithm, where strategies can jump arbitrarily. For intermediate η, as you might
  expect, the amount of variation between strategies at consecutive times is bounded above by a quantity
  proportional to η:
  $$\|x^{(t+1)} - x^{(t)}\| \le \eta\,\|g^{(t)}\|_*, \tag{3}$$
  where ∥·∥_* is the dual norm of ∥·∥ (if ∥·∥ is the Euclidean norm, its dual is also the Euclidean norm;
  if ∥·∥ is the ℓ₁ norm, then its dual is the ℓ∞ norm).
• A useful regret bound of FTRL is given by
  $$\mathrm{Reg}^{(T)} \le \frac{\max_{\hat{x}\in\Delta^m} \varphi(\hat{x}) - \varphi(x^{(1)})}{\eta} + \sum_{t=1}^{T} (g^{(t)})^{\top}\big(x^{(t+1)} - x^{(t)}\big). \tag{4}$$
Using the Cauchy-Schwarz inequality in (4) and plugging in (3), we obtain the following
Theorem 3.1. The regret cumulated by the FTRL algorithm is upper bounded by
$$\mathrm{Reg}^{(T)} \le \frac{\max_{\hat{x}\in\Delta^m} \varphi(\hat{x}) - \varphi(x^{(1)})}{\eta} + \eta \sum_{t=1}^{T} \|g^{(t)}\|_*^2.$$
Since the norm of the gradient vectors is bounded, it follows immediately that by picking learning
rate η = 1/√T, we obtain that Reg^(T) is bounded as roughly √T (a sublinear function!) at all times T.
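To spell out where the √T comes from (a worked step of our own; the shorthands D and L are not used elsewhere in these notes), write D := max_{x̂∈∆^m} φ(x̂) − φ(x^(1)) and L := max_t ∥g^(t)∥_*. Then Theorem 3.1 gives
$$\mathrm{Reg}^{(T)} \le \frac{D}{\eta} + \eta\, L^2 T, \qquad \text{and the choice}\quad \eta = \frac{1}{L}\sqrt{\frac{D}{T}} \;\Longrightarrow\; \mathrm{Reg}^{(T)} \le 2L\sqrt{D\,T} = O(\sqrt{T}),$$
which balances the two terms of the bound.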
One of the appeals of FTRL is that it is a very general method, and in fact it generalizes well beyond
probability simplexes.
We will now see a particular instantiation of FTRL that is so important and well-studied that it deserves
its own name: the Multiplicative Weights Update method (MWU).
3.1 The Multiplicative Weights Update Algorithm
A natural choice of convex regularizer for probability simplexes is the negative entropy function
$$h(x) := \sum_{i=1}^{m} x[i]\,\log x[i].$$
Example 3.1. For example, in the case of m = 2 actions the negative entropy h(x), as a function of the
probability of the first action x[1], is a convex curve that equals 0 at x[1] ∈ {0, 1} and attains its minimum
at x[1] = 1/2. [Plot omitted: negative entropy h(x) against the probability of the first action, x[1].]
The negative entropy function has the following properties:
• It is 1-strongly convex with respect to the ℓ₁ norm ∥·∥₁.
• The maximum of the function is 0, which is attained at any deterministic strategy.
• Its minimum (that is, maximum entropy) is attained at the uniformly random strategy x̄ = (1/m, …, 1/m),
  at which the function has value h(x̄) = m · (1/m) log(1/m) = −log m.
In this case, then, the MWU algorithm reduces to the following:
$$x^{(1)} := (1/m, \dots, 1/m),$$
$$x^{(t)} := \arg\max_{\hat{x} \in \Delta^m} \left\{ (r^{(t-1)})^{\top} \hat{x} - \frac{1}{\eta}\, h(\hat{x}) \right\} = \arg\max_{\hat{x} \in \Delta^m} \left\{ (r^{(t-1)})^{\top} \hat{x} - \frac{1}{\eta} \sum_{i=1}^{m} \hat{x}[i] \log \hat{x}[i] \right\}. \tag{5}$$
As it turns out, via a simple application of the Lagrange multiplier theorem, the concave optimization problem
in (5) has a closed-form solution:
Proposition 3.1. The concave optimization problem (5) has the closed-form solution
$$x^{(t)} = \mathrm{softmax}\{\eta\, r^{(t-1)}\},$$
that is, for every action i ∈ {1, …, m},
$$x^{(t)}[i] = \frac{e^{\eta\, r^{(t-1)}[i]}}{\sum_{j=1}^{m} e^{\eta\, r^{(t-1)}[j]}}.$$
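For completeness, here is a sketch of the Lagrangian computation that the proposition alludes to (our own filled-in step). Introducing a multiplier λ for the constraint Σᵢ x̂[i] = 1 (the nonnegativity constraints turn out to be inactive), stationarity requires, for every i,
$$\frac{\partial}{\partial \hat{x}[i]} \left[ (r^{(t-1)})^{\top}\hat{x} - \frac{1}{\eta}\sum_{j=1}^{m} \hat{x}[j]\log \hat{x}[j] - \lambda\Big(\sum_{j=1}^{m} \hat{x}[j] - 1\Big) \right] = r^{(t-1)}[i] - \frac{1}{\eta}\big(\log \hat{x}[i] + 1\big) - \lambda = 0,$$
so x̂[i] = e^{η r^(t−1)[i]} · e^{−1−ηλ} ∝ e^{η r^(t−1)[i]}; normalizing the entries to sum to 1 gives exactly the softmax expression above.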
This gives rise to Algorithm 3, which has a similar flavor to RM, but uses a different prioritization of
actions (in particular, it uses softmax instead of ReLU).
Algorithm 3: Multiplicative Weights Update
1  r^(0) ← 0 ∈ ℝ^m, x^(0) ← (1/m)1 ∈ ∆^m
2  function NextStrategy()
3      return x^(t) ← softmax(η r^(t−1))
4  function ObserveUtility(g^(t))
5      r^(t) ← r^(t−1) + g^(t) − ⟨g^(t), x^(t)⟩1
From the general analysis of FTRL (Theorem 3.1), remembering that:
• the maximum of h is 0,
• the minimum (attained at x^(1)) is −log m,
• negative entropy is 1-strongly convex with respect to ∥·∥₁, and that the dual norm to ∥·∥₁ is ∥·∥∞,
we obtain the following.
Corollary 3.1. The regret cumulated by the MWU algorithm is upper bounded by
$$\mathrm{Reg}^{(T)} \le \frac{\log m}{\eta} + \eta \sum_{t=1}^{T} \|g^{(t)}\|_{\infty}^2.$$
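A minimal numpy sketch of Algorithm 3 follows (our own; the class name is arbitrary), using the standard max-shift inside the softmax for numerical stability:

```python
import numpy as np

class MultiplicativeWeightsUpdate:
    """MWU (Algorithm 3): FTRL on the simplex with the negative-entropy regularizer."""
    def __init__(self, m, eta=0.1):
        self.m, self.eta = m, eta
        self.r = np.zeros(m)                 # cumulated action regrets r^(t)
        self.x = np.full(m, 1.0 / m)

    def next_strategy(self):
        z = self.eta * self.r
        z -= z.max()                         # softmax is shift-invariant; avoids overflow
        w = np.exp(z)
        self.x = w / w.sum()                 # softmax(eta * r^(t-1))
        return self.x

    def observe_utility(self, g):
        self.r += g - np.dot(g, self.x)
```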
References
Oskari Tammelin. Solving large imperfect information games using CFR+, 2014.
Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold’em.
In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2015.
Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econo-
metrica, 68:1127–1150, 2000.
David Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1–8,
1956.
Appendix
The regret matching (RM) algorithm has a connection to an ancient (and very elegant!) construction, called
Blackwell approachability. Blackwell approachability is a precursor of the theory of regret minimization, and
played a fundamental role in the historical development of several efficient online optimization methods. In
particular, as we will show in a minute, the problem of minimizing regret on a simplex can be rewritten as a
Blackwell approachability game. The solution of the Blackwell approachability game will then recover exactly
RM.
A Blackwell approachability game
Blackwell approachability generalizes the problem of playing a repeated two-player game to games whose
utilities are vectors instead of scalars.
Definition A.1. A Blackwell approachability game is a tuple (X, Y, f, S), where X, Y are closed convex
sets, f : X × Y → ℝ^d is a biaffine function, and S ⊆ ℝ^d is a closed and convex target set. A Blackwell
approachability game represents a vector-valued repeated game between two players. At each time t, the
two players interact in this order:
• first, Player 1 selects an action x^(t) ∈ X;
• then, Player 2 selects an action y^(t) ∈ Y, which can depend adversarially on all the x^(τ) output so far;
• finally, Player 1 incurs the vector-valued payoff f(x^(t), y^(t)) ∈ ℝ^d, where f is a biaffine function.
Player 1's objective is to guarantee that the average payoff converges to the target set S. Formally, given
the target set S ⊆ ℝ^d, Player 1's goal is to pick actions x^(1), x^(2), … ∈ X such that, no matter the actions
y^(1), y^(2), … ∈ Y played by Player 2,
$$\min_{\hat{s} \in S} \left\| \hat{s} - \frac{1}{T}\sum_{t=1}^{T} f(x^{(t)}, y^{(t)}) \right\|_2 \to 0 \quad \text{as } T \to \infty. \tag{6}$$
B Regret minimization on the simplex via Blackwell approachability
Hart and Mas-Colell [2000] noted that the construction of a regret minimizer for a simplex domain ∆^n can be
reduced to constructing an algorithm for a particular Blackwell approachability game Γ := (∆^n, ℝ^n, f, ℝ^n_{≤0}),
which we now describe. For all i ∈ {1, …, n}, the i-th component of the vector-valued payoff function f
measures the change in regret incurred at time t, compared to always playing the i-th vertex e_i of the simplex.
Formally, f : ∆^n × ℝ^n → ℝ^n is defined as
$$f(x^{(t)}, g^{(t)}) := g^{(t)} - \langle g^{(t)}, x^{(t)}\rangle \mathbf{1}, \tag{7}$$
where 1 is the n-dimensional vector whose components are all 1. (Note the connection with r^(t) seen earlier
in this lecture.)
The following lemma establishes an important link between Blackwell approachability on Γ and external
regret minimization on the simplex ∆^n.
Lemma B.1. The regret Reg^(T) = max_{x̂∈∆^n} Σ_{t=1}^{T} ⟨g^(t), x̂ − x^(t)⟩ cumulated up to any time T by any
sequence of decisions x^(1), …, x^(T) ∈ ∆^n is related to the distance of the average Blackwell payoff from
the target cone ℝ^n_{≤0} as
$$\frac{\mathrm{Reg}^{(T)}}{T} \le \min_{\hat{s} \in \mathbb{R}^n_{\le 0}} \left\| \hat{s} - \frac{1}{T}\sum_{t=1}^{T} f(x^{(t)}, g^{(t)}) \right\|_2. \tag{8}$$
So, a strategy for the Blackwell approachability game Γ is a regret-minimizing strategy for the simplex
domain ∆^n.
Proof. For any x̂ ∈ ∆^n, the regret cumulated compared to always playing x̂ satisfies
$$\begin{aligned}
\frac{1}{T}\,\mathrm{Reg}^{(T)}(\hat{x}) &:= \frac{1}{T}\sum_{t=1}^{T} \Big( \langle g^{(t)}, \hat{x}\rangle - \langle g^{(t)}, x^{(t)}\rangle \Big) \\
&= \frac{1}{T}\sum_{t=1}^{T} \Big( \langle g^{(t)}, \hat{x}\rangle - \langle g^{(t)}, x^{(t)}\rangle \langle \mathbf{1}, \hat{x}\rangle \Big) \\
&= \left\langle \frac{1}{T}\sum_{t=1}^{T} \Big( g^{(t)} - \langle g^{(t)}, x^{(t)}\rangle \mathbf{1} \Big),\; \hat{x} \right\rangle \\
&= \left\langle \frac{1}{T}\sum_{t=1}^{T} f(x^{(t)}, g^{(t)}),\; \hat{x} \right\rangle \\
&= \min_{\hat{s} \in \mathbb{R}^n_{\le 0}} \left\langle -\hat{s} + \frac{1}{T}\sum_{t=1}^{T} f(x^{(t)}, g^{(t)}),\; \hat{x} \right\rangle, \tag{9}
\end{aligned}$$
where we used the fact that x̂ ∈ ∆^n in the second equality, and that min_{ŝ∈ℝ^n_{≤0}} ⟨−ŝ, x̂⟩ = 0 (since x̂ ≥ 0) in the last.
Applying the Cauchy-Schwarz inequality to the right-hand side of (9), we obtain
$$\frac{1}{T}\,\mathrm{Reg}^{(T)}(\hat{x}) \le \min_{\hat{s} \in \mathbb{R}^n_{\le 0}} \left\| -\hat{s} + \frac{1}{T}\sum_{t=1}^{T} f(x^{(t)}, g^{(t)}) \right\|_2 \|\hat{x}\|_2.$$
So, using the fact that ∥x̂∥₂ ≤ 1 for any x̂ ∈ ∆^n,
$$\frac{1}{T}\,\mathrm{Reg}^{(T)}(\hat{x}) \le \min_{\hat{s} \in \mathbb{R}^n_{\le 0}} \left\| \hat{s} - \frac{1}{T}\sum_{t=1}^{T} f(x^{(t)}, g^{(t)}) \right\|_2.$$
Taking a max over x̂ ∈ ∆^n yields the statement.
C Solving Blackwell games: Blackwell’s algorithm
A central concept in the theory of Blackwell approachability is the following.
Definition C.1 (Forceable halfspace). Let (X, Y, f, S) be a Blackwell approachability game and let H ⊆ ℝ^d
be a halfspace, that is, a set of the form H = {x ∈ ℝ^d : a^⊤x ≤ b} for some a ∈ ℝ^d, b ∈ ℝ. The halfspace
H is said to be forceable if there exists a strategy of Player 1 that guarantees that the payoff is in H no
matter the actions played by Player 2, that is, if there exists x ∈ X such that
$$f(x, y) \in H \quad \forall\, y \in Y.$$
When that is the case, we call the action x a forcing action for H.
Blackwell’s approachability theorem [Blackwell, 1956] states the following.
Theorem C.1 (Blackwell's theorem). Goal (6) can be attained if and only if every halfspace H ⊇ S is
forceable.
We constructively prove the direction that shows how forceability translates into a sequence of strategies
that guarantees that goal (6) is attained. Let (X, Y, f, S) be the Blackwell game. The method is pretty simple:
at each time step t = 1, 2, …, operate the following (a small code sketch of this loop is given after Figure 1):
1. Compute the average payoff received so far, that is, φ^(t) = (1/(t−1)) Σ_{τ=1}^{t−1} f(x^(τ), y^(τ)) (with φ^(1) := 0).
2. Compute the Euclidean projection ψ^(t) of φ^(t) onto the target set S.
3. If φ^(t) ∈ S (that is, goal (6) has already been met), pick and play any x^(t) ∈ X, observe the opponent's
   action y^(t), and return.
4. Else, consider the halfspace H^(t) tangent to S at the projection point ψ^(t), that contains S. In symbols,
   $$H^{(t)} := \{z \in \mathbb{R}^d : (\varphi^{(t)} - \psi^{(t)})^{\top} z \le (\varphi^{(t)} - \psi^{(t)})^{\top}\psi^{(t)}\}.$$
5. By hypothesis, H^(t) is forceable. Pick x^(t) to be a forcing action for H^(t), observe the opponent's action
   y^(t), and return.
The above method is summarized in Figure 1.
Figure 1: Construction of the approachability strategy described in Appendix C. (The figure depicts the
target set S, the average payoff φ^(t), its projection ψ^(t) onto S, the forced halfspace H^(t), the payoff
f(x^(t), y^(t)), and the updated average φ^(t+1).)
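As promised above, here is a generic numpy sketch of this loop (our own; the oracle arguments project_onto_S, forcing_action, opponent, and any_action are hypothetical callables that a concrete instantiation would have to supply):

```python
import numpy as np

def blackwell_play(T, d, project_onto_S, forcing_action, opponent, any_action):
    """Sketch of Blackwell's algorithm (Appendix C).

    project_onto_S(phi)  -> Euclidean projection of phi onto the target set S
    forcing_action(a, b) -> a forcing action for the halfspace {z : <a, z> <= b}
    opponent(x)          -> the payoff vector f(x, y) in R^d for Player 2's chosen y
    any_action           -> an arbitrary fixed action in X
    """
    payoff_sum = np.zeros(d)
    for t in range(1, T + 1):
        phi = payoff_sum / (t - 1) if t > 1 else np.zeros(d)  # Step 1: average payoff so far
        psi = project_onto_S(phi)                             # Step 2: projection onto S
        if np.allclose(phi, psi):                             # Step 3: already inside S
            x = any_action
        else:                                                 # Steps 4-5: force the tangent halfspace
            a = phi - psi
            x = forcing_action(a, float(np.dot(a, psi)))
        payoff_sum += opponent(x)                             # observe f(x^(t), y^(t))
    return payoff_sum / T                                     # average payoff after T rounds
```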
Let's see how the average payoff φ^(t) changes when we play as described above. Clearly,
$$\varphi^{(t+1)} = \frac{1}{t}\sum_{\tau=1}^{t} f(x^{(\tau)}, y^{(\tau)}) = \frac{t-1}{t}\,\varphi^{(t)} + \frac{1}{t}\, f(x^{(t)}, y^{(t)}).$$
Hence, denoting by ρ^(t) the squared Euclidean distance between φ^(t) and the target set, that is,
$$\rho^{(t)} := \min_{\hat{s} \in S} \left\| \hat{s} - \varphi^{(t)} \right\|_2^2,$$
we have
$$\begin{aligned}
\rho^{(t+1)} &\le \left\| \psi^{(t)} - \varphi^{(t+1)} \right\|_2^2 \\
&= \left\| \psi^{(t)} - \frac{t-1}{t}\varphi^{(t)} - \frac{1}{t} f(x^{(t)}, y^{(t)}) \right\|_2^2 \\
&= \left\| \frac{t-1}{t}\Big( \psi^{(t)} - \varphi^{(t)} \Big) + \frac{1}{t}\Big( \psi^{(t)} - f(x^{(t)}, y^{(t)}) \Big) \right\|_2^2 \\
&= \frac{(t-1)^2}{t^2}\,\rho^{(t)} + \frac{1}{t^2}\left\| \psi^{(t)} - f(x^{(t)}, y^{(t)}) \right\|_2^2 + \frac{2(t-1)}{t^2}\left\langle \psi^{(t)} - \varphi^{(t)},\; \psi^{(t)} - f(x^{(t)}, y^{(t)}) \right\rangle. \tag{10}
\end{aligned}$$
The proof so far does not use any particular assumption about how x^(t) is picked. Here is where that enters
the picture. If φ^(t) ∈ S, then ψ^(t) = φ^(t) and therefore the last inner product is equal to 0. Otherwise, we
have that ψ^(t) − φ^(t) ≠ 0. In that case, x^(t) is constructed by forcing the halfspace H^(t), and therefore, no
matter how y^(t) is picked by the opponent we have
$$f(x^{(t)}, y^{(t)}) \in H^{(t)} \iff (\varphi^{(t)} - \psi^{(t)})^{\top} f(x^{(t)}, y^{(t)}) \le (\varphi^{(t)} - \psi^{(t)})^{\top}\psi^{(t)} \iff \left\langle \psi^{(t)} - \varphi^{(t)},\; \psi^{(t)} - f(x^{(t)}, y^{(t)}) \right\rangle \le 0.$$
Plugging the last inequality into (10), and bounding ∥ψ^(t) − f(x^(t), y^(t))∥²₂ ≤ Ω², where Ω² is a diameter
parameter of the game (which only depends on f and S), we obtain
$$\rho^{(t+1)} \le \frac{(t-1)^2}{t^2}\,\rho^{(t)} + \frac{\Omega^2}{t^2} \quad\Longrightarrow\quad t^2\rho^{(t+1)} \le (t-1)^2\rho^{(t)} + \Omega^2 \qquad \forall\, t = 1, 2, \dots.$$
Summing the inequality above for t = 1, …, T and removing the telescoping terms, we obtain
$$T^2 \rho^{(T+1)} \le T\,\Omega^2 \quad\Longrightarrow\quad \rho^{(T+1)} \le \frac{\Omega^2}{T} \quad\Longrightarrow\quad \min_{\hat{s}\in S} \left\| \hat{s} - \frac{1}{T}\sum_{t=1}^{T} f(x^{(t)}, y^{(t)}) \right\|_2 \le \frac{\Omega}{\sqrt{T}}, \tag{11}$$
which implies that the average payoff in the Blackwell game converges to S at a rate of O(1/√T).
D The regret matching (RM) algorithm as an instance of Blackwell’s
algorithm
First, recall from Appendix B that external regret minimization on the simplex can be solved via the
Blackwell game Γ := (∆^n, ℝ^n, f, ℝ^n_{≤0}), where f : ∆^n × ℝ^n → ℝ^n is defined as
$$f(x^{(t)}, g^{(t)}) = g^{(t)} - \langle g^{(t)}, x^{(t)}\rangle \mathbf{1}, \tag{12}$$
where 1 is the n-dimensional vector whose components are all 1. We will solve this Blackwell approachability
game using the strategy explained in Appendix C.
Computation of ψ^(t) (Step 2). Let's start by looking at how to compute the projection ψ^(t) of φ^(t) onto
S = ℝ^n_{≤0}. Projection onto the nonpositive orthant amounts to a component-wise minimum with 0, that is,
ψ^(t) = [φ^(t)]^−. Hence,
$$\varphi^{(t)} - \psi^{(t)} = [\varphi^{(t)}]^+ \quad\Longrightarrow\quad (\varphi^{(t)} - \psi^{(t)})^{\top}\psi^{(t)} = 0.$$
Halfspace to be forced (Step 4). Following on with Blackwell's algorithm, when [φ^(t)]^+ ≠ 0, the halfspace
to be forced at each time t is
$$H^{(t)} := \{z \in \mathbb{R}^n : \langle [\varphi^{(t)}]^+, z\rangle \le 0\}.$$
Forcing action for H^(t) (Step 5). We now show that a forcing action for H^(t) indeed exists. Remember
that, by definition, that is an action x ∈ ∆^n such that, no matter the g ∈ ℝ^n, f(x, g) ∈ H^(t). Expanding the
definitions of H^(t) and f, we are looking for an x ∈ ∆^n such that
$$\begin{aligned}
\big\langle [\varphi^{(t)}]^+,\; g - \langle g, x\rangle \mathbf{1} \big\rangle \le 0 \quad \forall\, g \in \mathbb{R}^n
&\iff \big\langle [\varphi^{(t)}]^+, g\big\rangle - \langle g, x\rangle\,\big\langle [\varphi^{(t)}]^+, \mathbf{1}\big\rangle \le 0 \quad \forall\, g \in \mathbb{R}^n \\
&\iff \big\langle [\varphi^{(t)}]^+, g\big\rangle - \langle g, x\rangle\,\big\|[\varphi^{(t)}]^+\big\|_1 \le 0 \quad \forall\, g \in \mathbb{R}^n \\
&\iff \left\langle g,\; \frac{[\varphi^{(t)}]^+}{\|[\varphi^{(t)}]^+\|_1} \right\rangle - \langle g, x\rangle \le 0 \quad \forall\, g \in \mathbb{R}^n \\
&\iff \left\langle g,\; \frac{[\varphi^{(t)}]^+}{\|[\varphi^{(t)}]^+\|_1} - x \right\rangle \le 0 \quad \forall\, g \in \mathbb{R}^n.
\end{aligned}$$
Note that we are lucky: [φ^(t)]^+ / ∥[φ^(t)]^+∥₁ is a nonnegative vector whose entries sum to 1. So, the above
inequality can be satisfied (with equality) by the choice
$$x = \frac{[\varphi^{(t)}]^+}{\|[\varphi^{(t)}]^+\|_1} \in \Delta^n.$$
In other words, Blackwell's algorithm in this case picks
$$x^{(t+1)} = \frac{[\varphi^{(t+1)}]^+}{\|[\varphi^{(t+1)}]^+\|_1} \in \Delta^n \quad\Longleftrightarrow\quad x^{(t+1)} \propto [r^{(t)}]^+, \qquad \text{where}\quad r^{(t)} := \sum_{\tau=1}^{t} \Big( g^{(\tau)} - \langle g^{(\tau)}, x^{(\tau)}\rangle \mathbf{1} \Big)$$
(the equivalence holds because φ^(t+1) is just r^(t) scaled by the positive factor 1/t). This is exactly the
regret matching algorithm seen earlier in this lecture.
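As a quick sanity check (our own, with random bounded utility vectors standing in for the environment, which is of course milder than a worst-case opponent), the following snippet runs the regret-matching strategy and verifies that the distance of the average Blackwell payoff from the nonpositive orthant, which by (8) upper bounds the average regret, decays at the O(1/√T) rate promised by (11):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 10_000
r = np.zeros(n)                                   # cumulated payoffs; r^(t) = t * phi^(t+1)

for t in range(1, T + 1):
    pos = np.maximum(r, 0.0)                      # [r^(t-1)]^+, proportional to [phi^(t)]^+
    x = pos / pos.sum() if pos.sum() > 0 else np.full(n, 1.0 / n)
    g = rng.uniform(-1.0, 1.0, size=n)            # arbitrary bounded utility vector
    r += g - np.dot(g, x)                         # Blackwell payoff f(x^(t), g^(t))
    if t in (100, 1_000, 10_000):
        # Distance of the average payoff to R^n_{<=0} is the norm of its positive part.
        dist = np.linalg.norm(np.maximum(r / t, 0.0))
        print(t, dist, dist * np.sqrt(t))         # (11) guarantees dist * sqrt(t) <= Omega
```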
¹A differentiable function f is 1-strongly convex with respect to a norm ∥·∥ if, for all x, y in its domain, one has (∇f(x) − ∇f(y))^⊤(x − y) ≥ ∥x − y∥².

Date: 2023-09-21