

MIT 6.S890 — Topics in Multiagent Learning Tue, Oct 1st 2024
Lecture 8
Learning in games: Φ-regret minimization
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)
As we have seen, external regret minimization is a very narrow instantiation of Φ-regret minimization, perhaps the smallest sensible one. Clearly, then, the problem of coming up with an external regret minimizer for a set 𝒳 cannot be harder than the problem of coming up with a Φ-regret minimizer for 𝒳 with a richer set of transformation functions Φ. It might therefore seem surprising that there exists a general construction that reduces Φ-regret minimization to external regret minimization. We present this construction in Section 2. First, we start with an example of this principle, whose discovery predates the general construction of Section 2, and which happens to be a special case of it.
1 Blum-Mansour’s swap regret minimization algorithm
Let’s start by considering the task of constructing a swap regret minimizer for the probability simplex 𝒳 = Δ𝑛 over the 𝑛 actions 𝐴 = {1, ..., 𝑛}. As we mentioned in Lecture 4, when all players in a normal-form game play according to the strategies generated by such an algorithm, the average product strategy converges to the set of correlated equilibria of the game. The swap regret minimization algorithm of [BM07] is simple and elegant: it minimizes swap regret starting from multiple copies of an external regret minimizer.
Recall that in swap regret minimization, the learner wants to minimize the Φ-regret with respect to
the set of strategy transformations Φ represented by stochastic matrices
$$\Phi \coloneqq \left\{\, P = \begin{pmatrix} \vert & \vert & & \vert \\ p_1 & p_2 & \cdots & p_n \\ \vert & \vert & & \vert \end{pmatrix} \;:\; p_1, \dots, p_n \in \Delta^n \right\}.$$
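As a quick sanity check on this definition, the snippet below (a minimal sketch in Python with numpy, using arbitrary random data) verifies that such a matrix maps Δ𝑛 back into Δ𝑛, and that $Px = \sum_{i=1}^{n} x_i p_i$, the decomposition used repeatedly in the proof of Theorem 1.1 below.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# A matrix in Phi: each column p_i is a point of the simplex Delta^n.
P = rng.random((n, n))
P /= P.sum(axis=0, keepdims=True)

# An arbitrary point x of Delta^n.
x = rng.random(n)
x /= x.sum()

Px = P @ x
assert np.isclose(Px.sum(), 1.0) and np.all(Px >= 0)            # P x stays in Delta^n
assert np.allclose(Px, sum(x[i] * P[:, i] for i in range(n)))   # P x = sum_i x_i p_i
```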
In order to construct a swap regret minimizer for 𝒳 = Δ𝑛, we start with |𝐴| copies of an external regret minimizer for Δ𝑛, denoted $\mathcal{R}_1, \dots, \mathcal{R}_n$.
• To compute the NextStrategy() of the swap regret minimizer, we first call the NextStrategy() of each $\mathcal{R}_i$ to obtain a distribution $p_i^{(t)} \in \Delta^n$. We will then assemble all $p_i^{(t)}$ into a stochastic matrix $P^{(t)}$, and will output a fixed point
$$x^{(t)} = P^{(t)} x^{(t)} = \sum_{i=1}^{n} x_i^{(t)}\, p_i^{(t)} \in \Delta^n.$$
• To compute the ObserveUtility($u^{(t)}$) of the swap regret minimizer, we first call the ObserveUtility() of each $\mathcal{R}_i$ with the linear utility function
$$v_i^{(t)} \coloneqq x_i^{(t)}\, u^{(t)},$$
which is obtained by rescaling $u^{(t)}$ by the component $x_i^{(t)}$ of the last-output strategy $x^{(t)}$.
The process can be depicted pictorially as in Figure 1.
Figure 1: Blum-Mansour’s swap regret minimizer. Each external regret minimizer $\mathcal{R}_i$ for $\Delta^n$ observes the rescaled utility $x_i^{(t)} u^{(t)}$ and outputs a distribution $p_i^{(t)}$; these are assembled into the stochastic matrix $P^{(t)} = (p_1^{(t)} \mid \cdots \mid p_n^{(t)})$, and a fixed point $x^{(t)} = P^{(t)} x^{(t)} \in \Delta^n$ is output in response to the incoming utility $u^{(t)} : \Delta^n \to \mathbb{R}$.
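The construction can also be sketched in a few lines of Python. The snippet below is a minimal illustration, not the implementation from [BM07]: it uses multiplicative weights (Hedge), with an arbitrarily chosen step size, as the external regret minimizers $\mathcal{R}_i$, and computes the fixed point $x^{(t)} = P^{(t)} x^{(t)}$ by solving a small linear system (any fixed-point method, such as power iteration, would work equally well).

```python
import numpy as np

class Hedge:
    """External regret minimizer for Delta^n (multiplicative weights / Hedge)."""
    def __init__(self, n, eta=0.1):
        self.logw = np.zeros(n)
        self.eta = eta

    def next_strategy(self):
        w = np.exp(self.logw - self.logw.max())
        return w / w.sum()

    def observe_utility(self, v):
        # v[j] is the utility of action j; Hedge reweights actions multiplicatively.
        self.logw += self.eta * v

def fixed_point(P):
    """Return x in Delta^n with x = P x, for a column-stochastic matrix P."""
    n = P.shape[0]
    A = np.vstack([P - np.eye(n), np.ones((1, n))])   # enforce x = P x and sum(x) = 1
    b = np.concatenate([np.zeros(n), [1.0]])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    x = np.clip(x, 0.0, None)
    return x / x.sum()

class BlumMansour:
    """Swap regret minimizer for Delta^n built from n external regret minimizers."""
    def __init__(self, n):
        self.n = n
        self.R = [Hedge(n) for _ in range(n)]          # one copy R_i per action i

    def next_strategy(self):
        # Assemble P^(t) whose i-th column is the distribution output by R_i ...
        self.P = np.column_stack([R_i.next_strategy() for R_i in self.R])
        # ... and output a fixed point x^(t) = P^(t) x^(t).
        self.x = fixed_point(self.P)
        return self.x

    def observe_utility(self, u):
        # Each R_i observes u^(t) rescaled by the i-th entry of the last output x^(t).
        for i, R_i in enumerate(self.R):
            R_i.observe_utility(self.x[i] * u)
```

A caller simply alternates next_strategy() and observe_utility(u) with whatever utility vectors the environment reveals at each time step.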
We now claim that the algorithm described above is a swap regret minimizer for Δ𝑛.
Theorem 1.1. Let $\mathrm{Reg}_i^{(T)}$ denote the regret incurred by each external regret minimizer $\mathcal{R}_i$ for $\Delta^n$ in the Blum-Mansour construction described above. Then, the swap regret cumulated by the algorithm satisfies:
$$\mathrm{SwapReg}^{(T)} \le \sum_{i=1}^{n} \mathrm{Reg}_i^{(T)}.$$
In particular, the swap regret grows sublinearly in $T$ whenever the external regret minimizers $\mathcal{R}_i$ guarantee sublinear regret.
Proof. By construction, the external regret incurred by each $\mathcal{R}_i$ is
$$\mathrm{Reg}_i^{(T)} = \max_{\tilde{p}_i \in \Delta^n} \sum_{t=1}^{T} \left( v_i^{(t)}(\tilde{p}_i) - v_i^{(t)}(p_i^{(t)}) \right). \qquad (1)$$
Pick any $\hat{P} = (\hat{p}_1 \mid \cdots \mid \hat{p}_n) \in \Phi$. Then, the swap regret cumulated compared to always transforming strategies according to $\hat{P}$ is, by definition,
$$
\begin{aligned}
\sum_{t=1}^{T} u^{(t)}(\hat{P} x^{(t)}) - u^{(t)}(x^{(t)})
&= \sum_{t=1}^{T} u^{(t)}(\hat{P} x^{(t)}) - u^{(t)}(P^{(t)} x^{(t)}) && (x^{(t)} = P^{(t)} x^{(t)}) \\
&= \sum_{t=1}^{T} \left[ \left( \sum_{i=1}^{n} x_i^{(t)} u^{(t)}(\hat{p}_i) \right) - \left( \sum_{i=1}^{n} x_i^{(t)} u^{(t)}(p_i^{(t)}) \right) \right] && \text{(linearity of } u^{(t)}\text{)} \\
&= \sum_{t=1}^{T} \left( \sum_{i=1}^{n} v_i^{(t)}(\hat{p}_i) - v_i^{(t)}(p_i^{(t)}) \right) && \text{(definition of } v_i^{(t)}\text{)} \\
&= \sum_{i=1}^{n} \left( \sum_{t=1}^{T} v_i^{(t)}(\hat{p}_i) - v_i^{(t)}(p_i^{(t)}) \right) && \text{(switching summation order)} \\
&\le \sum_{i=1}^{n} \mathrm{Reg}_i^{(T)}. && \text{(from (1))}
\end{aligned}
$$
Taking a maximum over all $\hat{P} \in \Phi$ concludes the proof. □
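Every step in the chain above is an exact identity except the final inequality, so the decomposition is easy to check numerically. The self-contained sketch below (arbitrary random data; the matrices $P^{(t)}$ simply stand in for whatever the regret minimizers would produce) verifies that the swap regret against any fixed $\hat{P}$ equals $\sum_{i=1}^{n} \sum_{t=1}^{T} \left( v_i^{(t)}(\hat{p}_i) - v_i^{(t)}(p_i^{(t)}) \right)$ once each $x^{(t)}$ is a fixed point of $P^{(t)}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 3, 50

def random_column_stochastic(n):
    M = rng.random((n, n))
    return M / M.sum(axis=0, keepdims=True)

P_hat = random_column_stochastic(n)          # an arbitrary comparator \hat{P} in Phi
lhs = 0.0                                    # sum_t u^(t)(\hat{P} x^(t)) - u^(t)(x^(t))
rhs = np.zeros(n)                            # per-minimizer sums of v_i^(t) differences

for _ in range(T):
    P = random_column_stochastic(n)          # stands in for the assembled P^(t)
    x = np.full(n, 1.0 / n)                  # fixed point x = P x via power iteration
    for _ in range(1000):
        x = P @ x
    u = rng.standard_normal(n)               # an arbitrary linear utility u^(t)

    lhs += u @ (P_hat @ x) - u @ x
    for i in range(n):
        v_i = x[i] * u                       # v_i^(t) = x_i^(t) u^(t)
        rhs[i] += v_i @ P_hat[:, i] - v_i @ P[:, i]

# The swap regret against \hat{P} decomposes exactly into the per-minimizer
# regrets against the corresponding columns \hat{p}_i, as in the proof.
assert np.isclose(lhs, rhs.sum())
```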
2 A general approach: Gordon-Greenwald-Marks’s reduction
Blum-Mansour’s swap regret minimization algorithm is a special case of a much more general construction. Gordon, Greenwald, and Marks [GGM08] show that a Φ-regret minimizer for a strategy set 𝒳 can be constructed starting from the following two ingredients:
1. an external regret minimizer for the set Φ, which we denote $\mathcal{R}_\Phi$; and
2. a fixed point oracle for Φ, that is, an algorithm that, given any 𝜙 ∈ Φ, outputs a fixed point 𝑥 ∈ 𝒳 such that 𝜙(𝑥) = 𝑥.
Intuitively, the external regret minimizer $\mathcal{R}_\Phi$ for Φ has the role of tracking which transformation 𝜙 the decision maker should focus on at each time. The linear utility function $U^{(t)} : \Phi \to \mathbb{R}$ observed by $\mathcal{R}_\Phi$ is constructed from the last-output strategy $x^{(t)}$ and the utility function $u^{(t)}$ observed at time $t$, according to the formula
$$U^{(t)}(\phi) = u^{(t)}(\phi(x^{(t)})). \qquad (2)$$
The final construction is as follows:
• Each call to NextStrategy() first calls $\mathcal{R}_\Phi$.NextStrategy() to obtain the next transformation $\phi^{(t)}$. Then, a fixed point $x^{(t)} = \phi^{(t)}(x^{(t)}) \in \mathcal{X}$ is computed and output.
• Each call to ObserveUtility($u^{(t)}$) with linear utility function $u^{(t)}$ constructs the linear utility function $U^{(t)} : \phi \mapsto u^{(t)}(\phi(x^{(t)}))$ given in (2), and passes it to $\mathcal{R}_\Phi$ via $\mathcal{R}_\Phi$.ObserveUtility($U^{(t)}$).
Graphically, we can summarize the process as in the following block diagram.
Figure 2: Gordon-Greenwald-Marks’s construction of a Φ-regret minimizer. The utility $u^{(t)} : \mathcal{X} \to \mathbb{R}$ is lifted to a utility $U^{(t)} : \Phi \to \mathbb{R}$ in Φ space and passed to the external regret minimizer $\mathcal{R}_\Phi$ for Φ, which cumulates regret $\mathrm{Reg}_\Phi^{(T)}$; its output $\phi^{(t)} \in \Phi$ is mapped to a fixed point $x^{(t)} = \phi^{(t)}(x^{(t)}) \in \mathcal{X}$, yielding a Φ-regret minimizer with Φ-regret $\Phi\text{-}\mathrm{Reg}^{(T)}$.
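In code, the reduction is a thin wrapper around the two ingredients. The sketch below is a minimal illustration (the class and parameter names are mine, not from [GGM08]), instantiated with the simplest possible choice of Φ: the constant transformations $\phi_y : x \mapsto y$ for $y \in \Delta^n$, whose unique fixed point is $y$ itself. With this choice the external regret minimizer over Φ is just a regret minimizer over $\Delta^n$ (here Hedge, as in the earlier sketch), and the construction collapses back to plain external regret minimization, matching the remark at the beginning of the lecture.

```python
import numpy as np

class Hedge:
    """External regret minimizer for Delta^n (multiplicative weights / Hedge)."""
    def __init__(self, n, eta=0.1):
        self.logw = np.zeros(n)
        self.eta = eta
    def next_strategy(self):
        w = np.exp(self.logw - self.logw.max())
        return w / w.sum()
    def observe_utility(self, v):
        self.logw += self.eta * v

class PhiRegretMinimizer:
    """Gordon-Greenwald-Marks reduction (sketch): a Phi-regret minimizer built from
    an external regret minimizer over Phi and a fixed-point oracle for Phi."""
    def __init__(self, R_phi, fixed_point, lift_utility):
        self.R_phi = R_phi                 # external regret minimizer over Phi
        self.fixed_point = fixed_point     # maps phi to some x in X with phi(x) = x
        self.lift_utility = lift_utility   # maps (u, x) to the utility U of eq. (2)

    def next_strategy(self):
        self.phi = self.R_phi.next_strategy()     # phi^(t) in Phi
        self.x = self.fixed_point(self.phi)       # x^(t) = phi^(t)(x^(t))
        return self.x

    def observe_utility(self, u):
        # U^(t)(phi) = u^(t)(phi(x^(t))), passed to the regret minimizer over Phi.
        self.R_phi.observe_utility(self.lift_utility(u, self.x))

# Instantiation with Phi = {constant maps x -> y : y in Delta^n}. Each phi is
# represented directly by the vector y, its fixed point is y itself, and the
# lifted utility is u^(t) (since U^(t)(phi_y) = u^(t) . y), so this recovers
# ordinary external regret minimization over Delta^n.
n = 5
rng = np.random.default_rng(2)
alg = PhiRegretMinimizer(
    R_phi=Hedge(n),
    fixed_point=lambda y: y,
    lift_utility=lambda u, x: u,
)
for _ in range(100):
    x = alg.next_strategy()
    alg.observe_utility(rng.standard_normal(n))
```

Instantiating the same wrapper with Φ equal to the set of stochastic matrices, with the fixed-point oracle returning a stationary distribution of $P^{(t)}$, recovers the Blum-Mansour algorithm of Section 1.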
Theorem 2.1 ([GGM08]). The Φ-regret $\Phi\text{-}\mathrm{Reg}^{(T)}$ cumulated up to time $T$ by the algorithm we have just defined is exactly equal to the (external) cumulative regret $\mathrm{Reg}_\Phi^{(T)}$ cumulated by $\mathcal{R}_\Phi$:
$$\Phi\text{-}\mathrm{Reg}^{(T)} = \mathrm{Reg}_\Phi^{(T)} \qquad \forall\, T = 1, 2, \dots.$$
Because the regret cumulated by $\mathcal{R}_\Phi$ grows sublinearly by hypothesis of it being a regret minimizer, so does the Φ-regret of the Φ-regret minimization algorithm defined above.
Proof. The proof of correctness of the above construction is deceptively simple. Since $\mathcal{R}_\Phi$ outputs transformations $\phi^{(1)}, \phi^{(2)}, \dots \in \Phi$ and receives the utilities $\phi \mapsto u^{(1)}(\phi(x^{(1)})), \phi \mapsto u^{(2)}(\phi(x^{(2)})), \dots$, its cumulative regret $\mathrm{Reg}_\Phi^{(T)}$ is by definition
$$\mathrm{Reg}_\Phi^{(T)} = \max_{\hat{\phi} \in \Phi} \left\{ \sum_{t=1}^{T} \left( u^{(t)}(\hat{\phi}(x^{(t)})) - u^{(t)}(\phi^{(t)}(x^{(t)})) \right) \right\}.$$
Now, since by construction $x^{(t)}$ is a fixed point of $\phi^{(t)}$, we have $\phi^{(t)}(x^{(t)}) = x^{(t)}$, and therefore we can write
$$\mathrm{Reg}_\Phi^{(T)} = \max_{\hat{\phi} \in \Phi} \left\{ \sum_{t=1}^{T} \left( u^{(t)}(\hat{\phi}(x^{(t)})) - u^{(t)}(x^{(t)}) \right) \right\},$$
where the right-hand side is exactly the cumulative Φ-regret $\Phi\text{-}\mathrm{Reg}^{(T)}$ incurred by the Φ-regret minimizer. □
Bibliography
[BM07] A. Blum and Y. Mansour, “From external to internal regret,” Journal of Machine Learning Research, vol. 8, no. 6, 2007.
[GGM08] G. J. Gordon, A. Greenwald, and C. Marks, “No-regret learning in convex games,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 360–367.
Changelog
• Oct 1: Fixed typo in the description of Gordon-Greenwald-Marks’s reduction
These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
