MIT 6.S890 — Topics in Multiagent Learning Thu, Sep 26th 2024
Lecture 7
Learning in games: Bandit feedback
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)★
The mathematical abstraction of a repeated decision maker we considered so far assumes that the
entire utility function—evaluated in the strategies of the other players in the environment—is given
to the decision maker as feedback. This is a strong assumption, and in many cases, the decision
maker only receives partial feedback. In this lecture, we consider the case where the decision maker
receives feedback only on the utility of the chosen action. This is known as the bandit feedback
setting.
1 Setup and general considerations
A bandit regret minimizer is identical to a full-information regret minimizer, except that the
ObserveUtility(𝑢(𝑡)) operation—where 𝑢(𝑡) : 𝑥 ↦ ⟨𝑔(𝑡), 𝑥⟩ is the utility function picked by the envi-
ronment in the full-information case—is replaced with ObserveUtility(𝑤(𝑡)), where 𝑤(𝑡) is the scalar
𝑤(𝑡) ≔ ⟨𝑔(𝑡), 𝑥(𝑡)⟩ ∈ ℝ.
As a general principle, algorithms for the bandit setting are constructed from regret minimizers for
the full-information setting. Indeed, the key idea is to construct an estimator ̃ 𝑔(𝑡) of the (unobserved)
utility gradient 𝑔(𝑡), and feed that into a full-information regret minimizer. The estimator is
constructed from the observed utility 𝑤(𝑡) and the chosen action 𝑥(𝑡).
The utility function can still be picked adversarially by the environment. However, to get guarantees,
it is necessary to reduce the power of the environment by letting the utility 𝑢(𝑡) only depend on
𝑥(1), ..., 𝑥(𝑡−1) but not on 𝑥(𝑡). In other words, the environment can pick the utility adaptively, but
must decide the utility before the learner picks the strategy, and not after. This restriction still allows
convergence to equilibria.
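As a concrete, if schematic, illustration of this reduction, here is a minimal Python sketch. The ObserveUtility operation is the one mentioned above; the NextStrategy name for the operation that outputs the next strategy, as well as the sampler, estimator, and exploration objects, are assumptions made for the sake of the sketch rather than a fixed interface.

```python
# Minimal sketch of the generic reduction from bandit to full-information
# regret minimization: estimate the unobserved gradient g^(t) from the scalar
# feedback w^(t) and feed the estimate to a full-information regret minimizer.

class BanditRegretMinimizer:
    def __init__(self, full_info_rm, sampler, estimator, exploration=None):
        self.rm = full_info_rm          # full-information regret minimizer
        self.sampler = sampler          # draws x^(t) with E_t[x^(t)] = p^(t)
        self.estimator = estimator      # builds g_hat^(t) from (w^(t), x^(t), p^(t))
        self.exploration = exploration  # optional; used for high-probability bounds

    def next_strategy(self):
        y = self.rm.NextStrategy()      # y^(t) from the full-information device
        p = y if self.exploration is None else self.exploration.mix(y)
        self.p = p
        self.x = self.sampler(p)        # point actually played
        return self.x

    def observe_utility(self, w):
        g_hat = self.estimator(w, self.x, self.p)   # estimate of g^(t)
        self.rm.ObserveUtility(g_hat)               # linear utility x -> <g_hat, x>
```

Example 1.1 below depicts the same template as a block diagram; the algorithms of Section 2 are instances of it.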
Example 1.1. Typically, the construction of bandit algorithms follows the following template:
[Block diagram: the bandit regret minimizer wraps a full-information regret minimizer. The full-information regret minimizer outputs 𝑦(𝑡) ∈ 𝒳; an exploration term 𝜉(𝑡) ∈ 𝒳 (used for high-probability regret bounds only) is superimposed to obtain 𝑝(𝑡) ∈ 𝒳, which is fed to the strategy sampler; the strategy sampler outputs the played strategy 𝑥(𝑡) ∈ 𝒳. The observed utility 𝑤(𝑡) is converted by the gradient estimator into an estimate ̃𝑔(𝑡), which is passed back to the full-information regret minimizer.]
The exploration term can be ignored if regret bounds in expectation are sought. Its role becomes
important when high-probability guarantees are sought instead. We explain the difference next.
■ Stochastic regret guarantees. Because the estimation is stochastic, the regret of a bandit algorithm
is a random variable. This adds a layer of complexity when approaching the analysis of bandit
algorithms. As a rule of thumb, three “flavors” of guarantees tend to be considered in the literature.
We will list them from weakest (and easiest to obtain) to strongest (and hardest to obtain):
• Guarantees on the pseudoregret, of the form
PseudoReg(𝑇) ≔ max_{𝑥̂∈𝒳} 𝔼[∑_{𝑡=1}^{𝑇} ⟨𝑔(𝑡), 𝑥̂⟩ − ∑_{𝑡=1}^{𝑇} ⟨𝑔(𝑡), 𝑥(𝑡)⟩] = 𝑜(𝑇).
• Guarantees on the expected regret, of the form
𝔼[Reg(𝑇)] ≔ 𝔼[max_{𝑥̂∈𝒳} ∑_{𝑡=1}^{𝑇} ⟨𝑔(𝑡), 𝑥̂⟩ − ∑_{𝑡=1}^{𝑇} ⟨𝑔(𝑡), 𝑥(𝑡)⟩] = 𝑜(𝑇).
Note the change of order between the expectation and the maximum compared with the
pseudoregret introduced in the previous bullet point.
• High-probability regret guarantees, typically of the form
ℙ[max_{𝑥̂∈𝒳} ∑_{𝑡=1}^{𝑇} ⟨𝑔(𝑡), 𝑥̂⟩ − ∑_{𝑡=1}^{𝑇} ⟨𝑔(𝑡), 𝑥(𝑡)⟩ ≤ 𝑜(𝑇)·√(log(1/𝛿))] ≥ 1 − 𝛿
for any 𝛿 > 0 small enough.
Pseudoregret and expected regret guarantees are different: since max 𝔼 ≤ 𝔼 max, a bound on the expected regret implies the same bound on the pseudoregret, but the converse is not true in general. In particular, guarantees on the pseudoregret alone are not strong enough to conclude convergence to the set of equilibria.
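To build intuition for the gap between the two notions, the following small numerical illustration (hypothetical numbers, assuming NumPy) compares max 𝔼 and 𝔼 max of the comparator term for a two-action decision maker facing random payoffs.

```python
# Illustration that swapping max and expectation matters: E[max_a ...] can be
# much larger than max_a E[...]. Hypothetical numbers, for intuition only.
import numpy as np

rng = np.random.default_rng(0)
T, n_actions, n_runs = 100, 2, 10_000

# For each run, the cumulative payoff sum_t g_a^(t) of each action,
# drawn directly as a Gaussian with standard deviation sqrt(T).
cum_payoff = rng.normal(scale=np.sqrt(T), size=(n_runs, n_actions))

max_of_expectation = np.max(cum_payoff.mean(axis=0))   # enters the pseudoregret
expectation_of_max = np.mean(cum_payoff.max(axis=1))   # enters the expected regret
print(max_of_expectation, expectation_of_max)          # the second number is much larger
```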
2 Adversarial bandit learning in normal-form games
Let’s start from the case of normal-form games, in which our decision maker faces the choice of picking an action out of a finite set 𝐴. This setting is also known as the adversarial multi-armed bandit problem. We have 𝒳 = Δ(𝐴).
■ Strategy sampler. In this setting, most algorithms use the natural strategy sampler: given a distribution 𝑝(𝑡) ∈ Δ(𝐴), the decision maker samples an action 𝑎(𝑡) ∈ 𝐴 according to the probabilities in 𝑝(𝑡). The vector 𝑥(𝑡) is then set to the deterministic distribution 𝑒_{𝑎(𝑡)}. Clearly, 𝔼𝑡[𝑥(𝑡)] = 𝑝(𝑡).
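A minimal sketch of this sampler (assuming NumPy):

```python
# Natural strategy sampler over Delta(A): sample a^(t) ~ p^(t) and play the
# vertex x^(t) = e_{a^(t)}, so that E_t[x^(t)] = p^(t). Illustrative sketch.
import numpy as np

def sample_vertex(p, rng):
    a = rng.choice(len(p), p=p)   # a^(t) ~ p^(t)
    x = np.zeros(len(p))
    x[a] = 1.0                    # x^(t) = e_{a^(t)}
    return a, x
```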
■ Gradient estimator. For this setting, the standard gradient estimator is the importance sampling estimator. Given the utility scalar 𝑤(𝑡) ∈ [0, 1], the importance sampling estimator is defined as
̃𝑔(𝑡) ≔ (𝑤(𝑡) / 𝑝(𝑡)_{𝑎(𝑡)}) 𝑒_{𝑎(𝑡)} ∈ ℝ^𝐴.
Theorem 2.1. Let 𝑤(𝑡) = ⟨𝑔(𝑡), 𝑥(𝑡)⟩ where 𝑔(𝑡) is some unknown utility gradient. Then, the
importance sampling estimator ̃ 𝑔(𝑡) is unbiased, that is, 𝔼𝑡[̃𝑔(𝑡)] = 𝑔(𝑡).
Proof. The result follows by direct calculation. The randomness is due to the sampling of the action 𝑎(𝑡). Each action 𝑎 ∈ 𝐴 is sampled with probability 𝑝(𝑡)_𝑎. Hence,
𝔼𝑡[̃𝑔(𝑡)] = ∑_{𝑎∈𝐴} 𝑝(𝑡)_𝑎 (𝑤(𝑡) / 𝑝(𝑡)_𝑎) 𝑒_𝑎 = ∑_{𝑎∈𝐴} 𝑝(𝑡)_𝑎 (⟨𝑔(𝑡), 𝑒_𝑎⟩ / 𝑝(𝑡)_𝑎) 𝑒_𝑎 = ∑_{𝑎∈𝐴} 𝑔(𝑡)_𝑎 𝑒_𝑎 = 𝑔(𝑡). □
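The calculation above can also be checked numerically. The following sketch (made-up gradient and distribution, assuming NumPy) averages the importance sampling estimator over many independent samples and recovers the true gradient.

```python
# Empirical check of Theorem 2.1: the sample average of the importance
# sampling estimator approaches the true gradient g^(t). Made-up numbers.
import numpy as np

rng = np.random.default_rng(0)
g = np.array([0.3, 0.7, 0.5])     # true (unobserved) utility gradient
p = np.array([0.2, 0.5, 0.3])     # sampling distribution p^(t)

def importance_sampling_estimate(g, p, rng):
    a = rng.choice(len(p), p=p)   # sampled action a^(t)
    w = g[a]                      # observed utility w^(t) = <g^(t), e_{a^(t)}>
    g_hat = np.zeros(len(p))
    g_hat[a] = w / p[a]           # (w^(t) / p^(t)_{a^(t)}) e_{a^(t)}
    return g_hat

samples = [importance_sampling_estimate(g, p, rng) for _ in range(200_000)]
print(np.round(np.mean(samples, axis=0), 3))   # close to [0.3, 0.7, 0.5]
```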
2.1 The Exp3 algorithm
The Exp3 (short for “exponential weights for exploration and exploitation”) algorithm, introduced
by Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. [Aue+02], applies the multiplicative
weights update (MWU) algorithm to the importance sampling estimates. No exploration term is added, so the strategy sampler draws 𝑥(𝑡) directly according to the output 𝑦(𝑡) of MWU, that is, 𝑝(𝑡) = 𝑦(𝑡) (see also Example 1.1).
It is important to note that the analysis of MWU we did in Lectures 5 and 6 does not directly give a good bound on the regret incurred by the full-information regret minimizer: the estimated utilities can have a very large range due to the division by the probabilities 𝑝(𝑡)_𝑎. However, a sharper analysis of MWU is possible in this case.
Theorem 2.2. If the regret minimizer is set to MWU with learning rate 𝜂 = √(log|𝐴|/(𝑇|𝐴|)), the Exp3 algorithm guarantees pseudoregret
PseudoReg(𝑇) = 𝑂(√(𝑇 |𝐴| log|𝐴|)).
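Putting the pieces together, here is a minimal sketch of Exp3 (assuming NumPy, utilities in [0, 1], and a horizon 𝑇 known in advance); the learning rate is the one from Theorem 2.2, and the get_utility callback standing in for the environment is a placeholder.

```python
# Minimal Exp3 sketch: MWU (exponential weights) run on importance-sampled
# utility estimates. Assumes utilities in [0, 1] and a known horizon T.
import numpy as np

def exp3(T, n_actions, get_utility, rng=None):
    rng = rng or np.random.default_rng()
    eta = np.sqrt(np.log(n_actions) / (T * n_actions))  # learning rate of Theorem 2.2
    cum_g_hat = np.zeros(n_actions)                      # cumulative estimated gradient
    actions = []
    for t in range(T):
        # MWU strategy y^(t) proportional to exp(eta * cumulative estimates)
        y = np.exp(eta * (cum_g_hat - cum_g_hat.max()))  # shift for numerical stability
        y /= y.sum()
        a = rng.choice(n_actions, p=y)                   # sample a^(t) ~ y^(t)
        w = get_utility(t, a)                            # bandit feedback w^(t)
        cum_g_hat[a] += w / y[a]                         # importance sampling estimate
        actions.append(a)
    return actions
```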
2.2 Tsallis entropy
It can be shown that, information-theoretically, no bandit learning algorithm over a finite set of actions 𝐴 can achieve better than Ω(√(𝑇|𝐴|)) expected regret in general. The regret guaranteed
by the Exp3 algorithm is therefore optimal only up to a logarithmic factor. It remained open for a
long time whether this logarithmic factor could be removed. A positive answer was given recently
by Audibert, J.-Y., & Bubeck, S. [AB10], who proposed the idea of replacing the MWU algorithm
with the FTRL algorithm instantiated with the (1/2)-Tsallis entropy regularizer
𝜓(𝑥) = 2 − 2 ∑_{𝑎∈𝐴} √𝑥_𝑎.
Theorem 2.3. If the regret minimizer is set to the FTRL algorithm with the (1/2)-Tsallis entropy regularizer and learning rate 𝜂 = √(1/𝑇), the resulting bandit algorithm guarantees pseudoregret
PseudoReg(𝑇) = 𝑂(√(𝑇|𝐴|)),
which is the optimal bound for bandit learning on finite probability distributions.
A simplified analysis can also be found in [ZS21].
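The corresponding FTRL step over the simplex can be computed with a one-dimensional search for the normalization multiplier. The sketch below uses the maximization (utility) convention of these notes and bisection for the multiplier; it is an illustration of the update, not the exact formulation analyzed in [AB10] or [ZS21].

```python
# FTRL step with the (1/2)-Tsallis entropy psi(x) = 2 - 2*sum_a sqrt(x_a):
#   y = argmax_{x in simplex} { eta*<G, x> - psi(x) },
# where G is the sum of the estimated gradients so far. First-order conditions
# give x_a = 1/(lam - eta*G_a)^2 for a multiplier lam > eta*max_a G_a, chosen so
# that the entries sum to 1; the multiplier is found by bisection. Sketch only.
import numpy as np

def tsallis_ftrl_step(G, eta, iters=60):
    s = eta * np.asarray(G, dtype=float)
    lo = s.max() + 1.0                 # at this multiplier, sum_a x_a >= 1
    hi = s.max() + np.sqrt(len(s))     # at this multiplier, sum_a x_a <= 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        x = 1.0 / (lam - s) ** 2
        if x.sum() > 1.0:              # sum too large -> multiplier too small
            lo = lam
        else:
            hi = lam
    x = 1.0 / (0.5 * (lo + hi) - s) ** 2
    return x / x.sum()                 # normalize away residual bisection error
```

In a bandit loop, this step simply replaces the exponential-weights update in the Exp3 sketch above, while the strategy sampler and the importance sampling estimator remain unchanged.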
2.3 The Exp3.P algorithm
The Exp3.P algorithm, introduced by Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R.
E. [Aue+02], is a variant of the Exp3 algorithm to achieve high-probability regret guarantees.
Intuitively, the difficulty with getting high-probability bounds for the regret of Exp3 is due to the importance sampling: the gradient estimator has entries of magnitude inversely proportional to the probabilities output by MWU. This makes the variance of the estimator large, and the concentration
of the regret around its expectation difficult. To sidestep the issue, the Exp3.P algorithm uses the
idea of superimposing a uniform exploration term to the output 𝑦(𝑡) of MWU. More specifically, the
input to the strategy sampler is set to
𝑝(𝑡) ≔ (1 − 𝛾) 𝑦(𝑡) + 𝛾 𝟏/|𝐴| ∈ Δ(𝐴),
where 𝛾 ∈ [0, 1] is a parameter.
The exploration term ensures that every action is sampled with probability at least 𝛾/|𝐴|, which bounds the magnitude of the importance sampling estimates and reduces their variance. However, it is important to observe that this correction incurs a regret penalty due to the
fact that MWU recommended 𝑦(𝑡), and yet the decision maker sampled from 𝑝(𝑡). The effect of such
misalignment grows with the exploration parameter 𝛾. Nonetheless, the following can be shown.
Theorem 2.4 ([AR09]). Consider the Exp3.P algorithm with exploration parameter 𝛾 = √(|𝐴|/𝑇) and learning rate 𝜂 = √(log|𝐴|/(𝑇|𝐴|)). Then, for any 𝛿 ∈ (0, 1),
ℙ[Reg(𝑇) ≤ 𝑂(√(𝑇 |𝐴| log(|𝐴|/𝛿)))] ≥ 1 − 𝛿.
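Relative to the Exp3 sketch above, the change described here is the exploration mixing applied before sampling; a minimal sketch of that step follows. (The full Exp3.P algorithm of [Aue+02] also adjusts the gradient estimates to obtain its high-probability guarantee, a detail omitted here.)

```python
# Uniform exploration mixing used by Exp3.P (sketch): every action is sampled
# with probability at least gamma/|A|, which caps the importance sampling
# weights at |A|/gamma and hence the variance of the estimator.
import numpy as np

def mix_with_uniform(y, gamma):
    n = len(y)
    return (1.0 - gamma) * y + gamma * np.ones(n) / n   # p^(t) in Delta(A)

# Parameters from Theorem 2.4, for horizon T and n_actions = |A|:
#   gamma = sqrt(n_actions / T)
#   eta   = sqrt(log(n_actions) / (T * n_actions))
```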
3 Adversarial bandit learning on general convex domains
Today, we know that bandit optimization is possible well beyond probability simplices. In fact, one can construct bandit algorithms for any convex and compact domain 𝒳 ⊆ ℝ^𝑑. In particular, we mention the general result by Abernethy, J. D., Hazan, E., & Rakhlin, A. [AHR08], who showed that a bandit algorithm can be constructed starting from a full-information regret minimizer built using the FTRL algorithm with a self-concordant distance-generating function.
Bibliography
[Aue+02] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
[AB10] J.-Y. Audibert and S. Bubeck, “Regret bounds and minimax policies under partial monitoring,” Journal of Machine Learning Research, vol. 11, pp. 2785–2836, 2010.
[ZS21] J. Zimmert and Y. Seldin, “Tsallis-INF: An optimal algorithm for stochastic and adver-
sarial bandits,” Journal of Machine Learning Research, vol. 22, no. 28, pp. 1–49, 2021.
[AR09] J. Abernethy and A. Rakhlin, “Beating the Adaptive Bandit with High Probability,”
Jan. 2009. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/
EECS-2009-10.html
[AHR08] J. D. Abernethy, E. Hazan, and A. Rakhlin, “Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization,” in COLT, 2008, pp. 263–274.
Changelog
• Sep 27: fixed a typo in Theorem 2.3.
★These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.