


MIT 6.S890 — Topics in Multiagent Learning Thu, Nov 14th 2024
Lecture 16
Markov (aka stochastic) games
Instructor: Prof. Gabriele Farina (gfarina@mit.edu)
In this last lecture for Part III, we take a quick peek at one last model of games that is especially
popular in (multi-agent) reinforcement learning: Markov games.
1 The model
The model of Markov games, also known as stochastic games, was introduced by Shapley [Sha53]
as a generalization of the Markov decision process to the multi-agent setting. In this model,
the agents interact with each other and with the environment, and the environment is affected by
the joint actions of the agents. In this lecture, we will draw a distinction between infinite-horizon
games, and finite-horizon games (also known as episodic). We start from the former class.
Definition 1.1 (Infinite-horizon stochastic game). An 𝑚-player, infinite-horizon, finite state/
action space, stochastic (aka Markov) game is a tuple 𝐺 = (𝑆, 𝐴, ℙ, 𝑟, 𝛾, 𝜇) where
• 𝑆 is a finite set of states,
• 𝐴 = 𝐴1 × 𝐴2 × ... × 𝐴𝑚 is the set of joint actions,
• ℙ(𝑠′|𝑠, 𝑎), for 𝑠, 𝑠′ ∈ 𝑆 and 𝑎 ∈ 𝐴, is the transition probability function of the environment,
• 𝑟 = (𝑟1, ..., 𝑟𝑚) is a tuple of reward functions, where 𝑟𝑖(𝑠, 𝑎) specifies the reward of player 𝑖 when the joint action 𝑎 is played in state 𝑠,
• 𝛾 ∈ (0, 1) is a discount factor, and
• 𝜇 ∈ Δ(𝑆) is the initial state distribution.
Given an infinite state-action sequence (𝑠(𝑡), 𝑎(𝑡))𝑡≥0, each player 𝑖 derives a discounted utility
𝑢𝑖((𝑠(𝑡), 𝑎(𝑡))𝑡≥0) ≔ ∑𝑡≥0 𝛾^𝑡 ⋅ 𝑟𝑖(𝑠(𝑡), 𝑎(𝑡)).
As the name suggests, an infinite-horizon stochastic game is played over an infinite number of steps.
The goal of each player is to maximize their discounted utility. By fixing the number of steps, we
instead obtain a finite-horizon stochastic game, as defined next.
Definition 1.2 (Finite-horizon stochastic game). In the finite-horizon case, the game is endowed
with an additional parameter 𝐻 ∈ ℕ, called the horizon, which indicates for how many steps the
game unfolds. For these games, a value of 𝛾 = 1 (i.e., no discounting) is acceptable, since the
utility is a finite sum and therefore cannot diverge.
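To make the objects in Definitions 1.1 and 1.2 concrete, here is a minimal Python/numpy sketch of the model together with a Monte-Carlo computation of the (truncated) discounted utilities. The container MarkovGame, the function discounted_utilities, and the tensor layouts are illustrative choices, not part of the lecture.
```python
# Illustrative container for a finite Markov game and a Monte-Carlo estimate of
# the discounted utilities; the truncation at a finite horizon only serves to
# make the infinite sum computable.
from dataclasses import dataclass
import numpy as np

@dataclass
class MarkovGame:
    n_states: int            # |S|
    action_sets: list[int]   # (|A_1|, ..., |A_m|)
    P: np.ndarray            # transitions, shape (|S|, |A_1|, ..., |A_m|, |S|)
    r: np.ndarray            # rewards, shape (m, |S|, |A_1|, ..., |A_m|)
    gamma: float             # discount factor in (0, 1)
    mu: np.ndarray           # initial state distribution, shape (|S|,)

def discounted_utilities(game, policies, horizon=1000, rng=None):
    """Estimates u_i = sum_t gamma^t r_i(s(t), a(t)) under stationary policies,
    where policies[i] has shape (|S|, |A_i|); the sum is truncated at `horizon`."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(game.action_sets)
    s = rng.choice(game.n_states, p=game.mu)
    u = np.zeros(m)
    for t in range(horizon):
        # each player i samples an action from pi_i(. | s)
        a = tuple(rng.choice(game.action_sets[i], p=policies[i][s]) for i in range(m))
        u += game.gamma ** t * game.r[(slice(None), s) + a]   # r_i(s, a) for every i
        s = rng.choice(game.n_states, p=game.P[(s,) + a])     # s' ~ P(. | s, a)
    return u
```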
2 Strategies: Markovianity and stationarity
In general, when we talk about a strategy, or policy, we allow for the possibility that the players
remember their past actions and transitions. In other words, when left unqualified, the term policy
allows for history-dependence and we think of it as a mapping
𝜋𝑖 : 𝑆 × (𝑆 × 𝐴)∗ → Δ(𝐴𝑖),
where the asterisk denotes a sequence of arbitrary (finite) length, representing the history of play up to the current point.
When further restrictions are imposed on how the policy can depend on the history, we arrive at
two important distinctions.
Definition 2.1 (Markovian policy). A policy is history-independent, or Markovian, if it only
depends on the current state and time. This means that given any two histories of the same
length, the policy is the same. In other words, the policy is a function
𝜋𝑖 : 𝑆 × ℕ → Δ(𝐴𝑖).
Definition 2.2 (Stationary policy). A policy is stationary if it depends only on the current state,
and not on the time step. In other words, the policy is just a function of the current state
𝜋𝑖 : 𝑆 → Δ(𝐴𝑖).
(Note that stationarity implies Markovianity).
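Purely as an illustration of the three definitions above, the corresponding objects could be typed as follows in Python (the type aliases are ours, not part of the lecture):
```python
# Hypothetical type aliases for the three notions of policy; a history is the
# sequence of past (state, joint action) pairs, per the mapping S x (S x A)* -> Δ(A_i).
from typing import Callable, Sequence
import numpy as np

State = int
JointAction = tuple[int, ...]
Distribution = np.ndarray  # a vector in the simplex over A_i

HistoryPolicy = Callable[[State, Sequence[tuple[State, JointAction]]], Distribution]
MarkovPolicy = Callable[[State, int], Distribution]      # current state and time
StationaryPolicy = Callable[[State], Distribution]       # current state only
```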
3 Equilibria in Markov games
3.1 The finite-horizon case
If we are seeking non-Markovian strategies, a finite-horizon stochastic game can just be “unrolled”
and converted into a perfect-recall extensive-form game.
When Markovian strategies are sought, the process is not as straightforward. However, the
game can still be solved efficiently via backward induction, as summarized in the procedure below.
The idea is to solve the game from right to left (that is, from 𝑡 = 𝐻 − 1 back to 𝑡 = 0), inductively
picking equilibrium strategies for the stage games whose utilities are the immediate reward plus the
continuation value 𝑉𝑖,𝑡(𝑠) at every state 𝑠.
[Figure: timeline of the time steps 𝑡 = 0, ..., 𝐻 − 1; backward induction proceeds from 𝑡 = 𝐻 − 1 back to 𝑡 = 0.]

Initialization:
𝑉𝑖,𝐻(𝑠) ≔ 0 for all states 𝑠 ∈ 𝑆 and players 𝑖.
Inductive step (at time 𝑡):
• Assume given 𝑉𝑖,𝑡+1 : 𝑆 → ℝ.
• For each 𝑠 ∈ 𝑆, construct a normal-form game where player 𝑖’s utility for the joint action 𝑎 is defined as
𝐹𝑖,𝑠(𝑎) ≔ 𝑟𝑖(𝑠, 𝑎) + 𝔼𝑠′∼ℙ(⋅|𝑠,𝑎)[𝑉𝑖,𝑡+1(𝑠′)].
• Compute a Nash equilibrium of the normal-form game given by the utilities 𝐹, and let that be 𝜋(⋅|𝑠, 𝑡) ∈ Δ(𝐴).
• Set 𝑉𝑖,𝑡(𝑠) ≔ 𝔼𝑎∼𝜋(⋅|𝑠,𝑡)[𝐹𝑖,𝑠(𝑎)].
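Below is a sketch of the backward-induction procedure in Python/numpy, reusing the illustrative MarkovGame container from Section 1. The stage-game equilibrium computation is abstracted into a hypothetical callback solve_stage_nash, which receives the utility tensors 𝐹𝑖,𝑠 and returns one mixed strategy per player; any Nash solver appropriate for the setting (for instance, a linear program in the two-player zero-sum case) can be plugged in.
```python
import numpy as np

def backward_induction(game, H, solve_stage_nash):
    """Markovian policies pi[i][t][s] (a distribution over A_i) and continuation
    values V[i, t, s], following the boxed procedure above."""
    m, S = len(game.action_sets), game.n_states
    V = np.zeros((m, H + 1, S))                       # initialization: V_{i,H} = 0
    pi = [[[None] * S for _ in range(H)] for _ in range(m)]
    for t in range(H - 1, -1, -1):                    # induct from t = H-1 down to t = 0
        for s in range(S):
            # F_{i,s}(a) = r_i(s, a) + E_{s' ~ P(.|s,a)}[ V_{i,t+1}(s') ]
            cont = np.tensordot(game.P[s], V[:, t + 1, :], axes=([m], [1]))  # (|A_1|,...,|A_m|, m)
            F = game.r[:, s] + np.moveaxis(cont, -1, 0)                      # (m, |A_1|,...,|A_m|)
            stage = solve_stage_nash(F)               # hypothetical Nash solver for the stage game
            for i in range(m):
                pi[i][t][s] = stage[i]
            # V_{i,t}(s) = E_{a ~ pi(.|s,t)}[ F_{i,s}(a) ], with pi a product distribution
            joint = stage[0]
            for i in range(1, m):
                joint = np.multiply.outer(joint, stage[i])
            V[:, t, s] = np.tensordot(F, joint, axes=(list(range(1, m + 1)), list(range(m))))
    return pi, V
```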
Finally, we remark that in finite-horizon games, stationary policies are typically not optimal, because
optimal behavior usually depends on the amount of time left. In infinite-horizon games, however,
equilibria in stationary policies exist, as we show next.
3.2 The infinite-horizon case
Theorem 3.1 ([Fin64; Tak64]). Every infinite-horizon discounted, stochastic game with a
finite number of states, actions, and players, has a Nash equilibrium in stationary, Markovian
strategies. More formally, there exists a collection of policies 𝜋1, ..., 𝜋𝑚 where 𝜋𝑖 : 𝑆 → Δ(𝐴𝑖)
such that
𝑢𝑖(𝜋𝑖, 𝜋−𝑖) ≥ 𝑢𝑖(𝜋𝑖′, 𝜋−𝑖)   ∀𝑖, 𝜋𝑖′,
where 𝜋𝑖′ is any policy for player 𝑖, not necessarily Markovian.
Proof. Given a policy profile 𝜋 = (𝜋1, ..., 𝜋𝑚) and a player 𝑖 ∈ [𝑚], we will introduce the
following notation:
• 𝑣𝑖^𝜋(𝑠), for 𝑠 ∈ 𝑆, is the infinite discounted utility of player 𝑖 if the game started at state 𝑠
and all players used policies 𝜋1, ..., 𝜋𝑚. In symbols,
𝑣𝑖^𝜋(𝑠) = ∑𝑎 𝑟𝑖(𝑠, 𝑎) ⋅ 𝜋(𝑎|𝑠) + 𝛾 ∑𝑠′ 𝑣𝑖^𝜋(𝑠′) ∑𝑎 𝜋(𝑎|𝑠) ⋅ ℙ(𝑠′|𝑠, 𝑎),
where we denote the first term ∑𝑎 𝑟𝑖(𝑠, 𝑎) ⋅ 𝜋(𝑎|𝑠) by 𝑟𝑖^𝜋(𝑠), and the coefficient ∑𝑎 𝜋(𝑎|𝑠) ⋅ ℙ(𝑠′|𝑠, 𝑎) of 𝑣𝑖^𝜋(𝑠′) by Γ^𝜋(𝑠, 𝑠′).
This is a linear system of equations in the variables 𝑣𝑖^𝜋(𝑠), which we can rewrite more
compactly as
(𝐼 − 𝛾Γ^𝜋) 𝑣𝑖^𝜋 = 𝑟𝑖^𝜋
(a numerical sketch of this policy-evaluation step is given right after the proof).
We now argue that the matrix 𝐼 − 𝛾Γ^𝜋 is invertible. To see this, note that the matrix Γ^𝜋 is
a stochastic matrix: for every 𝑠 ∈ 𝑆,
∑𝑠′ Γ^𝜋(𝑠, 𝑠′) = ∑𝑠′ ∑𝑎 𝜋(𝑎|𝑠) ⋅ ℙ(𝑠′|𝑠, 𝑎) = ∑𝑎 𝜋(𝑎|𝑠) (∑𝑠′ ℙ(𝑠′|𝑠, 𝑎)) = ∑𝑎 𝜋(𝑎|𝑠) = 1.
Since 𝛾 < 1 by Definition 1.1, we have that 𝐼 − 𝛾Γ^𝜋 is strictly diagonally dominant, which
implies that 𝐼 − 𝛾Γ^𝜋 cannot be singular. Therefore, the system of equations has a unique
solution, which corresponds to
𝑣𝑖^𝜋 = (𝐼 − 𝛾Γ^𝜋)⁻¹ 𝑟𝑖^𝜋.
Furthermore, the values 𝑣𝑖^𝜋 are continuous in the policies 𝜋.
• 𝑞𝑖^𝜋(𝑠, 𝑎𝑖), for 𝑠 ∈ 𝑆 and 𝑎𝑖 ∈ 𝐴𝑖, is the infinite discounted utility of player 𝑖 if the game
started at state 𝑠 and the players used policies 𝜋1, ..., 𝜋𝑚, with the only exception that the very
first action of player 𝑖 is set to 𝑎𝑖. In symbols, writing 𝑎 = (𝑎𝑖, 𝑎−𝑖),
𝑞𝑖^𝜋(𝑠, 𝑎𝑖) = ∑𝑎−𝑖 𝑟𝑖(𝑠, 𝑎) ⋅ 𝜋−𝑖(𝑎−𝑖|𝑠) + 𝛾 ∑𝑠′ 𝑣𝑖^𝜋(𝑠′) ∑𝑎−𝑖 𝜋−𝑖(𝑎−𝑖|𝑠) ⋅ ℙ(𝑠′|𝑠, 𝑎).
Like before, the function 𝑞𝑖^𝜋 is continuous in the policies 𝜋, since everything on the right-
hand side is continuous, including the 𝑣𝑖^𝜋 as discussed above. Furthermore,
𝑣𝑖^𝜋(𝑠) = ∑𝑎𝑖 𝜋𝑖(𝑎𝑖|𝑠) ⋅ 𝑞𝑖^𝜋(𝑠, 𝑎𝑖).   (1)
We now define a Nash-type function 𝜑, similar to what we used in Lecture 2, mapping policy
profiles to improved policy profiles as follows:
∀𝑖, 𝑠, 𝑎𝑖 :  𝜋𝑖′(𝑎𝑖|𝑠) ← (𝜋𝑖(𝑎𝑖|𝑠) + [𝑞𝑖^𝜋(𝑠, 𝑎𝑖) − 𝑣𝑖^𝜋(𝑠)]+) / (1 + ∑𝑎𝑖′ [𝑞𝑖^𝜋(𝑠, 𝑎𝑖′) − 𝑣𝑖^𝜋(𝑠)]+),
where [𝑥]+ ≔ max{0, 𝑥}.
This mapping is continuous over the convex compact set of all stationary Markov policy profiles.
Hence, by Brouwer’s fixed-point theorem, there exists a fixed point 𝜋 = 𝜑(𝜋).
To complete the proof, we then only have to argue that the fixed point 𝜋 is a Nash equilibrium.
Using the same logic as Lecture 2, we infer that
∀𝑖 ∈ [𝑚], 𝑠 ∈ 𝑆, and 𝑎𝑖 ∈ 𝐴𝑖,  𝑣𝑖^𝜋(𝑠) ≥ 𝑞𝑖^𝜋(𝑠, 𝑎𝑖).   (2)
We need to show that, fixing 𝜋−𝑖, the policy 𝜋𝑖 is a best response for each player 𝑖. From the point of view
of player 𝑖, computing a best response amounts to solving a Markov decision process (MDP)
supported on 𝑆 with rewards and transitions given by
𝑟̃𝑖(𝑠, 𝑎𝑖) ≔ ∑𝑎−𝑖 𝑟𝑖(𝑠, 𝑎) ⋅ 𝜋−𝑖(𝑎−𝑖|𝑠),   ℙ̃(𝑠′|𝑠, 𝑎𝑖) ≔ ∑𝑎−𝑖 ℙ(𝑠′|𝑠, 𝑎) ⋅ 𝜋−𝑖(𝑎−𝑖|𝑠).
Equations (1) and (2) together imply that the expected discounted payoff 𝑣̃𝑖^{𝜋𝑖}(𝑠) of policy 𝜋𝑖 starting at 𝑠 in
this MDP satisfies
∀𝑠 ∈ 𝑆,  𝑣̃𝑖^{𝜋𝑖}(𝑠) = max𝑎𝑖 {𝑟̃𝑖(𝑠, 𝑎𝑖) + 𝛾 ∑𝑠′ 𝑣̃𝑖^{𝜋𝑖}(𝑠′) ⋅ ℙ̃(𝑠′|𝑠, 𝑎𝑖)}.
The previous condition is the Bellman optimality equation for the MDP. From the theory of MDPs, we
conclude that 𝜋𝑖 is an optimal policy for the MDP, and therefore 𝜋𝑖 is a best response to 𝜋−𝑖. ∎
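As a complement to the proof, here is a small numerical sketch (Python/numpy, reusing the illustrative MarkovGame container from Section 1) of the policy-evaluation step used above: it builds 𝑟𝑖^𝜋 and Γ^𝜋 for given stationary policies and solves the linear system (𝐼 − 𝛾Γ^𝜋)𝑣 = 𝑟𝑖^𝜋. The quantities 𝑞𝑖^𝜋 can be computed analogously; the function name is ours.
```python
import numpy as np

def evaluate_stationary_policies(game, policies, i):
    """v_i^pi as the unique solution of (I - gamma * Gamma^pi) v = r_i^pi,
    where policies[j] has shape (|S|, |A_j|)."""
    S, m = game.n_states, len(game.action_sets)
    r_pi = np.zeros(S)
    Gamma = np.zeros((S, S))
    for s in range(S):
        # joint(a) = prod_j pi_j(a_j | s), a tensor of shape (|A_1|, ..., |A_m|)
        joint = policies[0][s]
        for j in range(1, m):
            joint = np.multiply.outer(joint, policies[j][s])
        r_pi[s] = np.sum(game.r[i, s] * joint)                    # r_i^pi(s)
        Gamma[s] = np.tensordot(game.P[s], joint,                 # Gamma^pi(s, .)
                                axes=(list(range(m)), list(range(m))))
    # I - gamma * Gamma^pi is strictly diagonally dominant, hence invertible
    return np.linalg.solve(np.eye(S) - game.gamma * Gamma, r_pi)
```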
From a computational point of view, finding equilibria in Markov games is a challenging task.
In general, the complexity of the problem is open already in the two-player zero-sum case. However,
the problem becomes tractable if the discount factor 𝛾 is bounded away from 1 and the goal is to
compute approximate equilibria. The computation of correlated and coarse correlated equilibria is
also mostly open, with some hardness results depending on the type of strategies under consideration.
3.3 Shapley’s theorem for two-player zero-sum Markov games
Finally, we mention a result by Shapley [Sha53] that characterizes the value of two-player
zero-sum Markov games. The result is a generalization of the minimax theorem for two-player
zero-sum normal-form games.
Theorem 3.2 (Shapley’s minimax theorem). The value of a two-player zero-sum Markov game
is given by
𝑣 = max𝜋1 min𝜋2 𝑢1(𝜋1, 𝜋2) = min𝜋2 max𝜋1 𝑢1(𝜋1, 𝜋2),
where the max and min can be taken over all policies, Markov policies, and for infinite-horizon
games, even stationary Markov policies.
The proof of the theorem is based on showing that a certain Bellman-style value-update operator is a contraction.
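To make the contraction remark concrete, here is a sketch (assuming scipy and the illustrative MarkovGame container from Section 1, with 𝑚 = 2 and 𝑟2 = −𝑟1) of Shapley-style value iteration: repeatedly apply the operator (𝑇𝑣)(𝑠) = val[𝑄𝑠(𝑣)], where 𝑄𝑠(𝑣)(𝑎1, 𝑎2) = 𝑟1(𝑠, 𝑎1, 𝑎2) + 𝛾 ∑𝑠′ ℙ(𝑠′|𝑠, 𝑎1, 𝑎2) 𝑣(𝑠′) and val denotes the minimax value of a matrix game, computed here with a linear program. Since 𝑇 is a 𝛾-contraction in the sup norm, the iterates converge to the per-state values of the game. All function names are ours.
```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value max_x min_y x^T M y of the zero-sum matrix game M (row player maximizes)."""
    n_rows, n_cols = M.shape
    # Variables z = (x_1, ..., x_n, t); maximize t subject to (M^T x)_j >= t, x in the simplex.
    c = np.zeros(n_rows + 1); c[-1] = -1.0               # minimize -t
    A_ub = np.hstack([-M.T, np.ones((n_cols, 1))])       # t - (M^T x)_j <= 0
    b_ub = np.zeros(n_cols)
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def shapley_value_iteration(game, iters=1000):
    """Iterates v <- T v; returns an approximation of the game value at each state."""
    S = game.n_states
    v = np.zeros(S)
    for _ in range(iters):
        new_v = np.empty(S)
        for s in range(S):
            Q = game.r[0, s] + game.gamma * game.P[s] @ v   # shape (|A_1|, |A_2|)
            new_v[s] = matrix_game_value(Q)
        v = new_v
    return v
```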
Bibliography
[Sha53] L. S. Shapley, “Stochastic games,” Proceedings of the National Academy of Sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
[Tak64] M. Takahashi, “Equilibrium points of stochastic non-cooperative 𝑛-person games,” Journal of Science of the Hiroshima University, Series A-I (Mathematics), vol. 28, no. 1, pp. 95–99, 1964.
[Fin64] A. M. Fink, “Equilibrium in a stochastic 𝑛-person game,” Journal of Science of the Hiroshima University, Series A-I (Mathematics), vol. 28, no. 1, pp. 89–93, 1964.
Changelog
• Nov 14, 2024: Fix two typos.
These notes are class material that has not undergone formal peer review. The TA and I are grateful for any
reports of typos.
