#1: Kuhn Poker | Reading

Hunting Wabbits Game

(This game is modified from Lisy et al. 2015.)

Bugs and Fudd

We developed this game to show concepts in games with imperfect information that are applicable to our first challenge. Imperfect information games involve hidden information, such as an opponent's cards in poker or the enemy's location in the Hunting Wabbits game. In perfect information games like chess, all information is available to both players.

This game has one chance player (the Author), one maximizing player (Fudd) and one minimizing player (Bugs). Fudd and Bugs are in a long-term, all-out war, and so any energy that Fudd wastes or saves is a gain or loss for Bugs; our game is zero-sum.

First, the Author will choose whether to write an episode where Bugs is at the Opera (50%) or in the Forest (50%). Fudd cannot tell what the Author has chosen.

Next, Fudd will decide whether to Hunt_Near or Hunt_Far. If he chooses Hunt_Far, he takes -1 value for the extra effort.

Bugs knows whether the Author wrote him into the Opera or the Forest, but he does not know Fudd’s hunting location. If the Author gives him the Opera, Bugs has no choices to make and the game ends after Fudd’s action. If Forest, Bugs will decide whether to Play_Near or Play_Far. If he plays in the same location where Fudd is hunting, Fudd gets +3 value for a successful hunt (for a total of +2 under Hunt_Far due to the -1 value for the extra effort). If they are in different locations (or if Bugs is at the Opera), Fudd gets 0 value for an unsuccessful hunt (for a total of -1 if a Hunt_Far misses).

Putting it all together, the payoff structure of this game is:

Author    Bugs \ Fudd    Hunt_Near    Hunt_Far
Opera     (no choice)    0, 0         +1, -1
Forest    Play_Near      -3, +3       +1, -1
Forest    Play_Far       0, 0         -2, +2

Payoffs in each cell are written as (Bugs, Fudd).
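If you want to experiment with these numbers, here is a minimal Python sketch of the payoff structure; the function name and the string encoding of actions are our own choices, and payoffs are returned from Fudd's perspective (Bugs's values are the negatives):

```python
# Fudd's payoff for one play of Hunting Wabbits; Bugs's payoff is the negative.
def fudd_payoff(author, fudd_action, bugs_action=None):
    hunt_cost = -1 if fudd_action == "Hunt_Far" else 0
    if author == "Opera":
        return hunt_cost  # Bugs never shows up, so the hunt cannot succeed
    # Forest: the hunt succeeds only if Bugs plays where Fudd is hunting.
    same_place = (fudd_action == "Hunt_Near") == (bugs_action == "Play_Near")
    return (3 if same_place else 0) + hunt_cost

# Reproduce Fudd's side of the table above.
print(fudd_payoff("Opera", "Hunt_Near"), fudd_payoff("Opera", "Hunt_Far"))  # 0 -1
print(fudd_payoff("Forest", "Hunt_Near", "Play_Near"),
      fudd_payoff("Forest", "Hunt_Far", "Play_Near"))  # 3 -1
print(fudd_payoff("Forest", "Hunt_Near", "Play_Far"),
      fudd_payoff("Forest", "Hunt_Far", "Play_Far"))  # 0 2
```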

The tree structure is as follows with the payoffs written from the perspective of Fudd:

If Fudd knew he was at the Opera, he would prefer Hunt_Near to get a value of \(0\) instead of \(-1\) for Hunt_Far, but since he doesn’t know his location, he must take both scenarios into account.

Note that Bugs’s optimal actions depend on Fudd’s payoffs in the Opera scenario, even though that outcome can no longer be reached by the time Bugs acts, since Bugs only plays in the Forest! For example, if we kept the same game tree except that Fudd had a \(+100\) Opera Hunt_Far payoff, then he would always Hunt_Far. Bugs would see this, and it would affect how Bugs plays in the Forest scenario.

Exercise

How would Bugs play in the Forest scenario knowing that Fudd is always playing Hunt_Far?

Bugs would always Play_Near because then his payoff in the only scenario he can control would be \(+1\), whereas Play_Far would give him a payoff of \(-2\). (The payoffs are all written from the perspective of Fudd, so Bugs’s are the opposites.)

Concept: Information Set

In a perfect information game, we can draw a tree of the possible states the game can be in. Every time a player is called to take an action, they will know what node they are at, and what node they will go to with each legal action.

In an imperfect information game, we can still draw that tree, but now a player might be called to take an action without knowing what node they are actually at. Instead, there will be a set of one or more nodes that are indistinguishable to that player based on what they have seen so far, and they will have to take an action knowing only that they are in that set. Such a set of nodes is an information set or an infoset.

An infoset is identified by the player to act and by what that player has observed so far. For example, [Fudd] is an infoset that contains the nodes [Fudd, Forest] and [Fudd, Opera]. (It is labeled just [Fudd] because no other actions or information have been revealed to Fudd at this point.)

A player strategy is a rule that says, for every information set that player might face, what action (or random choice of actions) that player will take. For a game like Hunting Wabbits or Kuhn Poker (the Challenge 1 game), we can list every information set and the action probabilities at each one. For a more complicated game, we might write our strategy as a computation that outputs probabilities based on inputs and an algorithm.
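As a concrete (and purely illustrative) sketch, a strategy can be stored as a mapping from infoset labels to action probabilities; the bracketed labels follow the notation above, and the probabilities below are arbitrary placeholders:

```python
# A strategy maps each infoset a player can face to a distribution over actions.
fudd_strategy = {
    "[Fudd]": {"Hunt_Near": 0.5, "Hunt_Far": 0.5},
}
bugs_strategy = {
    "[Bugs, Forest]": {"Play_Near": 0.5, "Play_Far": 0.5},
}

# Sanity check: the action probabilities at every infoset sum to 1.
for strategy in (fudd_strategy, bugs_strategy):
    for infoset, probs in strategy.items():
        assert abs(sum(probs.values()) - 1.0) < 1e-9, infoset
```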

Concept: Expected Value

Once we’ve assigned definite values to the ultimate outcomes, the expected value, or EV, of a situation is the sum of the value of each possible outcome weighted by the probability of that outcome.
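In symbols, if a situation can resolve into outcomes \(o_1, \dots, o_n\) with probabilities \(\Pr(o_i)\) and values \(u(o_i)\), then

\[ \mathbb{E} = \sum_{i=1}^{n} \Pr(o_i) * u(o_i) \]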

Exercise

Suppose that Bugs plays uniform \(0.5\) Play_Near and \(0.5\) Play_Far.

  1. What is the value of each Bugs node and what should Fudd do if he knew Bugs’s actions? (Recall that the payoff values are from the perspective of Fudd and Bugs uses opposite values.)

  2. What is Fudd’s expected value in this case?

  1. Bugs’s node values are:

\[ \begin{align*} \mathbb{E}(\text{Left Node}) &= 0.5*(-3) + 0.5*0 = -1.5 \\ \mathbb{E}(\text{Right Node}) &= 0.5*1 + 0.5*(-2) = -0.5 \end{align*} \]

The values for Fudd are the opposites, so Fudd prefers the Left Node, which is worth \(1.5\) to him (versus \(0.5\) for the Right Node). This means that if Fudd is in the Forest, he prefers to Hunt_Near.

  2. We computed that Fudd prefers Hunt_Near in the Forest scenario, and we can see on the game tree that he also prefers Hunt_Near in the Opera scenario, so we already know that he will always choose Hunt_Near. Fudd then has the following expected value:

\[ \begin{split} \mathbb{E}(\text{Hunt\_Near}) &= \Pr(\text{Opera})*u(\text{Hunt\_Near} \mid \text{Opera}) + \Pr(\text{Forest})*u(\text{Hunt\_Near} \mid \text{Forest}) \\ &= 0.5*0 + 0.5*1.5 \\ &= 0.75 \end{split} \]
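As a quick standalone check of that arithmetic (the variable names are ours and the payoff numbers come straight from the table):

```python
# Fudd's EV of Hunt_Near when Bugs plays Play_Near and Play_Far uniformly.
p_opera, p_forest = 0.5, 0.5
bugs_play_near = 0.5

ev_forest_node = bugs_play_near * 3 + (1 - bugs_play_near) * 0  # Fudd's value of the Forest node
ev_hunt_near = p_opera * 0 + p_forest * ev_forest_node
print(ev_forest_node, ev_hunt_near)  # 1.5 0.75
```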

In imperfect information games, we consider probabilities over two different sources of uncertainty, after assuming a particular P1 strategy and P2 strategy:

  1. Uncertainty about which node we are actually in, given that we know only that we’re in one of multiple nodes we can’t tell apart. The probability of being in node 1, node 2, … of an information set can be calculated from the strategy probabilities higher up in the game tree (together with the probabilities of chance events that have already happened). For example, Fudd doesn’t know if he’s in the Forest or at the Opera at the beginning. (See the short sketch after this list.)
  2. Uncertainty about what will happen after we move to a node further down in the game tree, coming from chance events or from the strategy probabilities of the players’ subsequent actions. For example, after Fudd selects Hunt_Near, there is uncertainty about the outcome since it depends on Bugs’s action.
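Here is a small sketch of the first kind of uncertainty for this game; the node labels and the assumed Fudd strategy probability are ours, purely for illustration:

```python
# Probability of being at each node within an infoset, given what came before.
p_opera, p_forest = 0.5, 0.5  # the Author's (chance) probabilities
p_hunt_near = 0.4             # an assumed Fudd strategy, just for illustration

# Fudd's infoset [Fudd]: only the Author's chance move above it matters.
fudd_node_probs = {"[Fudd, Opera]": p_opera, "[Fudd, Forest]": p_forest}

# Bugs's infoset [Bugs, Forest]: Bugs knows the Author chose Forest, so only
# Fudd's (unobserved) action determines which node Bugs is actually at.
bugs_node_probs = {"after Hunt_Near": p_hunt_near, "after Hunt_Far": 1 - p_hunt_near}

print(fudd_node_probs)
print(bugs_node_probs)
```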

We will focus on zero-sum two-player games, so the value to one player is simply the negative of the value to the other. Therefore, we can represent value in the game as a single number that the maximizing player wishes to make positive and the minimizing player wishes to make negative.

We will focus on maximizing (or minimizing) expected value as our goal for all of session 1. One thing that makes it natural to care about expected value is that it’s usually the best way to predict what your score will be after a very large number of games, whether they are the same game or different from each other.
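To make that concrete, here is a rough Monte Carlo sketch (the helper name and the chosen strategy probabilities are ours): Fudd's average payoff over many plays should approach his expected value, e.g. roughly \(0.75\) when Fudd always plays Hunt_Near and Bugs mixes uniformly, matching the exercise above.

```python
import random

def play_once(p_hunt_near, q_play_near):
    """Play one episode of Hunting Wabbits and return Fudd's payoff."""
    forest = random.random() < 0.5              # the Author's 50/50 choice
    hunt_near = random.random() < p_hunt_near
    payoff = 0 if hunt_near else -1             # Hunt_Far costs 1 in extra effort
    if forest:
        play_near = random.random() < q_play_near
        if hunt_near == play_near:              # Bugs is where Fudd is hunting
            payoff += 3
    return payoff

n = 200_000
avg = sum(play_once(1.0, 0.5) for _ in range(n)) / n
print(avg)  # should land close to 0.75
```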

Concept: Regret

For a given P1 strategy and P2 strategy, a player has regret when they take an action at an infoset that was not the highest-EV action at that infoset. The amount of regret is the difference between the EV of the highest-EV action and the EV of the selected action.

Exercise

What is the regret for each action?

Action    Regret
A
B
C

Action    Regret
A         4
B         2
C         0

There are other things that “regret” can mean in English that are separate from this technical concept:

  • Based on what chance events later happened, I wish I had taken a different action instead.

  • I was wrong about what strategy my opponent was playing, and I wish I had taken a different action instead.

However, we will use “regret” as a technical concept to mean how much worse an action that is not the highest-EV action performs compared to the highest-EV action, given a particular P1 strategy and P2 strategy.
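As a tiny illustration (the EV numbers below are made-up placeholders, not taken from any figure in this reading), regret against a fixed pair of strategies can be computed directly from the action EVs at an infoset:

```python
# Regret of each action = EV of the best action minus EV of that action,
# holding both players' strategies fixed.
def regrets(action_evs):
    best = max(action_evs.values())
    return {action: best - ev for action, ev in action_evs.items()}

# Hypothetical EVs for three actions at some infoset.
print(regrets({"A": 1.0, "B": 3.0, "C": 5.0}))  # {'A': 4.0, 'B': 2.0, 'C': 0.0}
```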

Exercises

Information Set
  1. What are Fudd’s information set(s)?

  2. What are Bugs’s information set(s)?

  1. Both Fudd nodes form a single information set because when Fudd is making the Hunt_Near or Hunt_Far decision, he doesn’t know whether he’s at the Opera or in the Forest, so his information is the same in both nodes. We can label this infoset simply [Fudd].

  2. Both Bugs nodes also form a single information set because although Bugs knows that he’s in the Forest, he doesn’t know which action Fudd has taken, so his information is the same in both nodes. We can label this infoset [Bugs, Forest].

Expected Value

Say that Fudd chooses to Hunt_Near with probability \(p\).

  1. At the Bugs infoset where Bugs knows he’s in the Forest, what is the expected value of choosing to Play_Near?

  2. What is the expected value of choosing to Play_Far?

(Reminder: The payoffs are from the perspective of Fudd and are the opposite for Bugs.)

  1. \(\mathbb{E}(\text{Play\_Near}) = -3*p + 1*(1-p) = -4*p + 1\)

  2. \(\mathbb{E}(\text{Play\_Far}) = 0*p - 2*(1-p) = 2*p - 2\)
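To see how Bugs’s preference flips as \(p\) changes, here is a quick evaluation of the two formulas above (plain arithmetic, nothing assumed beyond those expressions):

```python
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    ev_near = -4 * p + 1  # Bugs's EV of Play_Near
    ev_far = 2 * p - 2    # Bugs's EV of Play_Far
    print(f"p={p:.2f}  Play_Near={ev_near:+.2f}  Play_Far={ev_far:+.2f}")
# Play_Near is better for small p, Play_Far for large p; they tie at p = 0.5.
```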

Expected Value 2

Say that Bugs chooses to Play_Near with probability \(q\).

  1. What is Fudd’s expected value of choosing Hunt_Near at his infoset?

  2. What is Fudd’s EV of choosing Hunt_Far?

  1. \(\mathbb{E}(\text{Hunt\_Near}) = 0.5*0 +0.5*[3*q + 0*(1-q)] = 1.5*q\)

  2. \(\mathbb{E}(\text{Hunt\_Far}) = 0.5*(-1) + 0.5*[-1*q + 2*(1-q)] = -0.5 - 0.5*q + 1 - q = 0.5 - 1.5*q\)

Regret

Find a \(p\) and a \(q\) such that both:

  1. Bugs never chooses an action with regret

  2. Fudd never chooses an action with regret

In order to have no regret, the expected values of both actions should be equal.

  1. For Bugs to never choose an action with regret, EV of Play_Near and EV of Play_Far should be equal.

\(\mathbb{E}(\text{Bugs Play\_Near}) = -4*p + 1\)

\(\mathbb{E}(\text{Bugs Play\_Far}) = 2*p - 2\)

Setting equal, we have \(-4*p + 1 = 2*p - 2 \Rightarrow 3 = 6*p \Rightarrow p = \frac{1}{2} = 0.5\)

\(\Pr(\text{Fudd Hunt\_Near}) = p = 0.5\)

\(\Pr(\text{Fudd Hunt\_Far}) = 1 - p = 0.5\)

  2. For Fudd to never choose an action with regret, EV of Hunt_Near and EV of Hunt_Far should be equal.

\(\mathbb{E}(\text{Fudd Hunt\_Near}) = 1.5*q\)

\(\mathbb{E}(\text{Fudd Hunt\_Far}) = 0.5 - 1.5*q\)

Setting equal, we have \(1.5*q = 0.5 - 1.5*q \Rightarrow 3*q = 0.5 \Rightarrow q = \frac{1}{6} \approx 0.167\)

\(\Pr(\text{Bugs Play\_Near}) = q = \frac{1}{6} \approx 0.167\)

\(\Pr(\text{Bugs Play\_Far}) = 1 - q = \frac{5}{6} \approx 0.833\)
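As a sanity check on these values, here is a short sketch using exact fractions (the formulas are the ones derived above):

```python
from fractions import Fraction

p = Fraction(1, 2)  # Pr(Fudd Hunt_Near)
q = Fraction(1, 6)  # Pr(Bugs Play_Near)

# Bugs's EVs should be equal at p = 1/2.
print(-4 * p + 1, 2 * p - 2)                                    # -1 -1

# Fudd's EVs should be equal at q = 1/6.
print(Fraction(3, 2) * q, Fraction(1, 2) - Fraction(3, 2) * q)  # 1/4 1/4
```

Fudd’s common EV of \(\frac{1}{4}\) also matches the game value computed in the last exercise below.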

Notice that each player is inducing the other player to have no regret (to be indifferent between both actions) by playing at these probabilities; together these strategies form an equilibrium.

We will go deeper into indifference in the next reading.

Game Value

What is the value of the game (i.e., the expected value for Fudd over the entire game) at the equilibrium strategies found in the previous question?

Since the game is zero-sum, we can find Fudd’s expected value and it is equivalent to the game value:

\[\begin{align} \mathbb{E} &= \Pr(\text{Opera}) * \Pr(\text{Hunt\_Near}) * 0 \\ & + \Pr(\text{Opera}) * \Pr(\text{Hunt\_Far}) * (-1) \\ & + \Pr(\text{Forest}) * \Pr(\text{Hunt\_Near}) * \Pr(\text{Play\_Near}) * 3 \\ & + \Pr(\text{Forest}) * \Pr(\text{Hunt\_Near}) * \Pr(\text{Play\_Far}) * 0 \\ & + \Pr(\text{Forest}) * \Pr(\text{Hunt\_Far}) * \Pr(\text{Play\_Near}) * (-1) \\ & + \Pr(\text{Forest}) * \Pr(\text{Hunt\_Far}) * \Pr(\text{Play\_Far}) * 2 \\ \\ &= 0.5 * 0.5 * 0 \\ &+ 0.5 * 0.5 * (-1) \\ &+ 0.5 * 0.5 * \frac{1}{6} * 3 \\ &+ 0.5 * 0.5 * \frac{5}{6} * 0 \\ &+ 0.5 * 0.5 * \frac{1}{6} * (-1) \\ &+ 0.5 * 0.5 * \frac{5}{6} * 2 \\ \\ &= 0 \\ &- 0.25 \\ &+ 0.125 \\ &+ 0 \\ &- \frac{1}{24} \\ &+ \frac{5}{12} \\ \\ &= 0.25 \\ \end{align}\]

This means that if they play repeatedly, we expect Fudd to average a payoff of \(+0.25\), while Bugs has a payoff of \(-0.25\).
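And a quick exact-arithmetic check of that sum, written by us with each line mirroring one term above:

```python
from fractions import Fraction

pr_opera = pr_forest = Fraction(1, 2)
p = Fraction(1, 2)  # Pr(Hunt_Near)
q = Fraction(1, 6)  # Pr(Play_Near)

value = (pr_opera * p * 0
         + pr_opera * (1 - p) * -1
         + pr_forest * p * q * 3
         + pr_forest * p * (1 - q) * 0
         + pr_forest * (1 - p) * q * -1
         + pr_forest * (1 - p) * (1 - q) * 2)
print(value)  # 1/4
```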