#3 Rock Paper Scissors: Reading 1

Rock defeats scissors, scissors defeats paper, and paper defeats rock. You get 1 point for winning, -1 point for losing, and 0 points for ties.

The goal of this challenge is to focus on tracking opponent distributions and how to respond to them.

Game Theory

RPS is a zero-sum game and the payouts are symmetrical as follows:

| Player 1/2 | Rock | Paper | Scissors |
| --- | --- | --- | --- |
| Rock | (0, 0) | (-1, 1) | (1, -1) |
| Paper | (1, -1) | (0, 0) | (-1, 1) |
| Scissors | (-1, 1) | (1, -1) | (0, 0) |

The Nash Equilibrium strategy is to play each action \(r = p = s = 1/3\) of the time.

If Player 1 plays Rock with probability \(r\), Paper with probability \(p\), and Scissors with probability \(s\), we have the following expected value equations for Player 2:

\(\mathbb{E}(\text{R}) = -1*p + 1*s\)

\(\mathbb{E}(\text{P}) = 1*r - 1*s\)

\(\mathbb{E}(\text{S}) = -1*r + 1*p\)

Since no single action dominates, every action in the equilibrium mix must have the same expected value; otherwise a player could gain by shifting probability toward the higher-EV action.

\(\mathbb{E}(\text{R}) = \mathbb{E}(\text{P})\)

\(-1*p + 1*s = 1*r - 1*s\)

\(2*s = p + r\)

\(\mathbb{E}(\text{R}) = \mathbb{E}(\text{S})\)

\(-1*p + 1*s = -1*r + 1*p\)

\(s + r = 2*p\)

\(\mathbb{E}(\text{P}) = \mathbb{E}(\text{S})\)

\(1*r - 1*s = -1*r + 1*p\)

\(2*r = s + p\)

Solving the equations:

\(r = 2*s - p\)

\(s + (2*s - p) = 2*p\)

\(3*s = 3*p\)

\(s = p\)

\(s + r = 2*p\)

\(s + r = 2*s\)

\(r = s\)

We know that all are equal:

\(s = p = r\)

We also know that they must all sum to \(1\):

\(r + p + s = 1\)

Since they’re all equal and sum to \(1\), we can substitute \(p\) and \(s\) with \(r\):

\(3*r = 1\)

\(r = 1/3\)

So all actions are taken with probability \(1/3\):

\(r = p = s = 1/3\)

Playing this strategy means that whatever your opponent does, you will break even in expectation! You can see this interactively below:
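
As a quick check, here is a minimal sketch (the PAYOFF table and expected_value function are illustrative names, not part of the handout) showing that the uniform strategy breaks even against any opponent distribution:

    # Payoff to Player 1, from the table above: PAYOFF[(my_action, opp_action)]
    PAYOFF = {
        ("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
        ("P", "R"): 1, ("P", "P"): 0, ("P", "S"): -1,
        ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0,
    }

    def expected_value(my_strategy, opp_strategy):
        """Expected payoff to Player 1 when both players use the given mixed strategies."""
        return sum(my_strategy[a] * opp_strategy[b] * PAYOFF[(a, b)]
                   for a in "RPS" for b in "RPS")

    uniform = {"R": 1/3, "P": 1/3, "S": 1/3}
    print(expected_value(uniform, {"R": 1.0, "P": 0.0, "S": 0.0}))  # always Rock -> 0
    print(expected_value(uniform, {"R": 0.4, "P": 0.3, "S": 0.3}))  # 40% Rock -> 0 (up to float error)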

Types of Opponent Adaptation

  1. Offline: In Kuhn Poker, we discussed building an agent based on weights that plays optimally against a specific fixed opponent or group of opponents.

Best Strategy in RPS Against Fixed Opponent

How would you maximize in RPS if you know that the opponent plays a specific fixed non-Nash strategy? You can try this with the interactive above using the 40% Rock option.

If we knew that our opponent played Rock 100% of the time, we should play Paper 100% of the time.

In general, if our opponent plays a fixed mixed strategy, there is always a pure-strategy best response: the action with the highest expected value against their distribution. This is often, but not always, the move that beats their most-played action. A minimal sketch of computing this best response is below.
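
As a rough sketch (the best_response function is illustrative, not part of the handout solver), we can compute the pure best response from the expected value equations above:

    def best_response(r, p, s):
        """Pure best response when the opponent plays Rock, Paper, Scissors
        with probabilities r, p, s."""
        evs = {
            "Rock": s - p,      # beats Scissors, loses to Paper
            "Paper": r - s,     # beats Rock, loses to Scissors
            "Scissors": p - r,  # beats Paper, loses to Rock
        }
        action = max(evs, key=evs.get)
        return action, evs[action]

    print(best_response(1.0, 0.0, 0.0))    # ('Paper', 1.0) against an always-Rock opponent
    print(best_response(0.4, 0.3, 0.3))    # ('Paper', ~0.1) against the 40% Rock opponent
    print(best_response(0.4, 0.21, 0.39))  # ('Rock', ~0.18): the best response is not always
                                           # the counter to the opponent's most-played action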

  2. Online: Now in this challenge we want to think about online opponent adaptation. Suppose that we don’t know how our opponent is going to play, but want to build an agent that adapts to them in real time as the match is going on.

Counterfactual Regret Minimization

As we think about solving larger games, we start to look at iterative algorithms.

The most popular method for iteratively solving poker games is the Counterfactual Regret Minimization (CFR) algorithm. CFR is an iterative algorithm developed in 2007 at the University of Alberta that converges to Nash equilibrium in two player zero-sum games.

(Note: The handout solver does not exactly use CFR. You can make updates to the solver however you would like, including modifying it to become CFR.)

What is a counterfactual? Here’s an example:

Actual event: I didn’t bring an umbrella, and I got wet in the rain

Counterfactual event: If I had brought an umbrella, I wouldn’t have gotten wet

Regret and Strategies

A strategy at an infoset is a probability distribution over each possible action.

Regret is a measure of how much each strategy at an infoset is preferred and is used as a way to update strategies.

For a given P1 strategy and P2 strategy, a player has regret when they take an action at an infoset that was not the highest-EV action at that infoset.

Regret Exercise

Suppose the values of actions A, B, and C at this node are 1, 3, and 5, respectively (the same values used in the exercises below). What is the regret for each action, measured against the best action?

Answer:

| Action | Regret |
| --- | --- |
| A | 4 |
| B | 2 |
| C | 0 |

Expected Value Exercise

If we take a uniform strategy at this node (i.e. \(\frac{1}{3}\) for each action), what is the expected value of the node?

\(\mathbb{E} = \frac{1}{3}*1 + \frac{1}{3}*3+\frac{1}{3}*5 = 0.33+1+1.67 = 3\)

Poker Regret Exercise

In poker games, the regret for each action is defined as the value for that action minus the expected value of the node. Give the regret values for each action under this definition.

| Action | Value | Poker Regret |
| --- | --- | --- |
| A | 1 | -2 |
| B | 3 | 0 |
| C | 5 | 2 |

At a node in a poker game, the player prefers actions with higher regrets by this definition.

Each infoset maintains tabular counters for each action: one accumulating the sum of all strategies played and one accumulating the sum of all regrets.

In a game like Rock Paper Scissors, there is effectively only one infoset, so there is just one strategy table over the actions (Rock, Paper, Scissors) and one regret table over the same actions.

Regrets are linked to strategies through a policy called regret matching.

Regret Matching

In general, we define regret as:

\(\text{Regret} = u(\text{Alternative Strategy}) − u(\text{Current Strategy})\)

We prefer alternative actions with high regret and wish to minimize our overall regret.

We play Rock and opponent plays Paper \(\implies \text{u(rock,paper)} = -1\)

\(\text{Regret(scissors)} = \text{u(scissors,paper)} - \text{u(rock,paper)} = 1-(-1) = 2\)

\(\text{Regret(paper)} = \text{u(paper,paper)} - \text{u(rock,paper)} = 0-(-1) = 1\)

\(\text{Regret(rock)} = \text{u(rock,paper)} - \text{u(rock,paper)} = -1-(-1) = 0\)

We play Scissors and opponent plays Paper \(\implies \text{u(scissors,paper)} = 1\)

\(\text{Regret(scissors)} = \text{u(scissors,paper)} - \text{u(scissors,paper)} = 1-1 = 0\)

\(\text{Regret(paper)} = \text{u(paper,paper)} - \text{u(scissors,paper)} = 0-1 = -1\)

\(\text{Regret(rock)} = \text{u(rock,paper)} - \text{u(scissors,paper)} = -1-1 = -2\)

We play Paper and opponent plays Paper \(\implies \text{u(paper,paper)} = 0\)

\(\text{Regret(scissors)} = \text{u(scissors,paper)} - \text{u(paper,paper)} = 1-0 = 1\)

\(\text{Regret(paper)} = \text{u(paper,paper)} - \text{u(paper,paper)} = 0-0 = 0\)

\(\text{Regret(rock)} = \text{u(rock,paper)} - \text{u(paper,paper)} = -1-0 = -1\)

To generalize:

  • The action played always gets a regret of 0 since the “alternative” is really just that same action
  • When we play a tying action, the alternative losing action gets a regret of -1 and the alternative winning action gets a regret of +1
  • When we play a winning action, the alternative tying action gets a regret of -1 and the alternative losing action gets a regret of -2
  • When we play a losing action, the alternative winning action gets a regret of +2 and the alternative tying action gets a regret of +1

After each play, we accumulate regrets for each of the 3 actions.

We decide our strategy probability distribution using regret matching, which means playing a strategy that normalizes over the positive accumulated regrets, i.e. playing in proportion to the positive regrets.

Example from Marc Lanctot’s CFR Tutorial:

  • Game 1: Choose Rock and opponent chooses Paper

    • Lose 1
    • Rock: Regret 0
    • Paper: Regret 1
    • Scissors: Regret 2
  • Next Action: Proportional \[ \begin{pmatrix} \text{Rock} & 0/3 = 0 \\ \text{Paper} & 1/3 = 0.333 \\ \text{Scissors} & 2/3 = 0.667 \end{pmatrix} \]

  • Game 2: Choose Scissors (With probability \(2/3\)) and opponent chooses Rock

    • Lose 1
    • Rock: Regret 1
    • Paper: Regret 2
    • Scissors: Regret 0
  • Cumulative regrets:

    • Rock: 1
    • Paper: 3
    • Scissors: 2
  • Next Action: Proportional \[ \begin{pmatrix} \text{Rock} & 1/6 = 0.167 \\ \text{Paper} & 3/6 = 0.500 \\ \text{Scissors} & 2/6 = 0.333 \end{pmatrix} \] (a short code sketch of this bookkeeping follows)
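
Here is a short sketch reproducing the bookkeeping in this example (the helper names are illustrative, not from the handout solver):

    ACTIONS = ["Rock", "Paper", "Scissors"]
    BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}

    def utility(mine, theirs):
        if mine == theirs:
            return 0
        return 1 if BEATS[mine] == theirs else -1

    def regret_match(cum_regret):
        # Normalize over the positive cumulative regrets (uniform if none are positive)
        positive = {a: max(r, 0) for a, r in cum_regret.items()}
        total = sum(positive.values())
        if total > 0:
            return {a: positive[a] / total for a in ACTIONS}
        return {a: 1 / len(ACTIONS) for a in ACTIONS}

    cum_regret = {a: 0 for a in ACTIONS}
    for mine, theirs in [("Rock", "Paper"), ("Scissors", "Rock")]:  # Games 1 and 2
        for a in ACTIONS:
            cum_regret[a] += utility(a, theirs) - utility(mine, theirs)
        print(cum_regret, regret_match(cum_regret))
    # After game 1: regrets {Rock: 0, Paper: 1, Scissors: 2} -> strategy (0, 1/3, 2/3)
    # After game 2: regrets {Rock: 1, Paper: 3, Scissors: 2} -> strategy (1/6, 3/6, 2/6)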

Regret matching definitions:

  • \(a\) is actions
  • \(\sigma\) is strategy
  • \(t\) is time
  • \(i\) is player
  • \(R\) is cumulative regret

\[ \sigma_i^t(a) = \begin{cases} \frac{\max(R_i^t(a), 0)}{\sum_{a' \in A} \max(R_i^t(a'), 0)} & \text{if } \sum_{a' \in A} \max(R_i^t(a'), 0) > 0 \\ \frac{1}{|A|} & \text{otherwise} \end{cases} \]

This says that, for each action, we take its positive cumulative regret, divide by the sum of positive cumulative regrets over all actions (normalizing), and play the result as that action's probability on the next iteration.

If all cumulative regrets are \(\leq 0\) then we use the uniform distribution.

If some cumulative regrets are positive but the cumulative regret for a specific action is \(<0\), then we assign probability \(0\) to that action.

In code (as a method on an agent that stores regret_sum, strategy, and strategy_sum arrays of length NUM_ACTIONS):

    def get_strategy(self):
        # First find the normalizing sum over the positive regrets
        normalizing_sum = 0
        for a in range(NUM_ACTIONS):
            if self.regret_sum[a] > 0:
                self.strategy[a] = self.regret_sum[a]
            else:
                self.strategy[a] = 0
            normalizing_sum += self.strategy[a]

        # Then normalize each action (or fall back to uniform) and
        # accumulate the strategy sum used for the average strategy
        for a in range(NUM_ACTIONS):
            if normalizing_sum > 0:
                self.strategy[a] /= normalizing_sum
            else:
                self.strategy[a] = 1.0/NUM_ACTIONS
            self.strategy_sum[a] += self.strategy[a]

        return self.strategy

After running regret matching for many iterations, we use the average strategy over all iterations; it is this average strategy (not the current strategy at any single iteration) that minimizes expected regret and converges to equilibrium.

If two players were training against each other using regret matching, they would converge to the Nash Equilibrium of \(1/3\) for each action using the average strategy in Rock Paper Scissors.
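
Putting these pieces together, here is a minimal self-play sketch (not the handout solver; the class and helper names are just for illustration) that reuses the regret matching logic above and reads out the average strategy:

    import random

    NUM_ACTIONS = 3  # Rock, Paper, Scissors
    # UTILITY[a][b]: payoff to the player choosing action a against an opponent choosing b
    UTILITY = [[0, -1, 1],
               [1, 0, -1],
               [-1, 1, 0]]

    class RegretMatcher:
        def __init__(self):
            self.regret_sum = [0.0] * NUM_ACTIONS
            self.strategy = [0.0] * NUM_ACTIONS
            self.strategy_sum = [0.0] * NUM_ACTIONS

        def get_strategy(self):
            # Same regret matching logic as the get_strategy method above
            normalizing_sum = 0.0
            for a in range(NUM_ACTIONS):
                self.strategy[a] = max(self.regret_sum[a], 0.0)
                normalizing_sum += self.strategy[a]
            for a in range(NUM_ACTIONS):
                if normalizing_sum > 0:
                    self.strategy[a] /= normalizing_sum
                else:
                    self.strategy[a] = 1.0 / NUM_ACTIONS
                self.strategy_sum[a] += self.strategy[a]
            return self.strategy

        def get_average_strategy(self):
            # The average strategy over all iterations is what converges to equilibrium
            total = sum(self.strategy_sum)
            if total > 0:
                return [s / total for s in self.strategy_sum]
            return [1.0 / NUM_ACTIONS] * NUM_ACTIONS

    def sample(strategy):
        return random.choices(range(NUM_ACTIONS), weights=strategy)[0]

    p1, p2 = RegretMatcher(), RegretMatcher()
    for _ in range(10000):
        a1, a2 = sample(p1.get_strategy()), sample(p2.get_strategy())
        # Accumulate regrets against the action the opponent actually played
        for a in range(NUM_ACTIONS):
            p1.regret_sum[a] += UTILITY[a][a2] - UTILITY[a1][a2]
            p2.regret_sum[a] += UTILITY[a][a1] - UTILITY[a2][a1]

    print(p1.get_average_strategy())  # both approach roughly (1/3, 1/3, 1/3)
    print(p2.get_average_strategy())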

RPS Regret Matching Experiment

Here we show that, over 10,000 iterations, regret matching converges only when using the average strategy:

The bottom shows both players converging to \(1/3\), while the top shows Player 1’s volatile current strategies that are cycling around.

Suppose that your opponent Player 2 is playing 40% Rock, 30% Paper, and 30% Scissors. Here is a 10,000-game regret matching experiment. It shows that it takes around 1,600 games before Player 1 plays only Paper (this number will vary from run to run).

We see that against a fixed opponent, regret matching converges to the best response.

But what if your opponent is not using a fixed strategy? We’ll talk about that soon.

Iterating through the Tree

The core feature of the iterative algorithms is self-play by traversing the game tree over all infosets and tracking the strategies and regrets at each.

From above, we know how to find the strategy and regret in the simple Rock Paper Scissors environment.

In poker:

  • Strategies are determined the same as above, through regret matching from the previous regret values at the specific information set for each action

  • CFR definitions:

    • \(a\) is actions
    • \(I\) is infoset
    • \(\sigma\) is strategy
    • \(t\) is time
    • \(i\) is player
    • \(R\) is cumulative regret
    • \(z\) is a terminal node
    • \(u\) is utility (payoffs)
    • \(p\) is the current player who plays at this node
    • \(-p\) is the opponent player and chance
    • \(v\) is counterfactual value
  • Counterfactual values are effectively the value of an information set. They are weighted by the probability of the opponent and chance playing to this node (in other words, the probability that play reaches this node if the current player always tried to reach it).

    • Counterfactual value: \(v^\sigma (I) = \sum_{z\in Z_I} \pi^{\sigma}_{-p}(z[I])\pi^{\sigma}(z[I] \rightarrow z)u_p(z)\)

    • \(\sum_{z\in Z_I}\) is summing over all terminal histories reachable from this node

    • \(\pi^{\sigma}_{-p}(z[I])\) is the probability of opponents and chance reaching this node

    • \(\pi^{\sigma}(z[I] \rightarrow z)\) is the probability of playing from this node to terminal history \(z\), i.e. the weight component of the expected value

    • \(u_p(z)\) is the utility at terminal history \(z\), i.e. the value component of the expected value

  • Instantaneous regrets compare each action’s counterfactual value to the counterfactual value of the infoset. Each instantaneous regret is then added to that action’s cumulative regret counter (see the sketch after this list):

    • \(r^t(I,a) = v^{\sigma^t}(I,a) - v^{\sigma^t}(I)\)
  • Cumulative (counterfactual) regrets are the sum of the individual regrets:

    • \(R^T(I,a) = \sum_{t=1}^T r^t(I,a)\)
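
In RPS there is a single infoset per player and no chance nodes, so the counterfactual value of an action reduces to its expected value against the opponent’s strategy (the opponent’s reach probabilities are just their strategy probabilities). Here is a small sketch of \(v(I,a)\), \(v(I)\), and the instantaneous regrets (illustrative names, not the handout solver):

    # UTILITY[a][b]: payoff for playing action a against an opponent playing b
    UTILITY = [[0, -1, 1],
               [1, 0, -1],
               [-1, 1, 0]]

    def counterfactual_values(our_strategy, opp_strategy):
        # v(I, a): value of committing to action a, weighted by the opponent's strategy
        v_action = [sum(opp_strategy[b] * UTILITY[a][b] for b in range(3))
                    for a in range(3)]
        # v(I): value of the infoset under our current strategy
        v_infoset = sum(our_strategy[a] * v_action[a] for a in range(3))
        # Instantaneous regrets: r(I, a) = v(I, a) - v(I)
        regrets = [v_action[a] - v_infoset for a in range(3)]
        return v_action, v_infoset, regrets

    print(counterfactual_values([1, 0, 0], [0, 1, 0]))
    # Pure Rock into pure Paper: v_action = [-1, 0, 1], v_infoset = -1,
    # regrets = [0, 1, 2], matching the Rock-vs-Paper regrets computed earlier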

Alternatives/Updates to Original Algorithm

  • CFR+: a variation in which cumulative regrets are floored at \(0\) (negative cumulative regret is reset to \(0\))

  • Linear CFR: a variation in which regret (and average strategy) contributions are weighted by their iteration number, so more recent iterations count more (see the sketch after this list)

  • Sampling

    • External: Sample chance and opponent actions
    • Chance: Sample chance nodes only
    • Outcome: Sample a single playout (chance and both players’ actions)
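
Here is a minimal sketch contrasting the three regret-update rules for a single action (this is not the handout solver, just an illustration of the update formulas):

    def vanilla_update(cum_regret, instant_regret, t):
        # Vanilla CFR: simply accumulate instantaneous regrets
        return cum_regret + instant_regret

    def cfr_plus_update(cum_regret, instant_regret, t):
        # CFR+: floor the cumulative regret at zero
        return max(cum_regret + instant_regret, 0.0)

    def linear_cfr_update(cum_regret, instant_regret, t):
        # Linear CFR: weight iteration t's regret by t, so recent iterations count more
        return cum_regret + t * instant_regret

    regrets = [-2, -1, 3]  # the same stream of instantaneous regrets under each rule
    for update in (vanilla_update, cfr_plus_update, linear_cfr_update):
        cum = 0.0
        for t, r in enumerate(regrets, start=1):
            cum = update(cum, r, t)
        print(update.__name__, cum)  # vanilla_update 0.0, cfr_plus_update 3.0, linear_cfr_update 5.0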

Data: Talk Paper Scissors

eieio games made Talk Paper Scissors, a Rock Paper Scissors over voice game in which players call a phone number and get matched with another player for a 3-game RPS match.

They published their 40,000-round data on X:

| | Rock | Paper | Scissors |
| --- | --- | --- | --- |
| Overall | 37.2% | 35.4% | 27.4% |
| Round 1 | 39.7% | 37.6% | 22.7% |
| Round 2 | 34.0% | 33.4% | 32.6% |
| Round 3 | 37.2% | 34.7% | 28.1% |

Expected Value Against TPS Player

What is the best strategy per round against the average TPS player? What is your expected value per round and overall?

The best strategy is to always play Paper.

\(\mathbb{E}(\text{Round 1}) = 0.397*1 + 0.376*0 + 0.227*-1 = 0.17\)

\(\mathbb{E}(\text{Round 2}) = 0.34*1 + 0.334*0 + 0.326*-1 = 0.014\)

\(\mathbb{E}(\text{Round 3}) = 0.372*1 + 0.347*0 + 0.281*-1 = 0.091\)

\(\mathbb{E}(\text{Overall}) = 0.17 + 0.014 + 0.091 = 0.275\)
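
As a quick arithmetic check (the dictionary below just restates the TPS frequencies from above):

    # Per-round EV of always playing Paper against the TPS frequencies
    rounds = {
        "Round 1": (0.397, 0.376, 0.227),  # (Rock, Paper, Scissors)
        "Round 2": (0.340, 0.334, 0.326),
        "Round 3": (0.372, 0.347, 0.281),
    }
    total = 0.0
    for name, (r, p, s) in rounds.items():
        ev_paper = r - s  # Paper beats Rock (+1), ties Paper (0), loses to Scissors (-1)
        total += ev_paper
        print(name, round(ev_paper, 3))  # 0.17, 0.014, 0.091
    print("Overall", round(total, 3))    # 0.275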

More Exercises

Maximize Against non-Nash Fixed Opponent

How would you maximize in RPS knowing that the opponent plays a fixed non-Nash strategy, but without knowing what that strategy is?

One option is to play the equilibrium strategy until you get a significant sample on your opponent and then to exploit their strategy going forward.

Strategy Against No-Rock Opponent

What is the optimal play if your opponent can’t play Rock?

| Player 1/2 | Paper | Scissors |
| --- | --- | --- |
| Rock | (-1, 1) | (1, -1) |
| Paper | (0, 0) | (-1, 1) |
| Scissors | (1, -1) | (0, 0) |

We can see that Player 1 playing Paper is dominated by Scissors, so Player 1 should never play Paper.

| Player 1/2 | Paper | Scissors |
| --- | --- | --- |
| Rock | (-1, 1) | (1, -1) |
| Scissors | (1, -1) | (0, 0) |

In the reduced game, we see that if Player 2 plays Paper with probability \(p\) and Scissors with probability \(s\), then:

\(\mathbb{E}(\text{P1 R}) = -1*p + 1*s = -p + s\)

\(\mathbb{E}(\text{P1 S}) = 1*p + 0*s = p\)

Setting these equal, \(-p + s = p \Rightarrow s = 2p\).

We also know that \(s + p = 1\).

Therefore \(s = 1 - p\) and \(1 - p = 2p \Rightarrow 1 = 3p \Rightarrow p = 1/3\).

Therefore, \(s = 1 - 1/3 = 2/3\).

For Player 2, we have \(s = 2/3\) and \(p = 1/3\).

For Player 1, we can solve similarly. Let Player 1 play Rock with probability \(r\) and Scissors with probability \(s\). Then Player 2's expected values are:

\(\mathbb{E}(\text{P2 P}) = 1*r - 1*s = r - s\)

\(\mathbb{E}(\text{P2 S}) = -1*r + 0*s = -r\)

\(r - s = -r \Rightarrow 2r = s\)

We also know that \(r + s = 1\).

Therefore \(s = 1 - r\) and \(1 - r = 2r \Rightarrow 1 = 3r \Rightarrow r = 1/3\).

Therefore, \(s = 1 - 1/3 = 2/3\).

For Player 1, we have \(r = 1/3\) and \(s = 2/3\).

Inserting these probabilities, we have:

| Player 1/2 | Paper (1/3) | Scissors (2/3) |
| --- | --- | --- |
| Rock (1/3) | (-1, 1) with probability 1/9 | (1, -1) with probability 2/9 |
| Scissors (2/3) | (1, -1) with probability 2/9 | (0, 0) with probability 4/9 |

Therefore Player 1 has an expected payoff of \(1/9 * -1 + 2/9 * 1 + 2/9 * 1 + 4/9 * 0 = 3/9 = 1/3\). The player that can still play Rock therefore has an advantage of \(1/3\) per game at equilibrium.
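
As a quick numeric check of this value (the probabilities and payoffs are taken from the tables above):

    # P1 mixes Rock/Scissors at (1/3, 2/3); P2 mixes Paper/Scissors at (1/3, 2/3)
    p1 = {"Rock": 1/3, "Scissors": 2/3}
    p2 = {"Paper": 1/3, "Scissors": 2/3}
    payoff_p1 = {("Rock", "Paper"): -1, ("Rock", "Scissors"): 1,
                 ("Scissors", "Paper"): 1, ("Scissors", "Scissors"): 0}

    value = sum(p1[a] * p2[b] * payoff_p1[(a, b)] for a in p1 for b in p2)
    print(value)  # ~0.333: the player who can still play Rock wins 1/3 per game in expectation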

Maximize Against Adapting Rock Opponent

  1. Suppose that your opponent gets a card with probability \(X\) such that \(X\%\) of the time they are forced to play Rock. How would you maximize against them?
  2. What if the opponent is adapting to you, but 10% of the time they are forced to play Rock?

Soon

Skewed Rock Payoff

What is the equilibrium strategy if the payoff for Rock over Scissors is 2 (others stay the same)?

Soon