Leduc Holdem


Blackjack¶


Blackjack is a globally popular banking game known as Twenty-One. The objective is to beat the dealer by reaching a higher score than the dealer without exceeding 21. In the toolkit, we implement a simple version of Blackjack. In each round, the player has only two options: “hit”, which takes a card, and “stand”, which ends the turn. The player will “bust” if his hand exceeds 21 points. After the player completes his hand (chooses “stand” and has not busted), the dealer then reveals his hidden card and “hits” until obtaining at least 17 points.

State Representation of Blackjack¶

In this toy environment, we encode the state as an array [player_score, dealer_score], where player_score is the score currently obtained by the player and dealer_score is derived from the dealer’s face-up card.
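As an illustrative sketch of reading this state from the toolkit (assuming the standard rlcard API, in which env.reset() returns a (state, player_id) pair; details may differ across versions):

    import rlcard

    # Create the Blackjack environment and grab the first observation.
    env = rlcard.make('blackjack')
    state, player_id = env.reset()

    # state['obs'] holds the [player_score, dealer_score] array
    # described above.
    player_score, dealer_score = state['obs']
    print(player_score, dealer_score)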

Action Encoding of Blackjack¶

There are two actions in the simple Blackjack. They are encoded as follows:

Action ID   Action
0           hit
1           stand

Payoff of Blackjack¶

The player may receive a reward of -1 (lose), 0 (tie), or 1 (win) at the end of the game.

Leduc Hold’em¶


Leduc Hold’em is a smaller version of Limit Texas Hold’em (first introduced in Bayes’ Bluff: Opponent Modeling in Poker). The deck consists of only two copies each of King, Queen and Jack, six cards in total. Each game is fixed with two players, two rounds, a two-bet maximum, and raise amounts of 2 and 4 in the first and second rounds. In the first round, each player puts 1 unit in the pot and is dealt one card, then betting starts. In the second round, one public card is revealed first, then the players bet again. Finally, the player whose hand has the same rank as the public card wins. If neither matches, the player with the higher rank wins. For other rules, such as ‘fold’, refer to Limit Texas Hold’em.

State Representation of Leduc Hold’em¶

The state is encoded as a vector of length 36. The first 3 elements correspond to the hand card. The next 3 elements correspond to the public card. The last 30 elements correspond to the chips of the current player and the opponent (the chips can range from 0 to 14). The correspondence between the index and the card is as below; a small decoding sketch follows the table.

Index   Meaning
0~2     J ~ K in hand
3~5     J ~ K as public card
6~20    0 ~ 14 chips for the current player
21~35   0 ~ 14 chips for the opponent
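As an illustrative sketch, the layout above can be decoded as follows (assuming the chip counts are one-hot encoded; the helper name is ours, not the toolkit's):

    import numpy as np

    def decode_leduc_obs(obs):
        ranks = ['J', 'Q', 'K']
        hand = [ranks[i] for i in range(3) if obs[i] == 1]        # 0~2
        public = [ranks[i] for i in range(3) if obs[3 + i] == 1]  # 3~5
        my_chips = int(np.argmax(obs[6:21]))    # one-hot over 0~14
        opp_chips = int(np.argmax(obs[21:36]))  # one-hot over 0~14
        return hand, public, my_chips, opp_chips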

Action Encoding of Leduc Hold’em¶

The action encoding is the same as in the Limit Hold’em game.

Payoff of Leduc Hold’em¶

The payoff is calculated similarly to that of the Limit Hold’em game.

Limit Texas Hold’em¶

Texas Hold’em is a popular betting game. Each player is dealt two face-down cards, called hole cards. Then 5 community cards are dealt in three stages (the flop, the turn and the river). Each player seeks the best five-card hand from among the hole cards and community cards. There are 4 betting rounds. During each round each player can choose “call”, “check”, “raise”, or “fold”.

In Fixed-Limit Texas Hold’em, each player can only raise by a fixed amount, and in each round the number of raises is limited to 4.

State Representation of Limit Texas Hold’em¶

The state is encoded as a vector of length 72. The first 52 elements represent cards, where each element corresponds to one card. The hand is represented as the two hole cards plus the community cards observed so far. The last 20 elements are the betting history. The correspondence between the index and the card is as below; a small index-to-card sketch follows the table.

Index   Meaning
0~12    Spade A ~ Spade K
13~25   Heart A ~ Heart K
26~38   Diamond A ~ Diamond K
39~51   Club A ~ Club K
52~56   Raise number in round 1
57~61   Raise number in round 2
62~66   Raise number in round 3
67~71   Raise number in round 4
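A small sketch of mapping an index in the card portion (0~51) back to a card, following the table above (the helper is ours):

    SUITS = ['Spade', 'Heart', 'Diamond', 'Club']
    RANKS = ['A', '2', '3', '4', '5', '6', '7',
             '8', '9', 'T', 'J', 'Q', 'K']

    def index_to_card(index):
        suit, rank = divmod(index, 13)   # 13 ranks per suit
        return SUITS[suit] + ' ' + RANKS[rank]

    print(index_to_card(0))   # Spade A
    print(index_to_card(51))  # Club K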

Action Encoding of Limit Texas Hold’em¶

There are 4 actions in Limit Texas Hold’em. They are encoded as below.

Action ID   Action
0           Call
1           Raise
2           Fold
3           Check

Payoff of Limit Texas Hold’em¶

The standard unit used in the literature is milli big blinds per hand (mbb/h). In the toolkit, the reward is calculated in big blinds per hand. For example, a reward of 0.5 (-0.5) means that the player wins (loses) 0.5 times the amount of the big blind.

Dou Dizhu¶

Doudizhu is one of the most popular Chinese card games, with hundreds of millions of players. It is played by three people with one pack of 54 cards, including a red joker and a black joker. After bidding, one player becomes the “landlord”, who gets an extra three cards; the other two become “peasants”, who work together to fight against the landlord. In each round of the game, the starting player must play a card or a combination, and the other two players can decide whether to follow or “pass.” A round is finished if two consecutive players choose “pass.” The player who played the cards with the highest rank will be the first to play in the next round. The objective of the game is to be the first player to get rid of all the cards in hand. For detailed rules, please refer to Wikipedia or Baike.

In the toolkit, we implement a standard version of Doudizhu. In the bidding phase, we heuristically designate one of the players as the “landlord.” Specifically, we count the number of key cards or combinations (high-rank cards and bombs), and the player with the most powerful hand is chosen as the “landlord.”

State Representation of Dou Dizhu¶

At each decision point of the game, the corresponding player is able to observe the current state (or information set in an imperfect information game). The state consists of all the information that the player can observe from his view. We encode the information into a readable Python dictionary. The following table shows the structure of the state; a short snippet for inspecting this dictionary in the toolkit follows the table:

deck
  A string of one pack of 54 cards with Black Joker and Red Joker. Each character represents a card. For conciseness, we use 'T' for '10'.
  Example: 3333444455556666777788889999TTTTJJJJQQQQKKKKAAAA2222BR

seen_cards
  The three face-down cards distributed to the landlord after bidding. These cards are then made public to all players.
  Example: TQA

landlord
  An integer of the landlord's id.
  Example: 0

self
  An integer of the current player's id.
  Example: 2

initial_hand
  All cards the current player owned when the game started. It does not change as cards are played.
  Example: 3456677799TJQKAAB

trace
  A list of tuples recording every action in one game. The first entry of each tuple is a player's id; the second is that player's action.
  Example: [(0, '8222'), (1, 'pass'), (2, 'pass'), (0, '6KKK'), (1, 'pass'), (2, 'pass'), (0, '8'), (1, 'Q')]

played_cards
  The cards that have been played by the three players as the game progresses, sorted from low to high.
  Example: ['6', '8', '8', 'Q', 'K', 'K', 'K', '2', '2', '2']

others_hand
  The union of the other two players' current hands.
  Example: 333444555678899TTTJJJQQAA2R

current_hand
  The current hand of the current player.
  Example: 3456677799TJQKAAB

actions
  The legal actions the current player can take.
  Example: ['pass', 'K', 'A', 'B']
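A hedged sketch of reaching this dictionary through the toolkit (assuming the usual rlcard API; the 'raw_obs' field name may vary by version):

    import rlcard

    env = rlcard.make('doudizhu')
    state, player_id = env.reset()

    raw = state['raw_obs']        # the dictionary described above
    print(raw['current_hand'])    # e.g. '3456677799TJQKAAB'
    print(raw['actions'])         # legal actions for this player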

State Encoding of Dou Dizhu¶

In the Dou Dizhu environment, we encode the state into 6 feature planes. The size of each plane is 5*15, and each entry of a plane can be either 1 or 0. The 5 rows represent holding 0, 1, 2, 3, or 4 copies of the corresponding card, respectively. The 15 columns run from “3” to “RJ” (Red Joker). For example, if we have a single “3”, then the entry (1, 0) would be 1, and the rest of column 0 would be 0. If we have a pair of “4”, then the entry (2, 1) would be 1, and the rest of column 1 would be 0. Note that the current encoding method is just an example of how the features can be encoded. Users are encouraged to encode the state for their own purposes by modifying the extract_state function in rlcard/envs/doudizhu.py. The example encoded planes are as below, with a small encoding sketch after the table:

Plane   Feature
0       the current hand
1       the union of the other two players' hands
2-4     the most recent three actions
5       the union of all played cards
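As a minimal sketch of this plane encoding (the column order 3, 4, ..., K, A, 2, B, R and the helper name are our assumptions):

    import numpy as np

    CARD_ORDER = ['3', '4', '5', '6', '7', '8', '9', 'T',
                  'J', 'Q', 'K', 'A', '2', 'B', 'R']

    def hand_to_plane(hand):
        plane = np.zeros((5, 15), dtype=np.int8)
        for col, card in enumerate(CARD_ORDER):
            plane[hand.count(card), col] = 1   # row = copies held
        return plane

    plane = hand_to_plane('3456677799TJQKAAB')
    assert plane[1, 0] == 1   # exactly one '3', as in the example above
    assert plane[3, 4] == 1   # three copies of '7'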

Action Abstraction of Dou Dizhu¶

The size of the action space of Dou Dizhu is 27472. This number is too large for learning algorithms, so we abstract the original action space down to 309 actions. We note that some recent studies also use similar abstraction techniques. The main idea of the abstraction is to make the kicker fuzzy and only focus on the major part of the combination. For example, “33344” is abstracted as “333 **”. When the predicted action of the agent is not legal, the agent will choose “pass.” Thus, the current environment is simple: once the agent learns how to play legal actions, it can beat random agents. Users can also encode the actions for their own purposes (such as increasing the difficulty of the environment) by modifying the decode_action function in rlcard/envs/doudizhu.py. Users are also encouraged to include rule-based agents as opponents. The abstractions in the environment are listed below, and the detailed mapping of each action to its ID is in rlcard/games/doudizhu/jsondata/action_space.json. A toy sketch of the kicker abstraction follows the table.

Type               Number of Actions   After Abstraction   Action ID
Solo               15                  15                  0-14
Pair               13                  13                  15-27
Trio               13                  13                  28-40
Trio with single   182                 13                  41-53
Trio with pair     156                 13                  54-66
Chain of solo      36                  36                  67-102
Chain of pair      52                  52                  103-154
Chain of trio      45                  45                  155-199
Plane with solo    21822               38                  200-237
Plane with pair    2939                30                  238-267
Quad with solo     1326                13                  268-280
Quad with pair     858                 13                  281-293
Bomb               13                  13                  294-306
Rocket             1                   1                   307
Pass               1                   1                   308
Total              27472               309
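A toy sketch of the kicker abstraction mentioned above (our illustration; the toolkit's actual logic lives in the decode_action function in rlcard/envs/doudizhu.py):

    from collections import Counter

    def abstract_kicker(action):
        # Keep the major part (here, a trio) and blur each kicker
        # card into '*'.
        counts = Counter(action)
        major = next(c for c, n in counts.items() if n >= 3)
        return major * 3 + '*' * (len(action) - 3)

    assert abstract_kicker('33344') == '333**'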

Payoff¶

If the landlord is the first to get rid of all the cards in his hand, he wins and receives a reward of 1, while the two peasants lose and receive a reward of 0. Similarly, if one of the peasants is the first to get rid of all the cards in hand, both peasants win and receive a reward of 1, while the landlord loses and receives a reward of 0.

Simple Dou Dizhu¶

Simple Dou Dizhu is a smaller version of Dou Dizhu. The deck only consists of the 7 ranks from ‘8’ to ‘A’ (8, 9, T, J, Q, K, A), with four cards of different suits in each rank. What’s more, unlike the landlord in Dou Dizhu, the landlord in Simple Dou Dizhu only has one more card than the peasants. The rules of this game are the same as the rules of Dou Dizhu; because each player gets fewer cards, games simply end faster.

State Representation of Simple Dou Dizhu¶

This is almost the same as the state representation of Dou Dizhu, but the size of the ‘deck’ is reduced from 54 to 28 cards, and the number of ‘seen_cards’ is reduced from 3 to 1. The following table shows the structure of the state:

deck
  A string of one pack of 28 cards without the Black Joker and Red Joker. Each character represents a card. For conciseness, we use 'T' for '10'.
  Example: 88889999TTTTJJJJQQQQKKKKAAAA

seen_cards
  The one face-down card distributed to the landlord after bidding. The card is then made public to all players.
  Example: K

landlord
  An integer of the landlord's id.
  Example: 0

self
  An integer of the current player's id.
  Example: 1

initial_hand
  All cards the current player owned when the game started. It does not change as cards are played.
  Example: 8TTJJQQKA

trace
  A list of tuples recording every action in one game. The first entry of each tuple is a player's id; the second is that player's action.
  Example: [(0, '8'), (1, 'A'), (2, 'pass'), (0, 'pass')]

played_cards
  The cards that have been played by the three players as the game progresses, sorted from low to high.
  Example: ['8', 'A']

others_hand
  The union of the other two players' current hands.
  Example: 889999TTJJQQKKKAAA

current_hand
  The current hand of the current player.
  Example: 8TTJJQQK

actions
  The legal actions the current player can take.
  Example: ['J', 'TTJJQQ', 'TT', 'Q', 'T', 'K', 'QQ', '8', 'JJ']

State Encoding of Simple Dou Dizhu¶

The state encoding is the same as in the Dou Dizhu game.

Action Encoding of Simple Dou Dizhu¶

The action encoding is the same as in the Dou Dizhu game. Because of the reduced deck, the number of encoded actions is reduced from 309 to 131.

Payoff of Simple Dou Dizhu¶

The payoff is the same as in the Dou Dizhu game.

Mahjong¶

Mahjong is a tile-based game developed in China that has spread throughout the world since the 20th century. It is commonly played by 4 players with a set of 136 tiles. In turn, players draw and discard tiles until they complete a legal hand using the 14th drawn tile to form 4 sets and a pair. We revised the game into a simple version in which all winning sets are equal, and a player wins as long as she completes 4 sets and a pair. Please refer to the details on Wikipedia or Baike.

State Representation of Mahjong¶

The state representation of Mahjong is encoded as 6 feature planes, where each plane has 34 x 4 dimensions. For each plane, the rows represent each kind of tile and the columns indicate how many copies of that tile are in the given set (please refer to the action space table). The encoded information is as follows:

Plane   Feature
0       the tiles in the current player's hand
1       the played tiles on the table
2-5     the public piles of each player

Action Space of Mahjong¶

There are 38 actions in Mahjong.

Action ID   Action
0 ~ 8       Bamboo-1 ~ Bamboo-9
9 ~ 17      Characters-1 ~ Characters-9
18 ~ 26     Dots-1 ~ Dots-9
27          Dragons-green
28          Dragons-red
29          Dragons-white
30          Winds-east
31          Winds-west
32          Winds-north
33          Winds-south
34          Pong
35          Chow
36          Gong
37          Stand

Payoff of Mahjong¶

The reward is determined by the terminal state of the game: the winning player is rewarded with 1, and losing players are punished with -1. If no one wins the game, all players receive a reward of 0.

No-limit Texas Hold’em¶

No-limit Texas Hold’em has similar rules to Limit Texas Hold’em, except that each player is not restricted to a fixed raise amount or a limited number of raises. In No-limit Texas Hold’em, a player may raise by at least the previous raise amount in the same round (or by the minimum raise amount set before the game if no one has raised yet), and by up to the player’s remaining stack. The number of raises is also unlimited.

State Representation of No-Limit Texas Hold’em¶

The state representation is similar to that of the Limit Hold’em game. The state is represented as 52 card elements plus 2 elements for the players’ chips, as below:

Index   Meaning
0~12    Spade A ~ Spade K
13~25   Heart A ~ Heart K
26~38   Diamond A ~ Diamond K
39~51   Club A ~ Club K
52      Chips of player 1
53      Chips of player 2

Action Encoding of No-Limit Texas Hold’em¶

There are 6 actions in No-limit Texas Hold’em. They are encoded as below.

Note: starting from action ID 3, the action indicates the amount the player should put into the pot when choosing ‘Raise’. Action IDs 3 to 5 correspond to raising by half the pot, the full pot, and going all in, respectively. A toy sketch of these bet sizes follows the table.

Action ID   Action
0           Fold
1           Check
2           Call
3           Raise Half Pot
4           Raise Full Pot
5           All In
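A toy sketch of the bet sizes implied by action IDs 3-5 (the helper is ours; the toolkit's actual chip accounting may differ):

    def raise_amount(action_id, pot, remaining_stack):
        if action_id == 3:    # Raise Half Pot
            return min(pot // 2, remaining_stack)
        if action_id == 4:    # Raise Full Pot
            return min(pot, remaining_stack)
        if action_id == 5:    # All In
            return remaining_stack
        raise ValueError('not a raise action')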

Payoff of No-Limit Texas Hold’em¶

The reward is calculated in big blinds per hand. For example, a reward of 0.5 (-0.5) means that the player wins (loses) 0.5 times the amount of the big blind.

UNO¶

Uno is an American shedding-type card game that is played with a specially printed deck. The game is for 2-10 players. Every player starts with seven cards, dealt face down. The rest of the cards are placed face down in a draw pile, and a space next to the pile is designated for a discard pile. The top card is placed in the discard pile, and the game begins. The first player is normally the player to the left of the dealer, and gameplay usually follows a clockwise direction. Every player views his or her cards and tries to match the card in the discard pile. Players have to match either by number, color, or symbol/action. If a player has no matches, they must draw a card; if that card can be played, they play it, otherwise they keep it. The objective of the game is to be the first player to get rid of all the cards in hand. For detailed rules, please refer to Wikipedia or Uno Rules. In our toolkit, the number of players is 2.

State Representation of Uno¶

In the state representation, each card is represented as a string of color and trait (number or symbol/action), where ‘r’, ‘b’, ‘y’ and ‘g’ represent red, blue, yellow and green, respectively. At each decision point of the game, the corresponding player is able to observe the current state (or information set in an imperfect information game). The state consists of all the information that the player can observe from his view. We encode the information into a readable Python dictionary. The following table shows the structure of the state:

hand
  A list of the player's current hand.
  Example: ['g-wild', 'b-0', 'g-draw_2', 'y-skip', 'r-draw_2', 'y-3', 'y-wild']

target
  The top card in the discard pile.
  Example: 'g-wild'

played_cards
  The cards that have been played by the players as the game progresses.
  Example: ['g-3', 'g-wild']

others_hand
  The union of the other player's current hand.
  Example: ['b-0', 'g-draw_2', 'y-skip', 'r-draw_2', 'y-3', 'r-wild']

State Encoding of Uno¶

In the Uno environment, we encode the state into 7 feature planes. The size of each plane is 4*15. The 4 rows correspond to the four colors, and the 15 columns correspond to the 10 number cards from 0 to 9 and the 5 special cards: “Wild”, “Wild Draw Four”, “Skip”, “Draw Two”, and “Reverse”. Each entry of a plane can be either 1 or 0. Note that the current encoding method is just an example of how the features can be encoded. Users are encouraged to encode the state for their own purposes by modifying the extract_state function in rlcard/envs/uno.py. The example encoded planes are as below:

Plane   Feature
0-2     hand
3       target
4-6     others' hand

We use 3 planes to represent a player’s hand. Specifically, planes 0-2 mark the cards of which the player holds 0 copies, 1 copy, and 2 copies, respectively. Planes 4-6 are the same. A minimal encoding sketch follows.
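A minimal sketch of this encoding (the row/column order and trait spellings are illustrative assumptions, not the toolkit's exact ones):

    import numpy as np

    COLORS = ['r', 'g', 'b', 'y']
    TRAITS = ([str(n) for n in range(10)]
              + ['wild', 'wild_draw_4', 'skip', 'draw_2', 'reverse'])

    def hand_to_planes(hand):
        planes = np.zeros((3, 4, 15), dtype=np.int8)
        for row, color in enumerate(COLORS):
            for col, trait in enumerate(TRAITS):
                count = hand.count(color + '-' + trait)   # 0, 1, or 2
                planes[count, row, col] = 1
        return planes

    planes = hand_to_planes(['g-wild', 'b-0', 'y-3', 'y-3'])
    assert planes[2, 3, 3] == 1   # two copies of 'y-3'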

Action Encoding of Uno¶

There are 61 actions in Uno. They are encoded as below. The detailed mapping of each action to its ID is in rlcard/games/uno/jsondata/action_space.json:

Action ID   Action
0~9         Red number cards from 0 to 9
10~12       Red action cards: skip, reverse, draw 2
13          Red wild card
14          Red wild and draw 4 card
15~24       Green number cards from 0 to 9
25~27       Green action cards: skip, reverse, draw 2
28          Green wild card
29          Green wild and draw 4 card
30~39       Blue number cards from 0 to 9
40~42       Blue action cards: skip, reverse, draw 2
43          Blue wild card
44          Blue wild and draw 4 card
45~54       Yellow number cards from 0 to 9
55~57       Yellow action cards: skip, reverse, draw 2
58          Yellow wild card
59          Yellow wild and draw 4 card
60          Draw

Payoff of Uno¶

Each player will receive a reward of -1 (lose) or 1 (win) at the end of the game.

Gin Rummy¶

Gin Rummy is a popular two-person card game using a regular 52-card deck (ace being low). The dealer deals 11 cards to his opponent and 10 cards to himself. Each player tries to form melds of 3+ cards of the same rank or 3+ cards of the same suit in sequence. If the deadwood count of the non-melded cards is 10 or less, the player can knock. If all cards can be melded, the player can gin. Please refer to the details on Wikipedia.

If a player knocks or gins, the hand ends, each player puts down their melds, and their scores are determined. If a player knocks, the opponent can lay off some of his deadwood cards if they extend melds of the knocker. The score is the difference between the two deadwood counts of the players. If the score is positive, the player going out receives it. Otherwise, if the score is zero or negative, the opponent has undercut the player going out and receives the value of the score plus a 25-point undercut bonus.

The non-dealer discards first (or knocks or gins if he can). If the player has not knocked or ginned, the next player can pick up the discard or draw a card from the face-down stockpile. He can knock or gin, ending the hand; otherwise, he must discard, and the next player continues in the same fashion. If the stockpile is reduced to only two cards, the hand is declared dead and no points are scored.

State Representation of Gin Rummy¶

The state representation of Gin Rummy is encoded as 5 feature planes, where each plane is of dimension 52. For each plane, the column indicates the presence of the card (ordered from AS to KC). The encoded information is as follows, with a small decoding sketch after the table:

Plane   Feature
0       the cards in the current player's hand
1       the top card of the discard pile
2       the dead cards: cards in the discard pile (excluding the top card)
3       opponent known cards: cards picked up from the discard pile, but not discarded
4       the unknown cards: cards in the stockpile or in the opponent's hand (but not known)
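As an illustrative sketch, the hand plane can be read back into card names like this (the suit order between AS and KC is our assumption):

    RANKS = ['A', '2', '3', '4', '5', '6', '7',
             '8', '9', 'T', 'J', 'Q', 'K']
    SUITS = ['S', 'H', 'D', 'C']

    def plane_to_cards(plane):
        # Column i encodes rank i % 13 of suit i // 13.
        return [RANKS[i % 13] + SUITS[i // 13]
                for i in range(52) if plane[i] == 1]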

Action Space of Gin Rummy¶

There are 110 actions in Gin Rummy.

Action ID   Action
0           score_player_0_action
1           score_player_1_action
2           draw_card_action
3           pick_up_discard_action
4           declare_dead_hand_action
5           gin_action
6 - 57      discard_action
58 - 109    knock_action

Payoff of Gin Rummy¶

The reward is determined by the terminal state of the game. Note that the reward is different from that of the standard game. A player who gins is awarded 1 point. A player who knocks is awarded 0.2 points. The losing player is punished by the negative of their deadwood count divided by 100.

If the hand is declared dead, both players are punished by the negative of their deadwood count divided by 100.

Settings¶

The following options can be set.

Option                                 Default value
dealer_for_round                       DealerForRound.Random
stockpile_dead_card_count              2
going_out_deadwood_count               10
max_drawn_card_count                   52
is_allowed_knock                       True
is_allowed_gin                         True
is_allowed_pick_up_discard             True
is_allowed_to_discard_picked_up_card   False
is_always_knock                        False
is_south_never_knocks                  False

Variations¶

One can create variations that are easier to train by changing the options and specifying different scoring methods.
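A hedged sketch of applying such options (rlcard.make does accept a config dictionary, but whether these gin-rummy option names from the table above can be passed through it directly is an assumption and may require editing the environment in some versions):

    import rlcard

    env = rlcard.make('gin-rummy', config={
        'going_out_deadwood_count': 10,   # knock threshold
        'is_always_knock': True,          # an easier-to-train variation
    })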

by Nick Christenson

This article originally appeared in the February 2010 issue of Two Plus Two Magazine.

Preliminaries

Last time around I discussed a paper on optimal strategies for poker from the University of Alberta's Computer Poker Research Group (CPRG). This month I examine a paper from the same group on the issue of opponent modeling. It is titled 'Bayes' Bluff: Opponent Modeling in Poker'. It was presented at the 21st Conference on Uncertainty in Artificial Intelligence (UAI) in 2005.

There are two ways to approach creating a winning poker algorithm, and both are important in understanding the game. The first is trying to play the game optimally, that is, trying to 'solve' the game. However, as was discussed in the last article in this series, by no means does playing optimally guarantee that one will win the most money. It just guarantees you won't lose (or, in the case of negative-sum games, you will do no worse than lose your share of the rake). In order to maximize profits against opponents who don't play optimally, one must understand their strategy and then compute a response against it that maximizes profit. The first step in doing this, understanding opponents' strategy, is called 'opponent modeling'.

Bayesian Poker

One of the ways to model an opponent's strategy is using a mathematical technique called Bayesian statistics. This is based around Bayes' theorem. It's easy to misunderstand what Bayes' theorem actually says, so I'll try to explain it as simply as I can.

Suppose there are two events, A and B, and they influence each other. If we find out that B has occurred, then the probability that A will also occur depends not only on the probabilistic relationship between A and B, but also on the probability that A will occur independently of the information that B has occurred.

Here's an example. Suppose we know that sometimes after it rains, it also snows. Suppose we go outside and see that it's snowing. What does that tell us about whether it was raining earlier? If we know the odds of it snowing after it rains, that's only part of the picture. We also need to know the odds that it could just snow without raining. If we have both of these pieces of information, then using Bayesian statistics we can make a prediction about the likelihood that it was raining earlier.
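In symbols, this is just Bayes' theorem applied to the example (a standard statement of the theorem; the notation is mine, not the article's):

    P(\text{rain} \mid \text{snow}) =
        \frac{P(\text{snow} \mid \text{rain}) \, P(\text{rain})}{P(\text{snow})}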

In the abstract case, if we don't know anything about whether or not B occurred, then in a vacuum we may still make a prediction about the probability that A will occur. In the example this corresponds to the probability that it might have just rained today. This probability distribution is called a 'prior', as it's what we think the likelihood of some event is 'prior' to having any additional information that would influence our opinion.

If we know B, or whether or not it's snowing in the example, that can change our estimate of whether or not it might have been raining earlier. This new probability distribution is called the 'posterior' (meaning 'after'). Yes, statistics students make jokes about this. On rare occasion they're even funny. I bring this whole thing up because in the 'Opponent Modeling' paper I'm discussing, if you don't understand what 'prior' and 'posterior' mean you'll be completely lost.

Sometimes Bayes' theorem sounds like simple if/then conditional probabilities. It's more than that. This can be confusing. In any case, this technique can be used to model poker opponents. If we know what players show down and what percentage of the time they adopt various betting strategies, we can make some estimates about how they would play various hands. From that we can make predictions about what hand ranges they might hold.

Introduction

The paper begins with the authors setting the stage for what they're going to demonstrate. Assuming our opponent plays sub-optimally, to maximize profit we must take two steps:

  1. Figure out our opponent's strategies.
  2. Determine a best response to their plays, including their mistakes.

Neither step is trivial, and both are discussed in this paper.

Several approaches to this problem are possible, and good solutions are difficult. This difficulty has several causes. First, we won't ever have complete knowledge of our opponents' strategy. This is due both to the fact that we'll never have enough information to fill in all the gaps, and to the fact that it's unlikely that our opponent will continue to play the way they always have.


Poker

The second section of the paper is more routine for poker players than it is for mathematicians or artificial intelligence researchers. The authors first define the characteristics of a generic hold 'em game: each player is dealt one or more hole cards, there will be one or more board cards, there will be two or more rounds of betting, and between rounds of betting one or more additional board cards will be revealed. The authors are examining two-player limit Texas hold 'em, as has traditionally been the focus of the CPRG folks.

In addition to looking at Texas hold 'em, they also examine an abbreviated game, which they call 'Leduc hold 'em'. Leduc is a small town near Edmonton, near the home of the University of Alberta and the CPRG. This game is similar in purpose to Rhode Island hold 'em, which was explained in an earlier article. The naming is essentially the same joke; I guess the difference is that Canada doesn't have any truly small provinces. For research purposes, the nice thing about Leduc (or Rhode Island) hold 'em is that it has many of the same characteristics as Texas hold 'em, but is small enough that optimal strategies can be determined explicitly.

The authors then go on to discuss some of the difficulties in creating good poker-playing algorithms, but I won't discuss those here because what they have to say is quite accessible in the paper.

Modeling the Opponent

When modeling the opponent, the authors assume their opponent's strategy is 'stationary', that is, it's not changing over time. Of course, this is a bad assumption to make about poker players in general. It precludes dealing with players who learn over time, and it precludes adjusting effectively to players who change gears, but it's a good place to start. As an extension of this, they also assume hands are 'i.i.d.', which stands for 'independent and identically distributed'. This means that, in addition to one hand not influencing subsequent hands, the game is fair.

Before we can build a strategy for our opponent, we must come up with a representation for the data we'll store about each hand in some database. The authors explain that for each sample hand they're using to derive an opponent strategy, they're storing hole card information for both players (when available), board cards, and the bet sequence for each betting round.

The idea is that when we see a given hand being played, we can compare it to previous hands in our database and come up with a Bayesian probability distribution for the types of hands our opponent might have. This distribution is what is called the 'posterior'.

Responding to the Opponent

Having a distribution of what we think our opponent's possible hands might be is only half the battle. Now we must compute a response to this hand range. The authors consider several options:

Bayesian Best Response (BBR) - This is a direct computation of the highest EV response given the expected distribution of all possible hole cards. The authors point out that solving this is equivalent to solving something called the Expectimax algorithm. The Expectimax algorithm is the solution to a game tree where we are picking each node on the tree such that we expect to maximize our expectation for the game. It's sort of a multi-round generalization of the Minimax algorithm of elementary game theory. The problem with BBR is that it's very difficult to compute, even if you have all the information you need. So, for a 'real' poker game, this isn't practical. The best we can do is find approximations for BBR.

Max a posteriori response (MAP response) - 'A posteriori' is Latin for 'from the latter', meaning 'from effects to causes'. In this method what we do is determine the best possible result for our opponent given the strategy they're playing, and then just calculate what our best response is to that strategy. For complex games this isn't trivial either, but it's much easier than computing a BBR solution.

Thompson's Response - Here we pick a strategy from our opponent's distribution of previous hands weighted by Bayes' theorem and play a best response to that particular strategy. In some sense, with MAP we find a counter-strategy to the worst-case scenario. With Thompson's we find a counter-strategy to the most likely scenario.
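A toy sketch of the contrast between MAP and Thompson's (all strategy names and weights here are hypothetical, and computing the actual best response is left abstract):

    import random

    posterior = {'tight': 0.5, 'loose': 0.3, 'maniac': 0.2}

    def map_choice(posterior):
        # Best-respond to the single most probable opponent strategy.
        return max(posterior, key=posterior.get)

    def thompson_choice(posterior):
        # Sample a strategy in proportion to its posterior weight,
        # then best-respond to the sampled strategy.
        names, weights = zip(*posterior.items())
        return random.choices(names, weights=weights)[0]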

Later in the paper the authors will compare each of these methods of determining a response to an opponent's play.

Priors


What we call the 'prior' is equivalent to what most poker players would call an opponent's hand database. Without a good prior we can't make good guesses about how our opponent's actions correspond with previous play, and without this information we have no chance of calculating a profitable response.

A good prior should have several properties. It should capture the strategy of our opponent. It should be set up to make it easy to calculate our posterior. Its structure should make it easy to calculate good responses. Given all this, it should be as small as possible. Obviously, some of these work in opposition to each other. Generating a good prior is not simple, and I'm not aware of a general approach to creating or improving a prior.

The authors explore different priors for the two different poker games. For Leduc hold 'em they use a Dirichlet distribution. This is a special kind of probability distribution that is appropriate for this situation. Unfortunately, I don't know how to explain the details of why they chose this particular distribution in simple terms, so you'll just have to trust the authors here. In any case, the game space is small enough so they can take a good sample of an opponent's play that spans many possible hand outcomes.

The second prior, which they use for the Texas hold 'em samples, is what they call an 'informed' prior. That is, a skilled player selects what he or she feels is a subset of hands that adequately represents some other player's strategy. They use these samples to create several parameters to define how an opponent plays. These are described in Table 1 of the paper, and many of them will seem quite familiar to anyone who has used opponent data-gathering software designed for online poker.

Results were averaged over 1000 trials for Leduc hold 'em and 280 trials for Texas hold 'em. Each trial (what poker players would call a session) consisted of 200 hands. For each trial, 1000 strategies were sampled from the prior and used for opponent modeling. Also in each trial, the different Bayesian methods (BBR, MAP, and Thompson's) were used to come up with a response to each strategy.


These algorithms also played against two other opponents: 'Opti', which in Leduc hold 'em is the optimal strategy, and a 'bot called 'Frequentist', which is an adaptive opponent-modeling program that corresponds to the 'Vexbot' opponent found in the commercially available Poki's Poker Academy software.

Results

Leduc hold 'em:

The results of the various algorithms playing against the player who generated the prior are shown in Figure 2. Since for Leduc hold 'em we can explicitly compute the best possible response against any strategy, that result is shown in the top line. Needless to say, none of the Bayesian strategies achieve this win rate. Of note, though, is that all of the Bayesian methods converge to similar results quite quickly. Moreover, all of them perform better than the Frequentist non-Bayesian opponent-modeling 'bot.

Note that all of these opponent modeling strategies produce a better return than the Optimum strategy 'bot. By definition none of them played 'better' than Opti, but all of them played more exploitively. While the results are impressive, it would be easy to overstate their significance. Remember that our Bayesian models are playing against opponents employing stationary strategies from whom the prior was drawn. Clearly, though, these methods have some merit.

I want to point out that in paragraph 4 of section 7.1 the authors explain why they believe Frequentist didn't perform well. There is some good information here that I won't repeat, because I can't improve on what the authors said in that paragraph.

When the Bayesian models play against Opti, they all lose, which we expect (actually, in the long run this has to be the case). The performance of each of the Bayesian models is again comparable. Also, once again the Bayesian 'bots do better than Frequentist.

The Bayesian models also win against Frequentist, although in this case Opti wins more. We also see some small divergence in the long-term performance of each of the Bayesian models, with MAP and Thompson's outperforming BBR. This is unexpected, and the authors discuss why this might be the case in the last two paragraphs of section 7.1. To be honest, though, I find their justification a little unsatisfying.

Texas hold 'em:

For Texas hold 'em we can't generate a Dirichlet prior, nor can we solve BBR for a game this complex. I expect that we also can't solve MAP, although the authors don't say so explicitly. In any case, this limits the number of competitors to play against the opponent who generated the prior. These are Frequentist, Bayesian using the Thompson's response strategy, and Opti.

I don't know what Opti represents in this case; the authors don't say. Obviously, it can't represent a true optimal strategy as it does in the Leduc hold 'em case. I presume it's the pseudo-optimal Opti algorithm discussed in their previous paper, 'Approximating Game-Theoretic Optimal Strategies for Full-Scale Poker', which I discussed in a December '09 article.

In any case, both the Frequentist and the Bayesian opponents performed comparably, with the Bayesian algorithm gaining a slight lead over Frequentist late in the contest. Both beat the non-exploitive Opti algorithm. The authors speculate that a 200-hand prior may not have been enough for the Bayesian model to be able to assert a bigger advantage over Frequentist.

Conclusions

This work shows that Bayesian methods of opponent modeling have promise. Moreover, just a few hundred hands of data can provide useful exploitive strategies, even for real poker games such as Texas hold 'em.


There are many areas of future work for this methodology that are likely to be fruitful. One of these is research into coming up with better priors, both in terms of strategy and in terms of representing an opponent's strategy in a minimal number of hands. Clearly, another avenue for exploration is to make these models more effective against opponents who vary their strategies.