Blackjack¶
Blackjack is a globally popular banking game also known as Twenty-One. The objective is to beat the dealer by reaching a higher score than the dealer without exceeding 21. In the toolkit, we implement a simple version of Blackjack. In each round, the player has only two options: "hit", which takes a card, and "stand", which ends the turn. The player "busts" if his hand exceeds 21 points. After the player completes his hand (chooses "stand" and has not busted), the dealer then reveals his hidden card and "hits" until obtaining at least 17 points.
State Representation of Blackjack¶
In this toy environment, we encode the state as an array [player_score, dealer_score], where player_score is the score currently obtained by the player, and dealer_score is derived from the dealer's face-up card.
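A minimal sketch of this representation is shown below. The helper names and the soft-ace scoring rule are illustrative assumptions, not the toolkit's internal code:

```python
# Sketch of the [player_score, dealer_score] observation described above.
# Cards are given as point values (ace as 1); the soft-ace rule is an assumption.

def blackjack_score(cards):
    """Score a hand, counting an ace as 11 unless that would bust the hand."""
    score = sum(cards)
    if 1 in cards and score + 10 <= 21:
        score += 10
    return score

def encode_state(player_cards, dealer_up_card):
    """Return the [player_score, dealer_score] array used as the observation."""
    return [blackjack_score(player_cards), blackjack_score([dealer_up_card])]

# Example: player holds A + 7 (soft 18), dealer shows a 10.
print(encode_state([1, 7], 10))  # [18, 10]
```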
Action Encoding of Blackjack¶
There are two actions in the simple Blackjack. They are encoded as follows:
Action ID | Action |
---|---|
0 | hit |
1 | stand |
Payoff of Blackjack¶
The player receives a reward of -1 (lose), 0 (tie), or 1 (win) at the end of the game.
Leduc Hold’em¶
Leduc Hold’em is a smaller version of Limit Texas Hold’em (first introduced in Bayes’ Bluff: Opponent Modeling in Poker). The deck consists of only two copies each of King, Queen and Jack, six cards in total. Each game is fixed with two players, two rounds, a two-bet maximum, and raise amounts of 2 and 4 in the first and second round, respectively. In the first round, each player puts 1 unit in the pot and is dealt one card, then betting starts. In the second round, one public card is revealed first, then the players bet again. Finally, the player whose hand has the same rank as the public card is the winner. If neither hand matches, the player with the higher-ranked card wins. Other rules, such as ‘fold’, follow Limit Texas Hold’em.
State Representation of Leduc Hold’em¶
The state is encoded as a vector of length 36. The first 3 elements correspond to the hand card. The next 3 elements correspond to the public card. The last 30 elements correspond to the chips of the current player and the opponent (the chips can range from 0 to 14). The correspondence between the index and the card is as below; an example encoding sketch follows the table.
Index | Meaning |
---|---|
0~2 | J ~ K in hand |
3~5 | J ~ K as public card |
6~20 | 0 ~ 14 chips for the current player |
21~35 | 0 ~ 14 chips for the opponent |
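A sketch of how such a 36-dimensional observation could be assembled, following the index layout in the table (the helper below is illustrative, not the toolkit's code):

```python
import numpy as np

# Illustrative encoding of the 36-dim Leduc Hold'em observation described above.
# Card indices: 0 = J, 1 = Q, 2 = K. Chip counts are assumed to lie in 0..14.

def encode_leduc_state(hand_card, public_card, my_chips, opponent_chips):
    obs = np.zeros(36, dtype=np.int8)
    obs[hand_card] = 1                      # indices 0-2: hand card
    if public_card is not None:
        obs[3 + public_card] = 1            # indices 3-5: public card
    obs[6 + my_chips] = 1                   # indices 6-20: current player's chips
    obs[21 + opponent_chips] = 1            # indices 21-35: opponent's chips
    return obs

# Example: holding a Queen, no public card yet, 1 chip each in the pot.
print(encode_leduc_state(hand_card=1, public_card=None, my_chips=1, opponent_chips=1))
```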
Action Encoding of Leduc Hold’em¶
The action encoding is the same as in Limit Texas Hold’em.
Payoff of Leduc Hold’em¶
The payoff is calculated in the same way as in Limit Texas Hold’em.
Limit Texas Hold’em¶
Texas Hold’em is a popular betting game. Each player is dealt two face-down cards, called hole cards. Then 5 community cards are dealt in three stages (the flop, the turn and the river). Each player seeks the best five-card hand from the hole cards and community cards. There are 4 betting rounds. During each round, each player can choose "call", "check", "raise", or "fold".
In Fixed-Limit Texas Hold’em, each player can only raise by a fixed amount, and in each round the number of raises is limited to 4.
State Representation of Limit Texas Hold’em¶
The state is encoded as a vector of length 72. The first 52 elements represent cards, where each element corresponds to one card. The hand is represented as the two hole cards plus the observed community cards so far. The last 20 elements are the betting history. The correspondence between the index and the card is as below; an example encoding sketch follows the table.
Index | Meaning |
---|---|
0~12 | Spade A ~ Spade K |
13~25 | Heart A ~ Heart K |
26~38 | Diamond A ~ Diamond K |
39~51 | Club A ~ Club K |
52~56 | Raise number in round 1 |
57~61 | Raise number in round 2 |
62~66 | Raise number in round 3 |
67~71 | Raise number in round 4 |
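A sketch of the 72-dimensional observation under one plausible reading of the table (each round's five slots one-hot encode 0 to 4 raises); the helper below is illustrative, not the toolkit's code:

```python
import numpy as np

# Illustrative 72-dim Limit Texas Hold'em observation following the index table above.
# Cards are given as indices in 0..51; raise_counts holds the number of raises
# (0..4) in each of the 4 betting rounds.

def encode_limit_holdem_state(hand, public_cards, raise_counts):
    obs = np.zeros(72, dtype=np.int8)
    for card in hand + public_cards:            # indices 0-51: hole + community cards
        obs[card] = 1
    for round_id, n_raises in enumerate(raise_counts):
        obs[52 + 5 * round_id + n_raises] = 1   # indices 52-71: betting history
    return obs

# Example: Spade A + Heart A in hand, no board yet, one raise in round 1.
print(encode_limit_holdem_state(hand=[0, 13], public_cards=[], raise_counts=[1, 0, 0, 0]))
```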
Action Encoding of Limit Texas Hold’em¶
There are 4 actions in Limit Texas Hold’em. They are encoded as below.
Action ID | Action |
---|---|
0 | Call |
1 | Raise |
2 | Fold |
3 | Check |
Payoff of Limit Texas Hold’em¶
The standard unit used in the literature is milli big blinds per hand (mbb/h). In the toolkit, the reward is calculated in big blinds per hand. For example, a reward of 0.5 (-0.5) means that the player wins (loses) 0.5 times the big blind.
Dou Dizhu¶
Dou Dizhu is one of the most popular Chinese card games with hundreds of millions of players. It is played by three people with one pack of 54 cards including a red joker and a black joker. After bidding, one player becomes the "landlord", who gets an extra three cards, and the other two become "peasants", who work together to fight against the landlord. In each round of the game, the starting player must play a card or a combination, and the other two players can decide whether to follow or "pass." A round is finished if two consecutive players choose "pass." The player who played the cards with the highest rank will be the first to play in the next round. The objective of the game is to be the first player to get rid of all the cards in hand. For detailed rules, please refer to Wikipedia or Baike.
In the toolkit, we implement a standard version of Dou Dizhu. In the bidding phase, we heuristically designate one of the players as the "landlord." Specifically, we count the number of key cards or combinations (high-rank cards and bombs), and the player with the most powerful hand is chosen as the "landlord."
State Representation of Dou Dizhu¶
At each decision point of the game, the corresponding player will be able to observe the current state (or information set in an imperfect information game). The state consists of all the information that the player can observe from his view. We encode the information into a readable Python dictionary. The following table shows the structure of the state:
Key | Description | Example value |
---|---|---|
deck | A string of one pack of 54 cards with Black Joker and Red Joker. Each character means a card. For conciseness, we use ‘T’ for ‘10’. | 3333444455556666777788889999TTTTJJJJQQQQKKKKAAAA2222BR |
seen_cards | Three face-down cards distributed to the landlord after bidding. Then these cards will be made public to all players. | TQA |
landlord | An integer of landlord’s id | 0 |
self | An integer of current player’s id | 2 |
initial_hand | All the cards the current player owned when the game started. It does not change as cards are played. | 3456677799TJQKAAB |
trace | A list of tuples which records every action in one game. The first entry of the tuple is the player’s id, the second is the corresponding player’s action. | [(0, ‘8222’), (1, ‘pass’), (2, ‘pass’), (0, ‘6KKK’), (1, ‘pass’), (2, ‘pass’), (0, ‘8’), (1, ‘Q’)] |
played_cards | As the game progresses, the cards which have been played by the three players and sorted from low to high. | [‘6’, ‘8’, ‘8’, ‘Q’, ‘K’, ‘K’, ‘K’, ‘2’, ‘2’, ‘2’] |
others_hand | The union of the other two players’ current hands | 333444555678899TTTJJJQQAA2R |
current_hand | The current hand of current player | 3456677799TJQKAAB |
actions | The legal actions the current player can take | [‘pass’, ‘K’, ‘A’, ‘B’] |
State Encoding of Dou Dizhu¶
In the Dou Dizhu environment, we encode the state into 6 feature planes. The size of each plane is 5*15. Each entry of a plane can be either 1 or 0. The 5 rows represent having 0, 1, 2, 3 or 4 copies of the corresponding card, respectively. The 15 columns run from "3" to "RJ" (Red Joker). For example, if we have a "3", then the entry (1, 0) would be 1, and the rest of column 0 would be 0. If we have a pair of "4", then the entry (2, 1) would be 1, and the rest of column 1 would be 0. Note that the current encoding method is just an example to show how the feature can be encoded. Users are encouraged to encode the state for their own purposes by modifying the extract_state function in rlcard/envs/doudizhu.py. The example encoded planes are as below; an illustrative encoding sketch follows the table:
Plane | Feature |
---|---|
0 | the current hand |
1 | the union of the other two players’ hand |
2-4 | the recent three actions |
5 | the union of all played cards |
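The plane layout can be illustrated with a small example. This is one plausible reading of the description above, not the toolkit's extract_state implementation:

```python
import numpy as np

# Illustrative encoding of one 5x15 Dou Dizhu feature plane:
# row = how many copies of the card we hold (0-4), column = card rank.
CARD_ORDER = ['3', '4', '5', '6', '7', '8', '9', 'T', 'J', 'Q', 'K', 'A', '2', 'B', 'R']

def encode_plane(cards):
    """cards: a string such as '33344', one character per card."""
    plane = np.zeros((5, 15), dtype=np.int8)
    for col, rank in enumerate(CARD_ORDER):
        plane[cards.count(rank), col] = 1   # mark the count row for this rank
    return plane

# Example: a hand containing three 3s and a pair of 4s.
plane = encode_plane('33344')
print(plane[3, 0], plane[2, 1])  # both 1
```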
Action Abstraction of Dou Dizhu¶
The size of the action space of Dou Dizhu is 27472. This number is too large for learning algorithms, so we abstract the original action space and obtain 309 actions. We note that some recent studies also use similar abstraction techniques. The main idea of the abstraction is to make the kicker fuzzy and only focus on the major part of the combination. For example, "33344" is abstracted as "333**". When the predicted action of the agent is not legal, the agent will choose "pass." Thus, the current environment is simple: once the agent learns how to play legal actions, it can beat random agents. Users can also encode the actions for their own purposes (such as increasing the difficulty of the environment) by modifying the decode_action function in rlcard/envs/doudizhu.py. Users are also encouraged to include rule-based agents as opponents. The abstractions in the environment are as below; a small sketch of the kicker abstraction follows the table. The detailed mapping of each action to its ID is in rlcard/games/doudizhu/jsondata/action_space.json:
Type | Number of Actions | Number of Actions after Abstraction | Action ID |
---|---|---|---|
Solo | 15 | 15 | 0-14 |
Pair | 13 | 13 | 15-27 |
Trio | 13 | 13 | 28-40 |
Trio with single | 182 | 13 | 41-53 |
Trio with pair | 156 | 13 | 54-66 |
Chain of solo | 36 | 36 | 67-102 |
Chain of pair | 52 | 52 | 103-154 |
Chain of trio | 45 | 45 | 155-199 |
Plane with solo | 21822 | 38 | 200-237 |
Plane with pair | 2939 | 30 | 238-267 |
Quad with solo | 1326 | 13 | 268-280 |
Quad with pair | 858 | 13 | 281-293 |
Bomb | 13 | 13 | 294-306 |
Rocket | 1 | 1 | 307 |
Pass | 1 | 1 | 308 |
Total | 27472 | 309 |
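The kicker abstraction can be illustrated with a small, hypothetical helper. The real mapping lives in rlcard/games/doudizhu/jsondata/action_space.json; the function below only shows the idea of keeping the major part and replacing kickers with '*':

```python
# Hypothetical illustration of the kicker abstraction described above.

def abstract_trio_with_kickers(action):
    """Abstract a 'trio with single/pair' such as '33344' into '333**'."""
    # Find the rank that appears three times (the major part of the combination).
    trio_rank = next(r for r in set(action) if action.count(r) == 3)
    n_kickers = len(action) - 3
    return trio_rank * 3 + '*' * n_kickers

print(abstract_trio_with_kickers('33344'))  # 333**
print(abstract_trio_with_kickers('555T'))   # 555*
```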
Payoff¶
If the landlord is the first to get rid of all the cards in his hand, he wins and receives a reward of 1, while the two peasants lose and receive a reward of 0. Similarly, if one of the peasants is the first to get rid of all the cards in hand, both peasants win and receive a reward of 1, while the landlord loses and receives a reward of 0.
Simple Dou Dizhu¶
Simple Dou Dizhu is a smaller version of Dou Dizhu. The deck only consists of 7 ranks from ‘8’ to ‘A’ (8, 9, T, J, Q, K, A), with four cards of different suits in each rank. What’s more, unlike the landlord in Dou Dizhu, the landlord in Simple Dou Dizhu only has one more card than the peasants. The rules of this game are the same as the rules of Dou Dizhu; because each player holds fewer cards, games simply end faster.
State Representation of Simple Dou Dizhu¶
This is almost the same as the state representation of Dou Dizhu, but the deck is reduced from 54 cards to 28, and the number of seen cards is reduced from 3 to 1. The following table shows the structure of the state:
Key | Description | Example value |
---|---|---|
deck | A string of one pack of 28 cards without Black Joker and Red Joker. Each character means a card. For conciseness, we use ‘T’ for ‘10’. | 88889999TTTTJJJJQQQQKKKKAAAA |
seen_cards | One face-down card distributed to the landlord after bidding. Then the card will be made public to all players. | K |
landlord | An integer of landlord’s id | 0 |
self | An integer of current player’s id | 1 |
initial_hand | All the cards the current player owned when the game started. It does not change as cards are played. | 8TTJJQQKA |
trace | A list of tuples which records every action in one game. The first entry of the tuple is the player’s id, the second is the corresponding player’s action. | [(0, ‘8’), (1, ‘A’), (2, ‘pass’), (0, ‘pass’)] |
played_cards | As the game progresses, the cards which have been played by the three players and sorted from low to high. | [‘8’, ‘A’] |
others_hand | The union of the other two players’ current hands | 889999TTJJQQKKKAAA |
current_hand | The current hand of current player | 8TTJJQQK |
actions | The legal actions the current player can take | [‘J’, ‘TTJJQQ’, ‘TT’, ‘Q’, ‘T’, ‘K’, ‘QQ’, ‘8’, ‘JJ’] |
State Encoding of Simple Dou Dizhu¶
The state encoding is the same as in Dou Dizhu.
Action Encoding of Simple Dou Dizhu¶
The action encoding is the same as in Dou Dizhu. Because of the reduced deck, the number of encoded actions is reduced from 309 to 131.
Payoff of Simple Dou Dizhu¶
The payoff is the same as in Dou Dizhu.
Mahjong¶
Mahjong is a tile-based game developed in China that has spread throughout the world since the 20th century. It is commonly played by 4 players. The game is played with a set of 136 tiles. In turn, players draw and discard tiles. The goal of the game is to complete a legal hand using the 14th drawn tile to form 4 sets and a pair. We revise the game into a simplified version in which all winning sets are equal: a player wins as long as she forms 4 sets and a pair. Please refer to Wikipedia or Baike for details.
State Representation of Mahjong¶
The state representation of Mahjong is encoded as 6 feature planes, where each plane has 34 x 4 dimensions. For each plane, the columns indicate the count of the given card, and the rows represent each kind of card (please refer to the action space table). The encoded information is as follows:
Plane | Feature |
---|---|
0 | the cards in current player’s hand |
1 | the played cards on the table |
2-5 | the public piles of each player |
Action Space of Mahjong¶
There are 38 actions in Mahjong.
Action ID | Action |
---|---|
0 ~ 8 | Bamboo-1 ~ Bamboo-9 |
9 ~ 17 | Characters-1 ~ Characters-9 |
18 ~ 26 | Dots-1 ~ Dots-9 |
27 | Dragons-green |
28 | Dragons-red |
29 | Dragons-white |
30 | Winds-east |
31 | Winds-west |
32 | Winds-north |
33 | Winds-south |
34 | Pong |
35 | Chow |
36 | Gong |
37 | Stand |
Payoff of Mahjong¶
The reward is determined by the terminal state of the game: the winning player receives a reward of 1 and the losing players receive -1. If no one wins the game, all players' rewards will be 0.
No-limit Texas Hold’em¶
No-limit Texas Hold’em has rules similar to Limit Texas Hold’em. However, unlike Limit Texas Hold’em, where each player can only raise by a fixed amount and the number of raises is limited, in No-limit Texas Hold’em a player may raise by at least the previous raise amount in the same round (or the minimum raise amount set before the game if no one has raised), and up to the player’s remaining stack. The number of raises is also unlimited.
State Representation of No-Limit Texas Hold’em¶
The state representation is similar to that of Limit Texas Hold’em. The state is represented as 52 card elements plus 2 elements for the players’ chips, as below:
Index | Meaning |
---|---|
0~12 | Spade A ~ Spade K |
13~25 | Heart A ~ Heart K |
26~38 | Diamond A ~ Diamond K |
39~51 | Club A ~ Club K |
52 | Chips of player 1 |
53 | Chips of player 2 |
Action Encoding of No-Limit Texas Hold’em¶
There are 6 actions in No-limit Texas Hold’em. They are encoded as below.
*Note: Starting from Action ID 3, the action indicates the amount the player should put into the pot when choosing ‘Raise’. Action IDs 3 to 5 correspond to raising half the pot, the full pot, and all in, respectively (see the sketch after the table below).
Action ID | Action |
---|---|
0 | Fold |
1 | Check |
2 | Call |
3 | Raise Half Pot |
4 | Raise Full Pot |
5 | All In |
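A hedged sketch of how the raise-style actions might translate into chip amounts. The function name and the pot/stack bookkeeping are assumptions for illustration; the environment's exact betting logic may differ:

```python
# Illustrative mapping from the raise-style action IDs (3-5) to chip amounts,
# only to show the intent of "half pot", "full pot" and "all in".

def raise_amount(action_id, pot, my_remaining_stack):
    if action_id == 3:                       # Raise Half Pot
        amount = pot // 2
    elif action_id == 4:                     # Raise Full Pot
        amount = pot
    elif action_id == 5:                     # All In
        amount = my_remaining_stack
    else:
        raise ValueError("not a raise action")
    return min(amount, my_remaining_stack)   # can never bet more than the stack

print(raise_amount(3, pot=40, my_remaining_stack=100))  # 20
print(raise_amount(5, pot=40, my_remaining_stack=100))  # 100
```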
Payoff of No-Limit Texas Hold’em¶
The reward is calculated in big blinds per hand. For example, a reward of 0.5 (-0.5) means that the player wins (loses) 0.5 times the big blind.
UNO¶
Uno is an American shedding-type card game that is played with a specially printed deck. The game is for 2-10 players. Every player starts with seven cards, dealt face down. The rest of the cards are placed in a draw pile face down. Next to the pile, a space should be designated for a discard pile. The top card is placed in the discard pile, and the game begins. The first player is normally the player to the left of the dealer, and gameplay usually follows a clockwise direction. Every player views his/her cards and tries to match the card in the discard pile. Players have to match either the number, the color, or the symbol/action. If the player has no matches, they must draw a card. If that card can be played, they play it; otherwise, they keep the card. The objective of the game is to be the first player to get rid of all the cards in hand. For detailed rules, please refer to Wikipedia or Uno Rules. In our toolkit, the number of players is 2.
State Representation of Uno¶
In the state representation, each card is represented as a string of color and trait (number or symbol/action). ‘r’, ‘b’, ‘y’, ‘g’ represent red, blue, yellow and green, respectively. At each decision point of the game, the corresponding player is able to observe the current state (or information set in an imperfect information game). The state consists of all the information that the player can observe from his view. We encode the information into a readable Python dictionary. The following table shows the structure of the state:
Key | Description | Example value |
---|---|---|
hand | A list of the player’s current hand. | [‘g-wild’, ‘b-0’, ‘g-draw_2’, ‘y-skip’, ‘r-draw_2’, ‘y-3’, ‘y-wild’] |
target | The top card of the discard pile | ‘g-wild’ |
played_cards | As the game progresses, the cards which have been played by the players | [‘g-3’, ‘g-wild’] |
others_hand | The union of the other player’s current hand | [‘b-0’, ‘g-draw_2’, ‘y-skip’, ‘r-draw_2’, ‘y-3’, ‘r-wild’] |
State Encoding of Uno¶
In the Uno environment, we encode the state into 7 feature planes. The size of each plane is 4*15. The 4 rows correspond to the four colors. The 15 columns correspond to the 10 number cards from 0 to 9 and 5 special cards—"Wild", "Wild Draw Four", "Skip", "Draw Two", and "Reverse". Each entry of a plane can be either 1 or 0. Note that the current encoding method is just an example to show how the feature can be encoded. Users are encouraged to encode the state for their own purposes by modifying the extract_state function in rlcard/envs/uno.py. The example encoded planes are as below:
Plane | Feature |
---|---|
0-2 | hand |
3 | target |
4-6 | others’ hand |
We use 3 planes to represent each player’s hand. Specifically, planes 0-2 indicate holding 0, 1, or 2 copies of a card, respectively; planes 4-6 are encoded the same way. A sketch of this encoding follows below.
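The sketch below shows one way to build the three hand planes described above. It is an example, not the toolkit's extract_state code; the 4x15 count array is a hypothetical input:

```python
import numpy as np

# Illustrative 3-plane encoding of a UNO hand: plane k marks the card types
# of which we hold exactly k copies (k = 0, 1, 2).

def encode_hand_planes(card_counts):
    """card_counts: a 4x15 integer array with how many copies of each card we hold."""
    planes = np.zeros((3, 4, 15), dtype=np.int8)
    for count in range(3):
        planes[count] = (card_counts == count)
    return planes

counts = np.zeros((4, 15), dtype=np.int8)
counts[0, 3] = 1   # e.g. one red '3'
planes = encode_hand_planes(counts)
print(planes[1, 0, 3], planes[0, 0, 3])  # 1 0
```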
Action Encoding of Uno¶
There are 61 actions in Uno. They are encoded as below. The detailed mapping of each action to its ID is in rlcard/games/uno/jsondata/action_space.json:
Action ID | Action |
---|---|
0~9 | Red number cards from 0 to 9 |
10~12 | Red action cards: skip, reverse, draw 2 |
13 | Red wild card |
14 | Red wild and draw 4 card |
15~24 | green number cards from 0 to 9 |
25~27 | green action cards: skip, reverse, draw 2 |
28 | green wild card |
29 | green wild and draw 4 card |
30~39 | blue number cards from 0 to 9 |
40~42 | blue action cards: skip, reverse, draw 2 |
43 | blue wild card |
44 | blue wild and draw 4 card |
45~54 | yellow number cards from 0 to 9 |
55~57 | yellow action cards: skip, reverse, draw 2 |
58 | yellow wild card |
59 | yellow wild and draw 4 card |
60 | draw |
Payoff of Uno¶
Each player receives a reward of -1 (lose) or 1 (win) at the end of the game.
Gin Rummy¶
Gin Rummy is a popular two-person card game using a regular 52-card deck (ace being low). The dealer deals 11 cards to his opponent and 10 cards to himself. Each player tries to form melds of 3+ cards of the same rank or 3+ cards of the same suit in sequence. If the deadwood count of the non-melded cards is 10 or less, the player can knock. If all cards can be melded, the player can gin. Please refer to Wikipedia for details.
If a player knocks or gins, the hand ends, each player puts down their melds, and their scores are determined. If a player knocks, the opponent can lay off some of his deadwood cards if they extend melds of the knocker. The score is the difference between the two deadwood counts of the players. If the score is positive, the player going out receives it. Otherwise, if the score is zero or negative, the opponent has undercut the player going out and receives the value of the score plus a 25-point undercut bonus.
The non-dealer discards first (or knocks or gins if he can). If the player has not knocked or ginned, the next player can pick up the discard or draw a card from the face-down stockpile. He can knock or gin and the hand ends. Otherwise, he must discard and the next player continues in the same fashion. If the stockpile is reduced to only two cards, then the hand is declared dead and no points are scored.
State Representation of Gin Rummy¶
The state representation of Gin Rummy is encoded as 5 feature planes, where each plane is of dimension 52. For each plane, the column indicates the presence of the card (ordered from AS to KC). The encoded information is as follows (an example sketch follows the table):
Plane | Feature |
---|---|
0 | the cards in current player’s hand |
1 | the top card of the discard pile |
2 | the dead cards: cards in discard pile (excluding the top card) |
3 | opponent known cards: cards picked up from discard pile, but not discarded |
4 | the unknown cards: cards in stockpile or in opponent hand (but not known) |
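A minimal sketch of one such 52-dimensional plane. The suit ordering in between AS and KC (spades, hearts, diamonds, clubs; ace low) is an assumption for the example, not necessarily the toolkit's ordering:

```python
import numpy as np

# Illustrative 52-dim presence plane for Gin Rummy.
SUITS = ['S', 'H', 'D', 'C']
RANKS = ['A', '2', '3', '4', '5', '6', '7', '8', '9', 'T', 'J', 'Q', 'K']

def card_index(rank, suit):
    return SUITS.index(suit) * 13 + RANKS.index(rank)

def encode_plane(cards):
    """cards: list of (rank, suit) tuples; returns a 52-dim 0/1 vector."""
    plane = np.zeros(52, dtype=np.int8)
    for rank, suit in cards:
        plane[card_index(rank, suit)] = 1
    return plane

# Ace of Spades maps to index 0, King of Clubs to index 51.
print(encode_plane([('A', 'S'), ('K', 'C')])[[0, 51]])  # [1 1]
```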
Action Space of Gin Rummy¶
There are 110 actions in Gin Rummy.
Action ID | Action |
---|---|
0 | score_player_0_action |
1 | score_player_1_action |
2 | draw_card_action |
3 | pick_up_discard_action |
4 | declare_dead_hand_action |
5 | gin_action |
6 - 57 | discard_action |
58 - 109 | knock_action |
Payoff of Gin Rummy¶
The reward is calculated from the terminal state of the game. Note that the reward is different from that of the standard game. A player who gins is awarded 1 point. A player who knocks is awarded 0.2 points. The losing player is punished by the negative of their deadwood count divided by 100.
If the hand is declared dead, both players are punished by the negative of their deadwood count divided by 100.
Settings¶
The following options can be set.
Option | Default value |
---|---|
dealer_for_round | DealerForRound.Random |
stockpile_dead_card_count | 2 |
going_out_deadwood_count | 10 |
max_drawn_card_count | 52 |
is_allowed_knock | True |
is_allowed_gin | True |
is_allowed_pick_up_discard | True |
is_allowed_to_discard_picked_up_card | False |
is_always_knock | False |
is_south_never_knocks | False |
Variations¶
One can create variations that are easier to train agents on by changing the options and specifying different scoring methods.
by Nick Christenson
This article originally appeared in the February 2010 issue of Two Plus Two Magazine.
Preliminaries
Last time around I discussed a paper on optimal strategies for poker from the University of Alberta's Computer Poker Research Group (CPRG). This month I examine a paper from the same group on the issue of opponent modeling. It is titled 'Bayes' Bluff: Opponent Modeling in Poker'. It was presented at the 21st Conference on Uncertainty in Artificial Intelligence (UAI) in 2005.
There are two ways to approach creating a winning poker algorithm, and both are important in understanding the game. The first is trying to play the game optimally, that is, trying to 'solve' the game. However, as was discussed in the last article in this series, by no means does playing optimally guarantee that one will win the most money. It just guarantees you won't lose (or, in the case of negative-sum games, you will do no worse than lose your share of the rake). In order to maximize profits against opponents that don't play optimally, one must understand their strategy and then compute a response against it that maximizes profit. The first step in doing this, understanding opponents' strategy, is called 'opponent modeling'.
Bayesian Poker
One of the ways to model an opponent's strategy is using a mathematical technique called Bayesian statistics. This is based around Bayes' theorem. It's easy to misunderstand what Bayes' theorem actually says, so I'll try to explain it as simply as I can.
Suppose there are two events, A and B, and they influence each other. If we find out that B has occurred, then the probability that A will also occur depends not only on the probabilistic relationship between A and B, but also on the probability that A will occur independently of the information that B has occurred.
Here's an example. Suppose we know that sometimes after it rains, it also snows. Suppose we go outside and see that it's snowing. What does that tell us about whether it was raining earlier? If we know the odds of it snowing after it rains, that's only part of the picture. We also need to know the odds that it could just snow without raining. If we have both of these pieces of information, then using Bayesian statistics we can make a prediction about the likelihood that it was raining earlier.
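As a concrete instance of this reasoning, Bayes' theorem for the rain/snow example can be written as below. The numbers plugged in are purely made up for illustration:

```latex
P(\text{rain} \mid \text{snow})
  = \frac{P(\text{snow} \mid \text{rain})\, P(\text{rain})}{P(\text{snow})}
  = \frac{P(\text{snow} \mid \text{rain})\, P(\text{rain})}
         {P(\text{snow} \mid \text{rain})\, P(\text{rain})
          + P(\text{snow} \mid \text{no rain})\, P(\text{no rain})}

% Illustrative numbers: if P(rain) = 0.3 (the prior), P(snow | rain) = 0.5 and
% P(snow | no rain) = 0.1, then
% P(rain | snow) = (0.5)(0.3) / ((0.5)(0.3) + (0.1)(0.7)) = 0.15 / 0.22 ≈ 0.68.
```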
In the abstract case, if we don't know anything about whether or not B occurred, then in a vacuum we may still make a prediction about the probability that A will occur. In the example, this corresponds to the probability that it might have just rained today. This probability distribution is called a 'prior', as it's what we think the likelihood of some event is 'prior' to having any additional information that would influence our opinion.
If we know B, or whether or not it's snowing in the example, that can change our estimate of whether or not it might have been raining earlier. This new probability distribution is called the 'posterior' (meaning 'after'). Yes, statistics students make jokes about this. On rare occasion they're even funny. I bring this whole thing up because in the 'Opponent Modeling' paper I'm discussing, if you don't understand what 'prior' and 'posterior' mean you'll be completely lost.
Sometimes Bayes' theorem sounds like simple if/then conditional probabilities. It's more than that. This can be confusing. In any case, this technique can be used to model poker opponents. If we know what players show down and what percentage of the time they adopt various betting strategies, we can make some estimates about how they would play various hands. From that, we can make predictions about what hand ranges they might hold.
Introduction
The paper begins with the authors setting the stage for what they're going to demonstrate. Assuming our opponent plays sub-optimally, to maximize profit we must take two steps:
- Figure out our opponent's strategies.
- Determine a best response to their plays, including their mistakes.
Several approaches to this problem are possible, and good solutions are difficult. This difficulty has several causes. First, we won't ever have complete knowledge of our opponents' strategy, both because we'll never have enough information to fill in all the gaps, and because it's unlikely that our opponent will continue to play the way they always have.
Poker
The second section of the paper is more routine for poker players than it is for mathematicians or artificial intelligence researchers. The authors first define the characteristics of a generic hold 'em game: each player is dealt one or more hole cards, there will be one or more board cards, there will be two or more rounds of betting, and between rounds of betting one or more additional board cards will be revealed. The authors are examining two-player limit Texas hold 'em, as has traditionally been the focus of the CPRG folks.
In addition to looking at Texas hold 'em, they also examine an abbreviated game, which they call 'Leduc hold 'em'. Leduc is a small town near Edmonton, near the home of the University of Alberta and the CPRG. This game is similar in purpose to Rhode Island hold 'em, which was explained in an earlier article. The naming is essentially the same joke. I guess the difference is that Canada doesn't have any truly small provinces. For research purposes, the nice thing about Leduc (or Rhode Island) hold 'em is that it has many of the same characteristics as Texas hold 'em, but is small enough that optimal strategies can be determined explicitly.
The authors then go on to discuss some of the difficulties in creating good poker playing algorithms, but I won't discuss those here because what they have to say is quite accessible in the paper.
Modeling the Opponent
When modeling the opponent, the authors assume their opponent's strategy is 'stationary', that is, it's not changing over time. Of course, this is a bad assumption to make for poker players in general. It precludes dealing with players who learn over time, and it precludes adjusting effectively to players who change gears, but it's a good place to start. As an extension of this, they also assume hands are 'i.i.d.', which stands for 'independent and identically distributed'. This means that in addition to not having the effects of one hand influence subsequent hands, the game is fair.
Before we can build a strategy for our opponent, we must come up with a representation for the data we'll store about each hand in some database. The authors explain that for each sample hand they're using to derive an opponent strategy, they store hole card information for both players (when available), board cards, and the bet sequence for each betting round.
The idea is that when we see a given hand being played, we can compare it to previous hands in our database, and come up with a Bayesian probability distribution for the types of hands our opponent might have. This distribution is what is called the 'posterior'.
Responding to the Opponent
Having a distribution of what we think our opponent's possible hands might be is only half the battle. Now we must compute a response to this hand range. The authors consider several options:
Bayesian Best Response (BBR) - This is a direct computation of the highest EV response given the expected distribution of all possible hole cards. The authors point out that solving this is equivalent to solving something called the Expectimax algorithm. The Expectimax algorithm is the solution to a game tree where we pick each node of the tree so as to maximize our expectation for the game. It's sort of a multi-round generalization of the Minimax algorithm of elementary game theory. The problem with BBR is that it's very difficult to compute, even if you have all the information you need. So, for a 'real' poker game, this isn't practical. The best we can do is find approximations for BBR.
Max a posteriori response (MAP response) - 'a posteriori' is Latin for 'from the latter', meaning 'from effects to causes'. In this method, what we do is determine the best possible result for our opponent given the strategy they're playing, and then just calculate what our best response is to that strategy. For complex games this isn't trivial either, but it's much easier than computing a BBR solution.
Thompson's Response - Here we pick a strategy from our opponent's distribution of previous hands, weighted by Bayes' theorem, and play a best response to that particular strategy. In some sense, with MAP we find a counter-strategy to the worst-case scenario. With Thompson's we find a counter-strategy to the most likely scenario.
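As a rough, hypothetical illustration of Thompson's response (not the authors' implementation), one could sample a single opponent model from the posterior and best-respond to that sample. The posterior, the model names and best_response() below are placeholders:

```python
import random

# Rough sketch of Thompson's response as described above: sample one opponent
# strategy from the posterior over models, then play a best response to it.

def thompsons_response(posterior, best_response):
    """posterior: list of (opponent_model, probability) pairs summing to 1."""
    models, weights = zip(*posterior)
    sampled_model = random.choices(models, weights=weights, k=1)[0]
    return best_response(sampled_model)

# Hypothetical usage with two candidate opponent models.
posterior = [("tight-passive", 0.7), ("loose-aggressive", 0.3)]
best_response = lambda model: f"counter-strategy vs {model}"
print(thompsons_response(posterior, best_response))
```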
Later in the paper the authors will compare each of these methods of determining a response to an opponent's play.
Priors
[Figure: the Leduc Hold'em game]
What we call the 'prior' is equivalent to what most poker players would call an opponent's hand database. Without a good prior we can't make good guesses about how our opponent's actions correspond with previous play, and without this information we have no chance of calculating a profitable response.
A good prior should have several properties. It should capture the strategy of our opponent. It should be set up to make it easy to calculate our posterior. Its structure should make it easy to calculate good responses. Given all this, it should be as small as possible. Obviously, some of these goals work in opposition to each other. Generating a good prior is not simple, and I'm not aware of a general approach to creating or improving a prior.
The authors explore different priors for the two different poker games. For Leduc hold 'em they use a Dirichlet distribution. This is a special kind of probability distribution that is appropriate for this situation. Unfortunately, I don't know how to explain the details of why they chose this particular distribution in simple terms, so you'll just have to trust the authors here. In any case, the game space is small enough so they can take a good sample of an opponent's play that spans many possible hand outcomes.
The second prior, which they use for the Texas hold 'em samples, is what they call an 'informed' prior. That is, a skilled player selects what he or she feels is a subset of hands that adequately represents some other player's strategy. They use these samples to create several parameters to define how an opponent plays. These are described in Table 1 of the paper. Many of these will seem quite familiar to anyone who has used opponent data gathering software designed for online poker. The parameters the authors use include the following: