Superhuman AI for Multiplayer Poker
Presented by
Hansa Halim, Sanjana Rajendra Naik, Samka Marfua, Shawrupa Proshasty
Introduction
In the past two decades, most superhuman AI systems have only been able to beat human players in two-player zero-sum games. In poker specifically, existing AI models could only beat professionals in two-player settings. Poker is a great challenge in AI and game theory because it captures the difficulty of reasoning with hidden information so elegantly, which makes superhuman AI for multiplayer poker the remaining great milestone in the field. In this paper, the AI, called Pluribus, is capable of defeating professional human poker players in six-player no-limit Texas hold'em, the most commonly played poker format in the world.
Challenges of Multiplayer Games
Many AI systems have reached superhuman performance in games like checkers, chess, two-player limit poker, Go, and two-player no-limit poker. The most common approach these systems take is to approximate a Nash equilibrium. A Nash equilibrium is a set of strategies, one per player, in which no player can do better by unilaterally changing their own strategy. A Nash equilibrium has been proven to exist in every finite game; the challenge is to find it. In two-player zero-sum games, playing a Nash equilibrium strategy is unbeatable in the sense that it guarantees not to lose in expectation, regardless of what the opponent does.
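For reference, a standard formal statement of this condition (notation ours, not taken from the paper) is that a strategy profile \( (\sigma_1^*, \dots, \sigma_n^*) \) is a Nash equilibrium if no player \( i \) can increase their expected utility \( u_i \) by unilaterally deviating to some other strategy \( \sigma_i \):

\[ u_i(\sigma_i^*, \sigma_{-i}^*) \;\ge\; u_i(\sigma_i, \sigma_{-i}^*) \quad \text{for every player } i \text{ and every alternative strategy } \sigma_i, \]

where \( \sigma_{-i}^* \) denotes the equilibrium strategies of all players other than \( i \).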
A shortcoming of current AI systems is that they only try to approximate a Nash equilibrium instead of actively detecting and exploiting weaknesses in their opponents. For example, in Rock-Paper-Scissors the Nash equilibrium is to pick each option uniformly at random. Against this strategy, the best the opponent can achieve is a tie in expectation, but by the same token our player cannot win in expectation either.
Now suppose we try to combine the Nash equilibrium strategy with opponent exploitation: start from the equilibrium strategy, then shift over time to exploit observed weaknesses of the opponent, for example switching to always playing Rock against an opponent who always plays Scissors. However, shifting away from the Nash equilibrium opens up the possibility of the opponent exploiting us in turn: once they notice we always play Rock, they will simply always play Paper.
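The following minimal sketch (our own illustration, not code from the paper; it assumes only NumPy) makes both points concrete using the Rock-Paper-Scissors payoff matrix: the uniform mixed strategy earns zero in expectation against any opponent, while the exploitative deviation of always playing Rock beats always-Scissors but loses once the opponent adapts to always-Paper.

```python
import numpy as np

# Payoff matrix for the row player in Rock-Paper-Scissors.
# Rows/columns are (Rock, Paper, Scissors); +1 = win, -1 = loss, 0 = tie.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

nash = np.array([1/3, 1/3, 1/3])   # uniform mixed strategy (the Nash equilibrium)

# 1) The equilibrium strategy neither wins nor loses in expectation
#    against any opponent strategy.
for opponent in [np.array([1.0, 0.0, 0.0]),    # always Rock
                 np.array([0.5, 0.3, 0.2]),    # some arbitrary mix
                 nash]:
    print(nash @ PAYOFF @ opponent)            # always 0.0

# 2) Deviating to exploit a weak opponent is itself exploitable.
always_rock     = np.array([1.0, 0.0, 0.0])
always_paper    = np.array([0.0, 1.0, 0.0])
always_scissors = np.array([0.0, 0.0, 1.0])

print(always_rock @ PAYOFF @ always_scissors)  # +1.0: exploits always-Scissors
print(always_rock @ PAYOFF @ always_paper)     # -1.0: loses once the opponent adapts
```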
Theoretical Analysis
Lorem Ipsum Bla bla bla
Experimental Results
To evaluate Pluribus, it was tested against human players in two formats. The first format included 5 human players and one copy of Pluribus (5H+1AI). The 13 human participants were poker players who have each won more than $1M playing professionally and who were given cash incentives to play their best. In the 5H+1AI format, 10,000 hands of poker were played over 12 days. Players were anonymized with aliases that remained consistent throughout all of their games; the aliases let players track the tendencies and playing styles of each opponent over the 10,000 hands while keeping identities hidden.
The second format included one human player and 5 copies of Pluribus (1H+5AI). Two more professional players split another 10,000 hands of poker, playing 5,000 hands each, and followed the same aliasing process as in the first format. Performance was measured in milli big blinds per game (mbb/game), where the big blind is the initial amount of money the second player must put into the pot and one milli big blind is one-thousandth of that amount; this is the standard performance measure in the poker AI field. Additionally, AIVAT was used as a variance-reduction technique to control for luck in the games, and one-tailed t-tests at a 95% confidence level were used to check whether Pluribus's win rate was significantly profitable.
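As a small sketch of the unit (our own illustration with hypothetical dollar amounts, not figures from the experiment), winnings are converted to mbb/game by normalizing by the big blind size and the number of hands played:

```python
def mbb_per_game(total_winnings, big_blind, num_hands):
    """Convert total winnings into milli big blinds per hand played (mbb/game).

    One milli big blind (mbb) is 1/1000 of the big blind.
    """
    return 1000.0 * total_winnings / (big_blind * num_hands)

# Hypothetical example: winning $48,000 over 10,000 hands at a $100 big blind
# corresponds to a win rate of 48 mbb/game.
print(mbb_per_game(total_winnings=48_000, big_blind=100, num_hands=10_000))  # 48.0
```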
After applying AIVAT, it was found that in the first format Pluribus won 48 mbb/game on average, with a standard error of 25 mbb/game; this is considered a very high win rate in six-player Texas hold'em. The p-value for Pluribus being profitable in this format was 0.028. In the second format, Pluribus won 32 mbb/game on average, with a standard error of 15 mbb/game, and was determined to be profitable with a p-value of 0.014.
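As a rough consistency check (our own back-of-envelope calculation, not from the paper), the reported one-tailed p-values can be approximated from the means and standard errors above by treating the win-rate estimates as approximately normal, which is reasonable given the large number of hands:

```python
from scipy.stats import norm

# One-tailed test of "win rate > 0" using a normal approximation.
for label, mean_mbb, se_mbb in [("5H+1AI", 48, 25), ("1H+5AI", 32, 15)]:
    z = mean_mbb / se_mbb
    p = 1.0 - norm.cdf(z)
    print(f"{label}: z = {z:.2f}, one-tailed p ~ {p:.3f}")

# 5H+1AI: z = 1.92, p ~ 0.027  (paper reports 0.028)
# 1H+5AI: z = 2.13, p ~ 0.016  (paper reports 0.014; the paper ran a t-test on
#                               AIVAT-adjusted winnings, so the numbers differ slightly)
```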
Because Pluribus's strategy was developed entirely through self-play, without any human data, it offers an unbiased and different perspective on how optimal play can be attained. It suggests that the conventional play of "limping" (calling the big blind rather than folding or raising) is suboptimal: Pluribus experimented with limping early in self-play but eliminated it from its strategy as training progressed. On the other hand, "donk betting" (leading out with a bet after ending the previous betting round with a call), a play conventionally dismissed by human players, was used by Pluribus far more often than humans use it and turned out to be profitable.
Discussion
Lorem Ipsum Bla bla bla
Conclusion
Lorem Ipsum Bla bla bla
Critiques
Lorem Ipsum Bla bla bla
References
[1] Lorem Ipsum Bla bla bla [2] Lorem Ipsum Bla bla bla