GenAI Poker Tournament Should Be an Entertaining Trainwreck

At the end of October, a Russian programmer plans to pit several large language models (LLMs) against each other in a lengthy game of poker to see which one is the best. However, if PokerScout’s experiments on a couple of prominent chatbots are any indication, the quality of play in the study will be lower than in an average $1-$2 game.

Max Pavlov is behind PokerBattle.ai. His plan? A simulated all-AI cash game featuring prominent models such as Grok, Gemini, Claude, and OpenAI’s ChatGPT. They’ll battle each other Oct. 27-Nov. 3 using the following parameters:

$10-$20 No-Limit Hold’em cash game
No antes, straddles, or blind increases
Up to nine-handed tables
100-big-blind starting stacks
Top up when below 50 big blinds

Whichever LLM ends the week with the biggest bankroll will be declared the winner.

Pavlov explained his rationale for the experiment:

LLMs naturally seem like a tool that could help with learning by breaking down hands, explaining decisions and essentially integrating all the different parts of the game into one coherent whole. But within the poker community, there’s still no consensus on how reliable LLM reasoning really is.

To get a clearer verdict on how well different LLMs reason in poker situations, we decided to organize a tournament.

PokerScout tested some of the AIs that are expected to participate, using the training mode in the solver GTO Wizard.

How LLMs Make Poker Decisions

LLMs differ significantly from solvers like GTO Wizard. Rather than optimizing a poker decision by playing it out hundreds of thousands of times to see which lines generate the most expected value, LLMs are trained on vast troves of publicly available information. Thus, their poker strategies are informed by whatever they “read” when they were learning, some of which might not be correct.

Moreover, they work based on statistical guesses about how likely a word is to occur in certain contexts. Unlike an AI-based poker solver, there’s no mathematical simulation of the game of poker itself happening under the hood of an LLM.

Newer versions use multi-step reasoning, breaking down a problem into sub-contexts. For instance, ChatGPT understands that it has to evaluate things like stack-to-pot ratio, ranges, and hand equity first, then combine those ideas to generate its final output.

Compared to a few years ago, the chatbots’ output at least sounds a lot more like a player that understands the game. However, their actual understanding of poker remains quite limited, PokerScout found.

PokerScout conducted its own experiment to see how a few LLMs performed when presented with a decision tree via a GTO Wizard practice hand. PokerScout used Gemini 2.5 Pro (“reasoning, math, and code”) and Grok Expert (“thinks hard”). OpenAI’s ChatGPT, a listed participant in PokerBattle.ai, did not seem to properly grasp the queries at first, instead turning the question around and asking for the user’s thought process. However, PokerScout got more appropriate responses on a second attempt, using a paid version of ChatGPT and a more structured query.

The LLMs received prompts about the poker hand below, and PokerScout recorded findings that show how limited these programs are when it comes to poker strategy. The setup resembled that of PokerBattle.ai (two-blind cash game, 100 big blinds, short-handed).

Case Study: LLMs Play a Poker Hand

This hand was a single-raised pot, with an under-the-gun villain opening to 2.5 big blinds. Hero defended the big blind with Kc-7c. The flop came 9s-8h-6d, and some of the LLMs began misevaluating the situation immediately.

Two suggested the hero check, and one recommended betting, so they began diverging right away.

Flop: 9s-8h-6d

Grok said: Hero had “king-high with a backdoor flush draw and a gutshot straight draw.” Both it and Gemini said the hero needs a 10 for a straight, seemingly oblivious to the possibility of a 5 completing one as well, which makes this an open-ended draw rather than a gutshot.

Grok recommended checking.
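The missed out is easy to check mechanically. The following is a minimal sketch, not anything the LLMs or GTO Wizard run; the helper names (`rank_values`, `has_straight`, `straight_out_ranks`) are hypothetical and exist only for this illustration. It looks at ranks only and lists which turn ranks would give hero a straight on this flop.

```python
RANKS = "23456789TJQKA"  # index 0 is a deuce, index 12 is an ace


def rank_values(cards):
    """Map cards like 'Kc' to numeric ranks (2..14), ignoring suits."""
    return {RANKS.index(card[0]) + 2 for card in cards}


def has_straight(values):
    """True if the rank set contains five consecutive ranks."""
    vals = set(values)
    if 14 in vals:
        vals.add(1)  # the ace also plays low in A-2-3-4-5
    return any(
        all(v in vals for v in range(low, low + 5))
        for low in range(1, 11)
    )


def straight_out_ranks(hole, board):
    """Ranks that would complete a straight with one more card."""
    known = rank_values(hole + board)
    if has_straight(known):
        return []  # already a made straight, nothing to draw to
    return [
        rank for i, rank in enumerate(RANKS)
        if has_straight(known | {i + 2})
    ]


hole = ["Kc", "7c"]
flop = ["9s", "8h", "6d"]
print(straight_out_ranks(hole, flop))  # ['5', 'T']
```

Both a 5 and a 10 complete the straight, so hero is drawing to two ranks rather than one. The same `has_straight` check returns True for a hand like 7-5 on this flop, since 5-6-7-8-9 is already a made straight.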

Gemini said: Hero had “a gutshot straight draw” and “two overcards,” even though the 7 is clearly not an overcard.

Gemini suggested that hero should check their entire range, but the solver mixes bets and checks with a variety of hands.

Gemini also said hero should be prepared to check-fold to a standard bet. In actuality, the solver continues with all of its hands containing a 7 here, frequently bluff-raising against a bet. The villain’s bet size doesn’t change this: the solver continues in some fashion against all bet sizes.

ChatGPT said: Bet 2 big blinds. ChatGPT recommended the aggressive play because the flop “smashes the big blind’s range relative to UTG’s.”

However, this isn’t true, as the big blind is actually at a very slight equity disadvantage according to GTO Wizard (49.48% vs. 50.52%).

Still, K-7 combos bet more than they check, though the split isn’t very large, meaning there’s no meaningful difference in EV. So, ChatGPT appeared to give the most correct, nuanced response.

GTO Wizard said: Bet small or check.

Actual action: Hero checked, villain checked.

Turn: Kd

Grok said: Bet small.

Grok still appeared confused about the basics of the situation. In one sense, it at least outlined the correct hand for hero, saying hero had top pair with a mediocre kicker. However, it still bizarrely said hero had a gutshot with (somehow) a backdoor flush draw. Confusingly, it also thought 7-5 was an “open-ender.”

Grok recommended betting 2.75 big blinds (50% of the pot). It explained that the hero has roughly 60-70% equity with Kc-7c against a checking range.

Again, though, its explanation for its decision shows faulty reasoning. For one thing, it said a 50% bet is “standard” for “a polarizing hand.” Polarizing bets are actually larger — this size is more of a merging one.

It also dismissed the larger sizes as “too aggressive deep-stacked.” In reality, overbets accounted for two of the three most solver-preferred sizes, and overbets are going to become more common, not less, at deeper stacks.

Gemini said: Bet big.

Like Grok, Gemini correctly noted hero had top pair with a weak kicker.

Gemini recommended betting 4.1 big blinds (~75% of the pot). It explained that the hero should “very likely have the best hand” and should seek value while denying a free card to hands like Qd-Jd.

However, it notably thought the opponent was unlikely to have overpairs. A GTO opponent actually checks back QQ-AA more than 50% of the time, so that assumption isn’t correct.
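The bet sizes quoted in this hand can be checked against the pot with simple arithmetic. This is a minimal sketch that assumes the small blind posted 0.5 big blinds and folded and that no rake was taken. The article does not state either point, but both are consistent with 2.75 big blinds being 50% of the pot. The function name `single_raised_pot` is hypothetical.

```python
def single_raised_pot(open_bb: float, small_blind_bb: float = 0.5) -> float:
    """Pot after an open-raise and a big blind call, with a folded small blind.

    Assumes no rake and no antes (the format described uses neither antes
    nor straddles).
    """
    return 2 * open_bb + small_blind_bb


pot = single_raised_pot(2.5)  # 5.5 big blinds

# The flop was checked through, so the turn pot is the same as the flop pot.
bets = [
    ("ChatGPT, flop", 2.0),
    ("Grok, turn", 2.75),
    ("Gemini, turn", 4.1),
]
for label, bet in bets:
    print(f"{label}: {bet} bb is {bet / pot:.0%} of a {pot} bb pot")
```

Under these assumptions Grok’s size is exactly half the pot and Gemini’s is roughly three-quarters, matching the percentages quoted above, while ChatGPT’s flop bet is a little over a third of the pot.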

ChatGPT said: Check.

ChatGPT correctly noted that the Kd improves the under-the-gun player’s equity (big blind dropped from 49.48% to 46.03%). Although all three lines are acceptable, the check should be the most frequent, so we can give ChatGPT the points here.

It did not produce correct responses when projecting what to do after checking, however. ChatGPT recommended calling only small or medium bets. The solver doesn’t fold K-7 against any bet size, including an all-in shove.
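For context on why bet size matters to a calling decision, here is a minimal sketch of the standard pot-odds break-even calculation. It is not how GTO Wizard reaches its answer; it only shows the raw price hero is offered. The 5.5 big blind pot and the 97.5 big blind remaining stack are assumptions derived from the hand as described (a 2.5 big blind open, a big blind call, a folded small blind, a checked flop, 100 big blind starting stacks, and no rake).

```python
def required_equity(pot_bb: float, bet_bb: float) -> float:
    """Break-even equity for calling bet_bb into a pot of pot_bb.

    Hero risks bet_bb to win pot_bb + bet_bb, so the call breaks even when
    equity equals bet / (pot + 2 * bet). Ignores rake and future betting.
    """
    return bet_bb / (pot_bb + 2 * bet_bb)


pot = 5.5                    # assumed turn pot, in big blinds
remaining_stack = 100 - 2.5  # assumed stack behind after the preflop call

for label, bet in [
    ("half pot", 2.75),
    ("full pot", 5.5),
    ("all-in", remaining_stack),
]:
    equity = required_equity(pot, bet)
    print(f"{label}: call {bet} bb, need {equity:.1%} equity to break even")
```

The required equity rises from 25% against a half-pot bet to just under 50% against a shove. A rule like “call only small or medium bets” is the intuition this arithmetic suggests, but the solver’s answer also depends on which hands the opponent bets at each size, which this sketch does not model.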

ChatGPT also appeared confused about river play. It recommended caution on a 5 or a 10, despite the fact that the hero would make straights on these cards.

GTO Wizard said: Mostly check, mixing in some betting, both big and small. Betting 4.1 big blinds has ever-so-slightly more EV according to the solver, but all options are similar. Thus, it prefers to check to balance its overall check-heavy strategy (81.4% checks).

Actual action: hero bet 4.1 big blinds, villain folded.

LLMs Aren’t Equipped to Provide Valuable Poker Strategy Advice

This is a fairly straightforward hand that doesn’t even reach the river. But its simplicity, and the fact that the LLMs consistently misunderstood basic aspects of it, display a clear truth: these programs aren’t equipped to provide poker strategy advice that has any value.

A paid ChatGPT subscription appeared to give the most nuanced poker advice, but it still misunderstood incredibly obvious things about the hand.

Real-time assistance (RTA) is certainly a concern for poker players worried about online cheating. Programs like GTO Wizard can provide optimal plays, and prominent pros have been caught using such tools while playing.

However, any such RTA that comes from an LLM looks as likely to hurt as to help.

PokerBattle.ai is a fun experiment. But poker players shouldn’t eagerly await the winner in hopes of asking it strategy questions in the future. Leave the poker strategy to the solvers and the experts, and use the LLMs to help answer basic search queries or other simple functions.