A theoretical poker contest pitting several large language models (LLMs) against each other in a cash game wrapped up last Monday. When the dust settled after 3,799 no-limit hold’em hands, OpenAI’s o3 model (from the creators of the popular generative AI ChatGPT) had narrowly taken the PokerBattle AI crown over Claude Sonnet 4.5 and Elon Musk’s Grok.
Llama 4, created by Meta (Facebook/Instagram), produced the worst results by far. It burned through its entire $100,000 bankroll before the contest ended. Blinds for the theoretical poker game were $10-$20, so at a 100-big-blind ($2,000) buy-in, Llama lost 50 buy-ins in just 3,501 hands.
PokerScout reached out to PokerBattle AI creator Max Pavlov with a series of questions regarding the experiment. He had not provided a response as of Thursday afternoon.
PokerBattle AI Full Results
Here’s a full look at how each LLM finished in PokerBattle AI.
| Place | LLM | Result | Hands Played |
|---|---|---|---|
| 1 | OpenAI o3 | +$36,691 | 3,799 |
| 2 | Claude Sonnet 4.5 | +$33,641 | 3,799 |
| 3 | Grok | +$28,796 | 3,799 |
| 4 | DeepSeek R1 | +$18,416 | 3,799 |
| 5 | Gemini 2.5 Pro | +$14,655 | 3,799 |
| 6 | Mistral Magistral | +$3,281 | 3,799 |
| 7 | Kimi K2 | -$14,370 | 3,799 |
| 8 | Z.AI GLM 4.6 | -$21,510 | 3,799 |
| 9 | Meta Llama 4 | -$100,000 | 3,501 |
The massive loss by Llama allowed six of the nine bots in the game to profit.
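For readers who prefer win rates to dollar figures, the table can be converted into big blinds per 100 hands (bb/100), the usual yardstick for cash games. The sketch below is only illustrative: the $20 big blind comes from the stated $10-$20 blinds, the 100-big-blind buy-in is an assumption, and the function names are made up for this example.

```python
# Sketch: convert the published dollar results into bb/100 and buy-ins.
BIG_BLIND = 20    # from the article's $10-$20 blinds
BUY_IN_BB = 100   # assumption: a 100-big-blind ($2,000) buy-in

# (result in dollars, hands played), copied from the results table
RESULTS = {
    "OpenAI o3": (36_691, 3_799),
    "Claude Sonnet 4.5": (33_641, 3_799),
    "Grok": (28_796, 3_799),
    "DeepSeek R1": (18_416, 3_799),
    "Gemini 2.5 Pro": (14_655, 3_799),
    "Mistral Magistral": (3_281, 3_799),
    "Kimi K2": (-14_370, 3_799),
    "Z.AI GLM 4.6": (-21_510, 3_799),
    "Meta Llama 4": (-100_000, 3_501),
}


def bb_per_100(result_dollars, hands):
    """Win rate in big blinds per 100 hands."""
    return result_dollars / BIG_BLIND / hands * 100


def buy_ins(result_dollars):
    """Result expressed in buy-ins of the assumed size."""
    return result_dollars / (BIG_BLIND * BUY_IN_BB)


if __name__ == "__main__":
    for name, (result, hands) in RESULTS.items():
        print(f"{name:20s} {bb_per_100(result, hands):+8.1f} bb/100"
              f" {buy_ins(result):+7.1f} buy-ins")
```

On these numbers, o3 finished at roughly +48 bb/100 and Llama at roughly -143 bb/100.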
It’s worth noting that the AIs were allowed to adjust to their opponents’ games. They were allowed to take notes and were given game stats such as each player’s VPIP (the rate at which a player voluntarily puts money in the pot). Thus, with Llama effectively acting as the game’s “whale,” the results may reflect which LLM most efficiently exploited it, rather than which was playing the closest to a theoretically optimal strategy.
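As a rough illustration of how a stat like VPIP is tallied, here is a minimal sketch. The `HandRecord` structure and the sample data are hypothetical and are not PokerBattle AI’s actual format. It follows the usual tracking-software convention that calling or raising preflop counts, while merely posting a blind does not.

```python
from dataclasses import dataclass


@dataclass
class HandRecord:
    """Hypothetical record of one player's involvement in one hand."""
    player: str
    voluntarily_invested: bool  # called or raised preflop; a posted blind alone is False


def vpip(records, player):
    """Fraction of the player's dealt hands in which they voluntarily put money in."""
    dealt = [r for r in records if r.player == player]
    if not dealt:
        return 0.0
    return sum(r.voluntarily_invested for r in dealt) / len(dealt)


# Made-up sample: a loose player and a tight player over four hands each.
sample = (
    [HandRecord("loose_bot", v) for v in (True, True, True, False)]
    + [HandRecord("tight_bot", v) for v in (False, True, False, False)]
)

if __name__ == "__main__":
    for name in ("loose_bot", "tight_bot"):
        print(f"{name}: VPIP {vpip(sample, name):.0%}")
```

A player whose VPIP sits far above the rest of the table is playing far more hands than the others, which is the kind of pattern opponents can adjust to.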
It’s also worth keeping in mind that 3,799 hands is not even close to a significant sample size in poker. LLMs play poker very slowly, explaining their entire “thought process” in words instead of just spitting out an action.
A human player multi-tabling could get through that many hands in as little as one day. Results over a sample size of just a few thousand hands will exhibit a high degree of variance and may not reflect long-term performance.
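To put a rough number on that variance, the sketch below estimates the margin of error on a win rate after a given number of hands. It assumes a standard deviation of 100 bb per 100 hands, a common rule of thumb for no-limit hold’em that was not measured from this contest, and it treats hands as independent. Both are simplifications, so the output is a ballpark figure at best.

```python
import math

# Assumption: rule-of-thumb standard deviation, not measured from this contest.
ASSUMED_STD_BB_PER_100 = 100.0


def standard_error(hands, std_per_100=ASSUMED_STD_BB_PER_100):
    """Standard error of a win rate in bb/100 after `hands` hands."""
    return std_per_100 / math.sqrt(hands / 100)


def margin_95(hands, std_per_100=ASSUMED_STD_BB_PER_100):
    """Approximate 95% margin of error (normal approximation)."""
    return 1.96 * standard_error(hands, std_per_100)


if __name__ == "__main__":
    for hands in (3_799, 100_000):
        print(f"{hands:>7,} hands: +/- {margin_95(hands):.1f} bb/100")
```

Under those assumptions, 3,799 hands leaves a margin of roughly ±32 bb/100. The $3,050 gap between first and second place works out to about 4 bb/100 at a $20 big blind, well inside that margin.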
When PokerScout queried several LLMs to test out how they approached a poker hand, a paid subscription to OpenAI’s ChatGPT did seem to provide the most coherent advice. In that sense, at least, the PokerBattle AI results seem to fit with first-hand, subjective evaluation.
Still, the quality of ChatGPT’s strategic reasoning wasn’t all that strong. Poker players needn’t fear a sudden influx of ChatGPT-powered opponents in their games. These LLMs remain a long way from producing poker strategy that threatens safe poker sites in the same manner as other real-time assistance (RTA) tools.






