Eight top AI systems were set the same challenge, and every one of them failed to turn a profit over a Premier League season.
General Reasoning, a London-based startup, ran these models through what they call KellyBench: a rigorous simulation where each AI received the same historical Premier League data from 2023-24 and was tasked with optimizing bets using the Kelly criterion, a mathematical framework for sizing wagers based on perceived edge. The eight frontier models tested included systems from Google, OpenAI, Anthropic, and xAI. None returned a profit. Most lost money at rates that would have wiped out a bettor within weeks.
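For readers unfamiliar with the framework, the Kelly criterion reduces to a short formula: stake the fraction of your bankroll given by (bp - q) / b, where b is the net payout per unit staked, p is your estimated win probability, and q = 1 - p. The minimal Python sketch below shows the idea for a single binary bet; the function name and the example probability and odds are illustrative, not values from the study.

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll for a single binary bet.

    p            -- the bettor's estimated probability of winning
    decimal_odds -- bookmaker decimal odds (total payout per unit staked)
    """
    b = decimal_odds - 1.0   # net profit per unit staked if the bet wins
    q = 1.0 - p              # probability of losing
    f = (b * p - q) / b      # classic Kelly formula: f* = (bp - q) / b
    return max(f, 0.0)       # never stake anything without a positive edge

# Example: the bettor believes a home win is 50% likely, but odds of 2.50 imply only 40%.
print(kelly_fraction(p=0.50, decimal_odds=2.50))  # ~0.167, i.e. stake ~16.7% of bankroll
```

The formula is aggressive by design: overestimate your edge even slightly and it tells you to bet heavily on that mistake, which is exactly how a miscalibrated model burns through a bankroll.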
The methodology was not simplistic. Each model received detailed historical performance data, injury reports, home-and-away splits, and head-to-head records—everything a sophisticated sports bettor would analyze. They were then instructed to build prediction models and manage their bankroll according to mathematical risk principles. The bar was set high: beat the market odds that already reflect millions of informed wagers.
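Concretely, "beating the market" means the model's probability for an outcome must exceed the probability already implied by the bookmaker's price, by enough to cover the bookmaker's margin. A sketch of that check follows; the safety-margin parameter and the example numbers are hypothetical choices for illustration, not thresholds from the study.

```python
def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds, ignoring the bookmaker's margin."""
    return 1.0 / decimal_odds

def has_edge(model_probability: float, decimal_odds: float, margin: float = 0.02) -> bool:
    """Bet only if the model's probability beats the market by a safety margin."""
    return model_probability > implied_probability(decimal_odds) + margin

# A model that gives a draw a 38% chance, priced at 3.40 (~29.4% implied), sees an edge.
print(has_edge(model_probability=0.38, decimal_odds=3.40))  # True
```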
xAI's Grok 3 performed worst, losing at a rate that outpaced even random betting. This finding cuts against the narrative that newer, larger models inherently solve harder problems. Grok lost money faster than models released years earlier. General Reasoning's team suggests one explanation: newer models may have absorbed more noise from sports media narratives, which are notoriously prone to overreaction and recency bias.
The implications extend beyond soccer. Premier League betting represents one of the cleanest prediction problems in existence: clear outcomes, massive historical datasets, and liquid markets that continuously price in new information. If these models cannot reliably extract value here, the gap between "impressive demo" and "useful tool" deserves scrutiny. The study suggests that current AI architectures struggle with temporal reasoning: tracking how team strength evolves across a season, accounting for momentum shifts, and updating beliefs in response to new information. These are precisely the skills that make humans dangerous in prediction markets.
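The study does not prescribe how that tracking should be done, but one conventional baseline for evolving team strength is an Elo-style rating that is nudged after every result. The sketch below shows a single update; the k-factor, home-advantage bonus, and ratings are illustrative defaults, not figures from the study.

```python
def update_rating(rating: float, opponent_rating: float, score: float,
                  k: float = 20.0, home_advantage: float = 60.0) -> float:
    """One Elo-style update of a home team's strength after a match.

    score -- 1.0 for a win, 0.5 for a draw, 0.0 for a loss
    k     -- how quickly the rating reacts to a new result
    """
    # Expected score for the home side, with a rating bonus for playing at home.
    expected = 1.0 / (1.0 + 10 ** ((opponent_rating - (rating + home_advantage)) / 400.0))
    return rating + k * (score - expected)

# A 1500-rated home side that only draws with a 1600-rated visitor gains about a point.
print(update_rating(1500.0, 1600.0, score=0.5))  # ~1501.1
```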
The study also raises questions about what benchmarks measure. State-of-the-art models post near-perfect scores on benchmarks like MMLU and HumanEval, tasks that are well-defined, static, and independently verifiable. Real-world prediction is messier: the environment changes while you're reasoning about it, and the cost of being wrong compounds over time. KellyBench offers a counterweight to benchmarks that reward pattern-matching rather than genuine forecasting ability.
For the AI industry, the message is uncomfortable. Chasing parameter counts and training compute may improve performance on tasks where scale helps. But building systems that reliably outperform human judgment on continuous, high-stakes problems remains unsolved. The Premier League study suggests we are still early.