Research Synthesized from 1 source

Gemini Beat Pokemon in 7 Months; Humans Finish in Hours

Key Points

  • Gemini 2.5 Pro spent 7 months on Pokemon Blue; humans finish in under 20 hours
  • Coding's immediate feedback loops don't transfer to video game learning
  • AlphaZero required full retraining for each game despite structural similarities
  • LLMs beat Pokemon only by memorizing human walkthroughs, not general reasoning
  • External scaffolding required to bridge text reasoning to game inputs
References (1)
  1. LLMs still fail at video games despite coding wins — IEEE Spectrum

In May 2025, Gemini 2.5 Pro became the first LLM to complete Pokemon Blue. The journey took seven months. A human child typically finishes the same game in under twenty hours. This gap exposes a fundamental flaw in the assumption that AI's coding breakthroughs would cascade into other cognitive domains.

Julian Togelius, director of New York University's Game Innovation Lab, has spent years studying exactly how AI fails at games. His research, together with a recent paper examining the broader state of LLM capabilities, delivers a blunt verdict: coding success and game-playing ability are not the same problem. "Coding is extremely well-behaved," Togelius told IEEE Spectrum. "You have tasks. These are like levels. You get a specification, you write code, and then you run it. The reward is immediate and granular." Video games offer no such luxury. They demand learning through play—discovering mechanics through trial, error, and exploration that current architectures cannot replicate.
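The "immediate and granular" point can be made concrete with a toy contrast: a coding agent gets a per-test verdict after every attempt, while a game episode may return a single sparse reward after thousands of actions. A minimal sketch, assuming hypothetical names throughout (`coding_feedback`, `game_feedback` are illustrative, not from any benchmark in the article):

```python
# Toy contrast between the two feedback regimes Togelius describes.
# Coding: every attempt gets an immediate, granular verdict from tests.
# Games: a long episode may yield one sparse reward at the very end.
# (All names here are illustrative assumptions.)

def coding_feedback(candidate_fn) -> float:
    """Dense signal: fraction of unit tests passed, available instantly."""
    tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
    passed = sum(candidate_fn(*args) == expected for args, expected in tests)
    return passed / len(tests)

def game_feedback(actions) -> float:
    """Sparse signal: one reward after the whole episode, no per-step credit."""
    # Stand-in for "did the long action sequence beat the game?"
    return 1.0 if actions and actions[-1] == "defeat_champion" else 0.0

print(coding_feedback(lambda a, b: a + b))         # dense: 1.0, every test graded
print(game_feedback(["walk"] * 10000 + ["lose"]))  # sparse: 0.0, which step was wrong?
```

The sparse case is the credit-assignment problem in miniature: a zero at the end of ten thousand actions says nothing about which action to change.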

The distinction matters because it challenges a prevailing industry narrative. Companies have pitched AI agents capable of navigating software interfaces, executing multi-step tasks, and "learning" from failure. If an LLM can write production code, the thinking goes, surely it can navigate a digital environment. Togelius's data says otherwise. When his lab tests LLMs against game-playing benchmarks, the models consistently stumble on spatial reasoning, long-horizon planning, and adaptive strategy—skills coding doesn't require in the same way.

Part of the problem is architectural. Chess and Go, where AI has excelled, are well-defined state spaces with enumerable legal moves. Video games are messier. They layer physics, narrative, inventory systems, and emergent mechanics on top of one another. Each game essentially requires its own learning regime. DeepMind's AlphaZero, often cited as evidence of general game-playing AI, actually required full retraining for each game it learned—Go, chess, and shogi—despite their structural similarities. For genuinely different games, the engineering demands multiply.
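What "well-defined state space with legal moves" buys you can be shown with a toy game. A minimal sketch using Nim (the class and function names are illustrative, not from the article): because the state and move set are fully specified, plain exhaustive search solves the game—exactly the structure that video games, with their physics and emergent mechanics, do not expose.

```python
# Toy illustration of a "well-behaved" game: explicit state, enumerable
# legal moves, immediate terminal check. Exhaustive minimax works only
# because the state space is fully specified -- the property video games
# lack. (NimState and these function names are illustrative.)
from dataclasses import dataclass
from functools import lru_cache

@dataclass(frozen=True)
class NimState:
    """Nim: take 1-3 stones per turn; taking the last stone wins."""
    stones: int

    def legal_moves(self):
        return [take for take in (1, 2, 3) if take <= self.stones]

    def apply(self, take):
        return NimState(self.stones - take)

    def is_terminal(self):
        return self.stones == 0

@lru_cache(maxsize=None)
def wins(state: NimState) -> bool:
    """True if the player to move can force a win (plain minimax)."""
    if state.is_terminal():
        return False  # previous player took the last stone and won
    return any(not wins(state.apply(m)) for m in state.legal_moves())

# Multiples of 4 are losing positions for the player to move.
print(wins(NimState(4)))   # False
print(wins(NimState(5)))   # True
```

AlphaZero's games sit at a vastly larger scale, but they share this shape: the search only ever needs `legal_moves` and `apply`. A game with hidden inventory state, real-time physics, and narrative triggers offers no such interface.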

There's also a data problem that benchmarks don't solve. Popular games like Minecraft and Pokemon have accumulated millions of hours of human-generated guides, walkthroughs, and forum discussions. LLMs trained on this data inherit a ghost of human play. A lesser-known indie title offers almost nothing. When a model encounters a game without an established solution corpus, it flounders in ways that reveal the poverty of its "understanding." Beating Pokemon Blue wasn't evidence of general capability—it was evidence of a model that had memorized human solutions to Pokemon Blue.

The industry has tried to paper over this limitation with scaffolding. Gemini 2.5 Pro required custom software to interact with Pokemon Blue at all—translation layers that bridged the gap between the model's text-based reasoning and the game's button-input environment. This isn't playing a game. It's having a conversation about playing a game while an external system handles the actual inputs. Togelius puts it plainly: "We do not have general game AI."
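The shape of that scaffolding can be sketched, hypothetically: the model reasons in free-form text, and an external harness parses that text into the game's button presses. Everything here (`parse_action`, `VALID_BUTTONS`, the emulator interface) is an assumption for illustration, not the actual Gemini harness.

```python
# Hypothetical sketch of the kind of scaffolding described above: the model
# "plays" by emitting text, and an external translation layer -- not the
# model -- actually touches the game's button inputs.
# (All names here are illustrative assumptions.)
import re

VALID_BUTTONS = {"UP", "DOWN", "LEFT", "RIGHT", "A", "B", "START", "SELECT"}

def parse_action(model_output: str) -> list:
    """Extract button presses from free-form model text,
    e.g. 'I should exit the house: PRESS DOWN, then PRESS A.'"""
    presses = re.findall(r"PRESS\s+([A-Z]+)", model_output)
    return [b for b in presses if b in VALID_BUTTONS]

def run_step(model_output: str, emulator) -> None:
    # The harness performs the inputs; the model only talked about them.
    for button in parse_action(model_output):
        emulator.press(button)

print(parse_action("I should exit the house: PRESS DOWN, then PRESS A."))
```

The division of labor is the point: the reasoning lives in text, the actual play lives in the harness, and the two meet only through a brittle parsing convention.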

This doesn't mean the research is worthless. Game environments remain valuable test beds precisely because they stress AI in ways that coding benchmarks don't. But companies hyping "agentic" AI systems should be clear about what those systems can and cannot do. A model that writes correct Python is not necessarily a model that can navigate your operating system, book your travel, or complete a task that requires real-time adaptation. The gap between syntax and agency remains vast.

For now, the seven-month Pokemon run stands as a useful benchmark for the gap between AI rhetoric and AI reality.
