Anthropic released Claude Opus 4.7 on Thursday, calling it the company's most powerful "generally available" model to date. Twelve hours later, a developer ran the pelican-on-a-bicycle benchmark through a 35-billion-parameter model on his laptop, and the open-weight contender drew the pelican better than Opus 4.7 could.
That contradiction sits at the heart of a shift the AI industry is only beginning to reckon with. The "most powerful model" title is becoming a moving target, and the direction of movement favors openness.
Opus 4.7's Improvements
To be clear, Opus 4.7 is not a minor update. Anthropic claims meaningful gains in advanced software engineering, complex coding tasks, image analysis, and creative work like generating slides and documents. The company specifically highlights reduced "hand-holding"—fewer iterations needed to get acceptable outputs on challenging engineering problems. For enterprise customers running Anthropic's API, these are concrete productivity gains worth noticing.
But Anthropic itself has already acknowledged the limits of the "most powerful" framing. Earlier this month, the company announced Mythos Preview, a cybersecurity-focused model it explicitly calls its most powerful model overall. Opus 4.7 sits below that ceiling: powerful within its tier, but not the apex. Meanwhile, Qwen3.6-35B-A3B, Alibaba's open-weight release, just ran circles around Opus 4.7 on Simon Willison's now-famous pelican benchmark, generating a recognizable bicycle frame where Opus 4.7 consistently produced structural nonsense.
The pelican test is absurd. It is also instructive.
What Local Gets You
Willison ran Qwen3.6 on a MacBook Pro M5 using LM Studio, with the model quantized down to 20.9GB. No API costs. No data leaving his machine. No rate limits. The model fit on a consumer laptop and outperformed a cloud-hosted frontier model on a visual task.
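That setup is easy to reproduce: LM Studio exposes an OpenAI-compatible server on localhost (port 1234 by default), so any standard chat-completion request works against the local model. A minimal sketch of the request body, using Willison's benchmark prompt; the model identifier here is an assumption, and the exact name depends on what LM Studio reports for the loaded model:

```python
import json

# Sketch: the chat-completion payload one would POST to a local
# OpenAI-compatible server, e.g. LM Studio's default endpoint at
# http://localhost:1234/v1/chat/completions.
# The "model" value is illustrative; use the identifier LM Studio
# shows for the model you have loaded.
payload = {
    "model": "qwen3.6-35b-a3b",
    "messages": [
        {
            "role": "user",
            "content": "Generate an SVG of a pelican riding a bicycle",
        }
    ],
}

body = json.dumps(payload)
print(body)  # send with urllib.request or curl; no API key needed locally
```

No authentication, no metering: the same request shape that hits a cloud API runs unchanged against the laptop.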
This is not an isolated party trick. It represents a trajectory. Open-weight models have closed the capability gap with closed APIs faster than most predictions suggested. Qwen3.6's performance on image tasks—long considered a weakness of smaller models—suggests that architectural improvements and training data quality are doing the heavy lifting that parameter count used to do alone.
For developers, researchers, and cost-sensitive teams, this changes the calculus. "Should we use the most capable model?" is giving way to "Should we use the most capable model we can run ourselves?" The answer increasingly leans toward self-hosting.
The Definition Game
Anthropic's positioning of Opus 4.7 as "generally available" matters here. It signals reliability, enterprise support, and predictable pricing—things open-weight models still struggle to offer consistently. A researcher running experiments on a laptop and an enterprise deploying at scale face different constraints. Closed models trade flexibility for assurance.
But that trade-off is eroding. As open-weight models gain parity on task-specific benchmarks, the "most powerful" title increasingly belongs to whoever publishes the best leaderboard result on any given day. Today that's Qwen3.6 on pelicans. Tomorrow it might be something else.
The model that runs locally, privately, and cheaply is winning hearts even when it cannot yet win every contest. That is the paradox Anthropic's release inadvertently exposes: the flagship grows taller precisely as the floor of local capability rises toward it.