Anthropic released Claude Opus 4.7 on Thursday, calling it the company's most powerful "generally available" model to date. Twelve hours later, a developer ran the pelican-on-a-bicycle benchmark through a 35-billion-parameter model on his laptop, and the open-weight contender drew the pelican better than Opus 4.7 could.
That contradiction sits at the heart of a shift the AI industry is only beginning to reckon with. The "most powerful model" title is becoming a moving target, and the direction of movement favors openness.
Opus 4.7's Improvements
To be clear, Opus 4.7 is not a minor update. Anthropic claims meaningful gains in advanced software engineering, complex coding tasks, image analysis, and creative work like generating slides and documents. The company specifically highlights reduced "hand-holding"—fewer iterations needed to get acceptable outputs on challenging engineering problems. For enterprise customers running Anthropic's API, these are concrete productivity gains worth noticing.
But Anthropic itself has already acknowledged the limits of the "most powerful" framing. Earlier this month, the company announced Mythos Preview, a cybersecurity-focused model it explicitly calls its most powerful model overall. Opus 4.7 sits below that ceiling: powerful within its tier, but not the apex. Meanwhile, Qwen3.6-35B-A3B, Alibaba's open-weight release, just ran circles around Opus 4.7 on Simon Willison's now-famous pelican benchmark, generating a recognizable bicycle frame where Opus 4.7 consistently produced structural nonsense.
The pelican test is absurd. It is also instructive.
What Local Gets You
Willison ran Qwen3.6 on a MacBook Pro M5 using LM Studio, with the model quantized down to 20.9GB. No API costs. No data leaving his machine. No rate limits. The model fit on a consumer laptop and outperformed a cloud-hosted frontier model on a visual task.
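That setup is easy to reproduce: LM Studio exposes an OpenAI-compatible server on localhost (port 1234 by default), so any standard chat-completion request works against the local model. A minimal sketch of the request body, using Willison's benchmark prompt; the model identifier here is an assumption, and the exact name depends on what LM Studio reports for the loaded model:

```python
import json

# Sketch: the chat-completion payload one would POST to a local
# OpenAI-compatible server, e.g. LM Studio's default endpoint at
# http://localhost:1234/v1/chat/completions.
# The "model" value is illustrative; use the identifier LM Studio
# shows for the model you have loaded.
payload = {
    "model": "qwen3.6-35b-a3b",
    "messages": [
        {
            "role": "user",
            "content": "Generate an SVG of a pelican riding a bicycle",
        }
    ],
}

body = json.dumps(payload)
print(body)  # send with urllib.request or curl; no API key needed locally
```

No authentication, no metering: the same request shape that hits a cloud API runs unchanged against the laptop.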
This is not an isolated party trick. It represents a trajectory. Open-weight models have closed the capability gap with closed APIs faster than most predictions suggested. Qwen3.6's performance on image tasks—long considered a weakness of smaller models—suggests that architectural improvements and training data quality are doing the heavy lifting that parameter count used to do alone.
For developers, researchers, and cost-sensitive teams, this changes the calculus. "Should we use the most capable model?" is giving way to "Should we use the most capable model we can run ourselves?" The answer increasingly leans toward self-hosting.
The Definition Game
Anthropic's positioning of Opus 4.7 as "generally available" matters here. It signals reliability, enterprise support, and predictable pricing—things open-weight models still struggle to offer consistently. A researcher running experiments on a laptop and an enterprise deploying at scale face different constraints. Closed models trade flexibility for assurance.
But that trade-off is eroding. As open-weight models gain parity on task-specific benchmarks, the "most powerful" title increasingly belongs to whoever publishes the best leaderboard result on any given day. Today that's Qwen3.6 on pelicans. Tomorrow it might be something else.
The model that runs locally, privately, and cheaply is winning hearts even when it cannot yet win every contest. That is the paradox Anthropic's release inadvertently exposes: the flagship grows taller precisely as the floor of local capability rises toward it.