
13B Models Beat Giants When Constrained by Formal State Machines

Key Points

• Statewright constrains AI agents via formal state machines, not prompts
• Constrained 13-20B models outperform their unconstrained counterparts on SWE-bench
• Haiku, Sonnet, and Opus all show improvements with constrained tool access
• Context utilization beats raw context size in reliability testing
• Rust engine handles orchestration with zero LLM involvement
• MCP integration supports Claude Code; Codex and Cursor support is in development

References (1)

[1] "Statewright uses formal state machines to make AI agents reliable", Hacker News AI

A junior developer watches their Claude Code setup fail for the third time today. The model reaches into the wrong directory, runs a test suite it shouldn't touch, and spirals into a loop of increasingly irrelevant commands. This is a common failure mode, not an edge case. AI agents that work brilliantly in demos often fail badly in production, and the industry's standard response has been to throw more compute at the problem. Statewright, a new open-source project from former NVIDIA and AMD Distinguished Engineer Ben Cochran, suggests the real solution lies in the opposite direction: make the problem smaller by constraining what models can do.

Cochran spent twenty years in hardware verification, where teams use formal methods—mathematically rigorous techniques for proving circuits behave correctly—before any chip ever leaves the fab. His insight was direct: AI agents fail because their solution spaces are unbounded, and the industry's response has been to brute-force reliability through massive parameter counts and longer context windows. What if the same formal constraints used to verify hardware could tame agentic chaos?

Statewright replaces prompt engineering with protocol enforcement. Its core is a Rust engine that evaluates state machine definitions (states, transitions, guards, and tool restrictions) with zero LLM involvement in orchestration. A planning state grants only read-only tools. An implementation state unlocks edit tools scoped to prevent runaway changes, plus constrained bash commands. A testing state permits only test execution. The model cannot skip steps or call tools out of phase, because the restriction is enforced by deterministic code rather than by suggestions in a system prompt.
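The phase structure described above can be sketched as a small state machine. This is a minimal Python illustration, not Statewright's Rust engine or its definition format: the state names, tool names, guard conditions, and the `Workflow` class are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

# Hypothetical schema: every name below is illustrative, not Statewright's API.


@dataclass(frozen=True)
class State:
    name: str
    allowed_tools: FrozenSet[str]


@dataclass
class Workflow:
    states: Dict[str, State]
    # Maps (source, target) to a guard evaluated over a plain context dict.
    transitions: Dict[Tuple[str, str], Callable[[dict], bool]]
    current: str

    def can_use(self, tool: str) -> bool:
        """A tool is usable only if the current state lists it."""
        return tool in self.states[self.current].allowed_tools

    def advance(self, target: str, context: dict) -> bool:
        """Move to target only if the transition exists and its guard passes."""
        guard = self.transitions.get((self.current, target))
        if guard is None or not guard(context):
            return False
        self.current = target
        return True


def build_workflow() -> Workflow:
    states = {
        "planning": State("planning", frozenset({"read_file", "search"})),
        "implementation": State(
            "implementation", frozenset({"read_file", "edit_file", "bash"})
        ),
        "testing": State("testing", frozenset({"run_tests"})),
    }
    transitions = {
        # Assumed guards: a plan must exist before editing, and at least
        # one file must have changed before tests run.
        ("planning", "implementation"): lambda ctx: bool(ctx.get("plan")),
        ("implementation", "testing"): lambda ctx: ctx.get("files_changed", 0) > 0,
        ("testing", "implementation"): lambda ctx: ctx.get("tests_passed") is False,
    }
    return Workflow(states, transitions, current="planning")


if __name__ == "__main__":
    wf = build_workflow()
    print(wf.can_use("edit_file"))              # False: planning is read-only
    print(wf.advance("testing", {"plan": "x"}))  # False: no planning -> testing edge
    print(wf.advance("implementation", {"plan": "fix off-by-one"}))  # True
    print(wf.can_use("edit_file"))              # True
```

No model output is consulted in `can_use` or `advance`; both are plain lookups and predicate calls, which is the property the article attributes to deterministic orchestration.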

The results on SWE-bench surprised even Cochran. Constrained models in the 13-20B parameter range consistently solve problems more reliably than their unconstrained counterparts, and the gains are not limited to small models: Anthropic's Haiku, Sonnet, and Opus all improve with constrained tool access. Sonnet and Opus generate fewer tokens while achieving higher success rates and fewer death spirals. The critical finding is that context window utilization matters more than raw context size. A tightly scoped working context at each step outperforms a model given unrestricted access to everything.

Below the 13B threshold, models can navigate the state machine but can't retain enough context to produce accurate edits. Above it, constrained models consistently punch above their weight. Fine-tuning alone did not yield comparable functional improvements in Cochran's testing, which suggests that deterministic constraints did more for reliability than additional training.

Statewright integrates with Claude Code via MCP, with Codex and Cursor support in development. When a workflow activates, hooks enforce guardrails automatically. The model sees five tools instead of dozens, receives explicit instructions for the current phase, and is told when it attempts an out-of-scope operation. This plugin layer hides the complexity, making formal state machines accessible without rewriting agent logic from scratch.
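The guardrail behavior described above (a reduced tool list, per-phase instructions, and an explicit message on out-of-scope calls) might look roughly like the following. This is a hedged sketch in plain Python, not the MCP protocol, the Claude Code hook interface, or Statewright's plugin code. The `PHASES` table, `visible_tools`, and `guarded_call` are hypothetical names invented for the example.

```python
from typing import Callable, Dict, List

# Hypothetical phase table: tool names and instruction text are assumptions.
PHASES: Dict[str, dict] = {
    "planning": {
        "tools": ["read_file", "search"],
        "instructions": "Read the code and write a plan. Do not edit files.",
    },
    "testing": {
        "tools": ["run_tests"],
        "instructions": "Run the test suite and report the results.",
    },
}


def visible_tools(phase: str) -> List[str]:
    """Return only the tools the model should see in this phase."""
    return list(PHASES[phase]["tools"])


def guarded_call(
    phase: str, tool: str, handlers: Dict[str, Callable], **kwargs
) -> dict:
    """Run a tool if the phase allows it; otherwise return a rejection
    message the model can read instead of silently failing."""
    allowed = PHASES[phase]["tools"]
    if tool not in allowed:
        return {
            "ok": False,
            "message": (
                f"'{tool}' is out of scope in the '{phase}' phase. "
                f"Allowed tools: {', '.join(allowed)}. "
                f"{PHASES[phase]['instructions']}"
            ),
        }
    return {"ok": True, "result": handlers[tool](**kwargs)}


if __name__ == "__main__":
    # Stand-in handlers; a real integration would call actual tools.
    handlers = {
        "read_file": lambda path: f"contents of {path}",
        "edit_file": lambda path: f"edited {path}",
    }
    print(visible_tools("planning"))
    print(guarded_call("planning", "edit_file", handlers, path="app.py"))
    print(guarded_call("planning", "read_file", handlers, path="app.py"))
```

Returning a structured rejection rather than raising an error is a design choice made for this sketch: it gives the model text it can act on, which matches the article's description of the model being told when it steps out of scope.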

The broader implication is uncomfortable for an industry built on "bigger is better." If formal constraints can make 13B models match the reliability of 70B+ systems, the economic calculus changes entirely. Smaller models are cheaper to run, faster at inference, and, under the right constraints, more trustworthy. Cochran's approach suggests that AI agent reliability doesn't require the next frontier model. It requires developers willing to build with explicit boundaries instead of hoping language models self-correct.

Whether the industry will embrace formal methods over frontier-scale brute force remains the defining question. Statewright's SWE-bench results suggest the answer should be obvious.