What does it actually take for an AI agent to hold down a job? Not pass a demo, not complete a single task in a controlled setting—but reliably execute the stream of judgment calls, context switches, and mid-process corrections that define real work. VAKRA, a new analysis from Hugging Face researchers, offers the most systematic answer yet: agents fail in predictable ways, and we can finally measure exactly where they break down.
The study tested agent performance across 47 professional scenarios, from calendar management and meeting coordination to multi-step software workflows and customer support triage. The results reveal consistent failure patterns—not random errors, but structured breakdowns that recur regardless of the underlying model.
VAKRA's taxonomy documents five core failure modes. Agents struggle most with context retention over extended conversations, frequently losing track of earlier instructions. They exhibit intent misalignment, acting on surface-level cues rather than underlying user goals. Tool orchestration failures appear when agents either over-rely on available tools or fail to use them when necessary. Error compounding occurs when small mistakes cascade into larger failures. Finally, agents demonstrate recovery blindness—they often cannot recognize when they have reached the limit of their capabilities.
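To make the taxonomy concrete, here is a minimal sketch of how an evaluation pipeline might represent the five modes and tally which one dominates in a batch of annotated failures. The names, helper function, and sample data are illustrative assumptions, not part of VAKRA itself:

```python
from collections import Counter
from enum import Enum

class FailureMode(Enum):
    """The five failure modes from the taxonomy (identifiers are illustrative)."""
    CONTEXT_RETENTION = "context_retention"      # loses earlier instructions
    INTENT_MISALIGNMENT = "intent_misalignment"  # acts on surface cues, not goals
    TOOL_ORCHESTRATION = "tool_orchestration"    # over- or under-uses tools
    ERROR_COMPOUNDING = "error_compounding"      # small mistakes cascade
    RECOVERY_BLINDNESS = "recovery_blindness"    # fails to notice it is stuck

def dominant_mode(annotations: list[FailureMode]) -> tuple[FailureMode, int]:
    """Return the most frequent failure mode and its count."""
    mode, n = Counter(annotations).most_common(1)[0]
    return mode, n

# Hypothetical batch of annotated agent failures from one application:
observed = [
    FailureMode.CONTEXT_RETENTION,
    FailureMode.CONTEXT_RETENTION,
    FailureMode.TOOL_ORCHESTRATION,
    FailureMode.ERROR_COMPOUNDING,
]
mode, n = dominant_mode(observed)  # context retention dominates this batch
```

The point of an enumeration like this is that failure labels become comparable across models and scenarios, rather than free-text error notes.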
"The pattern is consistent: agents struggle to generalize from what they've learned," the researchers note. This gap between benchmark performance and job readiness defines the current frontier of agent development.
The significance extends beyond cataloging problems. VAKRA provides a framework for measurement. For the first time, researchers and developers have a structured way to track whether improvements in agent reliability are real or illusory. The taxonomy offers a diagnostic map: which failure mode dominates in a given application, and which interventions actually reduce error rates.
This matters because the gap between demo and deployment has defined the agent landscape. VAKRA does not close that gap. But it provides the measurement infrastructure to determine whether anyone is actually closing it.