Dev Tools Synthesized from 2 sources

7 AI Agents Handle Testing, Triage, and Bug Fixes Without Human Scripting

Key Points

  • Docker's Fleet runs 7 AI agent roles autonomously in CI and local terminals
  • Skills defined as role descriptions, not scripts—agents exercise judgment
  • Same skill file runs locally first (seconds/iteration) before CI deployment
  • Isolation via microVM sandboxes: own daemon, network, filesystem per agent
  • Adam CAD reports spatial reasoning jumps in GPT 5.5 and Opus 4.7
  • Human role shifts from executor to auditor of agent personas
References (2)
  [1] Docker Team Uses Seven AI Agents to Test and Release Products (Docker Blog)
  [2] Adam AI CAD harness integrates with Onshape and Fusion 360 (Hacker News AI)

Seven AI agents now autonomously test products, triage issues, post release notes, and fix bugs in CI for Docker's Coding Agent Sandboxes (sbx) team. The team calls the setup the Fleet, and it shows where AI development tools are heading: autonomous pipelines that humans audit rather than operate.

The pain point is familiar to any engineering team that has scaled. Docker's sbx team needed to test across macOS, Linux, and Windows for every release, catch resource leaks under sustained load, maintain visibility into what shipped, and triage a growing issue backlog. They could have written traditional test scripts and reporting tools. Instead, they built agent roles that handle these tasks themselves, both on developer laptops and in CI.

The Fleet runs on Claude Code skills—markdown files that define a persona, responsibilities, and permitted tools. "Think of a skill not as a script that says 'run these steps,' but as a role description that says 'you are the build engineer, here's what you know and how you make decisions,'" the team explains. That distinction matters because agents need judgment, not just instructions. When a test fails unexpectedly, a script stops. A role investigates.
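To make the "role description, not script" idea concrete, here is a minimal Python sketch that models a skill as data (persona, responsibilities, permitted tools) and renders it to markdown. The `Skill` class, the heading layout, and the tool names are illustrative assumptions; this is neither Docker's actual skill file nor the exact format Claude Code expects.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A role description: who the agent is, what it owns, what it may use."""

    name: str
    persona: str
    responsibilities: list[str] = field(default_factory=list)
    permitted_tools: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the role as markdown (hypothetical layout, for illustration)."""
        lines = [f"# {self.name}", "", self.persona, "", "## Responsibilities"]
        lines += [f"- {item}" for item in self.responsibilities]
        lines += ["", "## Permitted tools"]
        lines += [f"- {tool}" for tool in self.permitted_tools]
        return "\n".join(lines) + "\n"


# Responsibilities paraphrase what the article says the tester does;
# the tool names are placeholders, not real tool identifiers.
cli_tester = Skill(
    name="cli-tester",
    persona="You are the CLI tester. You decide what to probe and when to dig deeper.",
    responsibilities=[
        "Build the binaries",
        "Exercise CLI commands",
        "Investigate unexpected failures instead of stopping",
        "Report the issues you find",
    ],
    permitted_tools=["shell", "file-read"],
)

skill_markdown = cli_tester.to_markdown()
print(skill_markdown)
```

Note that nothing in the rendered file lists steps to run in order. It describes a role and its limits, and leaves sequencing to the agent.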

The critical design principle: every skill runs locally first. When building the /cli-tester skill, the team didn't start with a GitHub workflow. They invoked it from their terminal, watched it build binaries, exercise CLI commands, find issues, and report them. They tweaked the skill until it behaved correctly. Only then did they wire it into CI. "When the skill runs locally first, the iteration takes seconds," the team notes. "You see the agent think. You see where it gets confused. You fix the skill file, re-invoke, and try again."

CI is just another runtime for the same skill. The /cli-tester that runs nightly across macOS, Linux, and Windows is identical to the one invoked from a developer's terminal. The workflow sets up the environment, checks out the code, and calls the skill. No separate CI version. No translation layer. One skill, two runtimes.
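The "one skill, two runtimes" idea can be sketched in a few lines of Python. The step strings and the `plan` and `environment_steps` helpers are hypothetical and stand in for whatever the real workflow does. The point is only that the runtimes differ in setup, while the skill invocation itself is the same.

```python
def environment_steps(runtime: str) -> list[str]:
    """Return the setup steps a runtime needs before the skill is called."""
    if runtime == "ci":
        # Per the article, the workflow sets up the environment and checks out code.
        return ["set up environment", "check out code"]
    if runtime == "local":
        # Assumption: a developer's terminal already has a checkout and toolchain.
        return []
    raise ValueError(f"unknown runtime: {runtime!r}")


def plan(skill: str, runtime: str) -> list[str]:
    """Build the full step list: runtime-specific setup, then the shared skill call."""
    return environment_steps(runtime) + [f"invoke {skill}"]


local_plan = plan("/cli-tester", "local")
ci_plan = plan("/cli-tester", "ci")

# The final step is identical in both runtimes; only the setup differs.
print(local_plan)
print(ci_plan)
```

Keeping setup separate from invocation is what lets fixes made during local iteration carry over to CI without a second copy of the skill to maintain.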

Each agent runs inside one of Docker's secure microVM sandboxes, with its own Docker daemon, network, and filesystem, so it has full autonomy without touching the host system. That isolation is what lets agents act on their own judgment rather than follow a script: they can test not only whether commands succeed but also whether the product behaves correctly under unexpected conditions.

A parallel shift is visible in mechanical engineering. Adam, a startup building AI CAD integration, reports significant jumps in spatial reasoning for GPT 5.5 and Opus 4.7, which enables model-agnostic routing that picks the best frontier model for each task. Adam's harness integrates directly with Onshape and Autodesk Fusion 360, reading existing feature trees and editing them agentically: renaming features for readability, merging redundant operations, parametrizing models, or applying fillets across designs.
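Model-agnostic routing of this kind can be sketched as a lookup that picks the highest-scoring model per task type. The following Python is a minimal illustration under stated assumptions: the `route` function, the task names, the model labels, and all the scores are invented placeholders, not Adam's implementation or measured results.

```python
def route(task_kind: str, scores: dict[str, dict[str, float]]) -> str:
    """Return the model with the highest score for the given task kind."""
    candidates = {
        model: per_task[task_kind]
        for model, per_task in scores.items()
        if task_kind in per_task
    }
    if not candidates:
        raise KeyError(f"no model has a score for {task_kind!r}")
    return max(candidates, key=candidates.get)


# Placeholder scores for illustration only; these are not benchmark numbers.
SCORES = {
    "model-a": {"rename_features": 0.9, "apply_fillets": 0.6},
    "model-b": {"rename_features": 0.7, "apply_fillets": 0.8},
}

best_for_renaming = route("rename_features", SCORES)
best_for_fillets = route("apply_fillets", SCORES)
print(best_for_renaming, best_for_fillets)
```

Because the router only reads a score table, swapping in a newer model means adding a row, not rewriting the harness.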

Both cases point to the same trajectory: AI dev tools are graduating from copilots that assist human decisions to autonomous agents that execute pipelines. The human role shifts from executor to auditor—watching the agent think, catching failures, refining the role description. Whether testing software or editing CAD geometry, the workflow is identical: define the role, run it locally until it behaves, deploy it anywhere. The skill file becomes the source of truth, not the script or the workflow.

For developers, this changes the debugging mental model entirely. You no longer trace through automation logic. You debug personas. You ask: "What does my agent believe about its job?" And you fix the description until it gets it right.
