field notes

Agents Need Harnesses, Not Supervision

The leverage is in tests, scripts, policies, and feedback loops, not watching every step.

May 2026 · 4 min read

Agentic development does not become reliable because someone watches every step. It becomes reliable when the agent has a harness: tests that matter, scripts that work, policies that are available before the change, and feedback loops that turn mistakes into reusable constraints. Supervision can catch drift. A harness reduces the amount of drift that reaches a human.

The unproductive version is easy to recognize. A lead opens the agent session and watches terminal output scroll. The install script fails because the local setup needs an undocumented environment variable. The agent asks which command runs type checks. A security scan happens after the diff is already shaped. The lead pastes instructions into chat that should have existed as tools. Thirty minutes later, everyone is tired, and the agent is still waiting for the next instruction.

That is not autonomy. That is remote control with extra latency.

Supervision feels responsible because it keeps a human close to the work. It is useful in unfamiliar code, high-risk migrations, and exploratory debugging. But as the default operating model, it destroys the point of agents. If a human has to approve every command, explain every repo convention, and interpret every failure, the team has moved labor from coding into watching. The interface changed; the dependency did not.

Harnesses change the dependency. A good harness tells the agent how to start, how to check itself, where the boundaries are, and what evidence matters. It replaces "ask me before you proceed" with "run these checks, obey these constraints, and stop when the evidence conflicts." The human remains accountable, but the system carries more of the routine guidance.

Tests are the most obvious harness, but only if they are usable. A test suite that fails locally for unrelated reasons is not a harness; it is fog. A command that works only on one engineer's laptop is not a harness; it is tribal knowledge. Agents need commands that are discoverable, deterministic enough to trust, and scoped enough to run during the task. If the only verification path is "ask Sam which tests matter," the team has not built an autonomous workflow.

Scripts matter for the same reason. Repos often contain a dozen commands that almost work: one for CI, one for local development, one old path in a README, one package manager command in a Slack thread, and one script that silently skips a package. Humans learn the difference through pain. Agents pay for it through retries. A harness names the authoritative commands and makes broken commands visibly broken.
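One way to name the authoritative commands is a single registry that the agent calls by task name. The sketch below is a minimal illustration, not a prescribed tool: the task names and the placeholder commands are assumptions, and a real repo would substitute its own type-check and test invocations. An unknown task raises an error that lists the valid names, and a failing command returns its non-zero exit code instead of being silently skipped.

```python
import subprocess
import sys

# Hypothetical registry. The task names and commands are placeholders;
# a real repo would list its own type-check and test commands here.
AUTHORITATIVE_COMMANDS = {
    "typecheck": [sys.executable, "-c", "print('typecheck ok')"],
    "test": [sys.executable, "-c", "print('tests ok')"],
}


def run_task(name, commands=AUTHORITATIVE_COMMANDS):
    """Run one named command and return (exit_code, combined_output)."""
    if name not in commands:
        # Fail loudly and point at the names that are actually supported.
        known = ", ".join(sorted(commands))
        raise KeyError(f"unknown task {name!r}; authoritative tasks: {known}")
    result = subprocess.run(commands[name], capture_output=True, text=True)
    # The exit code is passed through unchanged so a broken command
    # is visibly broken to whoever (or whatever) called it.
    return result.returncode, result.stdout + result.stderr
```

An agent instructed to call only `run_task("typecheck")` and `run_task("test")` has one place to look, and a stale command from an old README simply does not exist in the registry.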

Policies belong in the harness too. Security, data access, dependency rules, logging expectations, and ownership boundaries should not arrive as comments after the patch. If agents regularly touch code that handles tokens, PII, billing, or deployment, the run should start with those constraints. The agent should know which files require extra care, which patterns are forbidden, and which checks must run before review.
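A policy that is available before the change can be as simple as a table that maps sensitive paths to notes and required checks. The following is a sketch under assumptions: the path patterns, note text, and check names are all hypothetical, and the matching uses Python's standard `fnmatch`, where `*` also matches path separators.

```python
from fnmatch import fnmatch

# Hypothetical policy table. Patterns, notes, and check names are
# illustrative only and would come from the team's own rules.
POLICIES = [
    {
        "pattern": "src/auth/*",
        "note": "Handles tokens: never log credentials; keep checks server-side.",
        "required_checks": ["test", "security-scan"],
    },
    {
        "pattern": "src/billing/*",
        "note": "Handles billing data: changes need an owner review.",
        "required_checks": ["test"],
    },
]


def constraints_for(changed_files, policies=POLICIES):
    """Return (notes, required_checks) for the files a task will touch."""
    notes = []
    checks = set()
    for policy in policies:
        if any(fnmatch(path, policy["pattern"]) for path in changed_files):
            notes.append(policy["note"])
            checks.update(policy["required_checks"])
    return notes, sorted(checks)
```

Calling `constraints_for` at the start of a run, with the files the task is expected to touch, puts the constraints in front of the agent before the diff takes shape rather than in a review comment afterward.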

This is where many teams confuse monitoring with control. Watching the terminal is monitoring. It can tell you something went wrong. It does not make the next run better unless the lesson becomes part of the system. If the agent fails because setup was brittle, fix the setup script or document the required environment where the agent can read it. If it uses the wrong command, expose the right command. If it misses a security rule, encode the rule closer to the run.
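For the brittle-setup case, one possible fix is a preflight check that names every required environment variable and explains why it is needed. This is a minimal sketch; the variable names and descriptions are assumptions chosen for illustration.

```python
import os

# Hypothetical requirements. The variable names and reasons are
# placeholders for whatever the local setup actually depends on.
REQUIRED_ENV = {
    "DATABASE_URL": "connection string for the local test database",
    "API_BASE_URL": "base URL of the service the integration tests call",
}


def missing_env(required=REQUIRED_ENV, environ=None):
    """List required variables that are unset or empty, with the reason each is needed."""
    environ = os.environ if environ is None else environ
    return [
        f"{name}: {why}"
        for name, why in required.items()
        if not environ.get(name)
    ]
```

Run at the top of the setup script, a non-empty result turns an undocumented failure into a message the agent can read and act on without asking anyone.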

Feedback loops are the part that makes harnesses compound. Every failed run should leave behind a sharper boundary, a better command, a more specific test, or a clearer task template. Otherwise the organization pays for the same lesson again. The goal is not to prevent all mistakes. The goal is to prevent the same mistake from staying conversational.

Consider the difference between two teams handling the same agent failure. On one team, the agent opens a PR that passes type checks but misses an authorization guard. A reviewer catches it and writes, "Need auth here." The agent patches the file. The comment disappears into the closed PR. Two weeks later, the same thing happens elsewhere.

On the other team, the reviewer still catches the issue, but the response is different. The team adds a test fixture for the authorization path, updates the task checklist for protected routes, and gives the agent a policy note about server-side access checks. The next run has a better chance of catching the issue before review. The human judgment became infrastructure.
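The second team's fix might look something like the check below, which fails when a protected route has no authorization guard. It is a framework-free sketch: the `requires_auth` decorator, the route table, the handler names, and the protected prefixes are all hypothetical stand-ins for whatever the real application uses.

```python
def requires_auth(handler):
    """Hypothetical guard: mark the handler and reject requests with no user."""
    def guarded(request):
        if request.get("user") is None:
            return {"status": 401}
        return handler(request)
    guarded.auth_guarded = True
    return guarded


@requires_auth
def get_invoice(request):
    return {"status": 200}


def health(request):
    return {"status": 200}


# Hypothetical route table and protected prefixes.
ROUTES = {"/health": health, "/billing/invoice": get_invoice}
PROTECTED_PREFIXES = ("/billing/", "/admin/")


def unguarded_protected_routes(routes=ROUTES, prefixes=PROTECTED_PREFIXES):
    """Return protected paths whose handler lacks the auth guard marker."""
    return sorted(
        path
        for path, handler in routes.items()
        if path.startswith(prefixes)
        and not getattr(handler, "auth_guarded", False)
    )
```

A test that asserts `unguarded_protected_routes()` is empty runs in the same pass as the type checks, so a missing guard on a new protected route surfaces before a reviewer has to write "Need auth here" again.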

That is the operating principle: convert supervision into harness. When a human intervenes, ask whether the intervention was a one-off judgment or a reusable constraint. One-off judgment can stay human. Reusable constraints should move into tests, scripts, docs, templates, linters, or policy files that agents actually consume.

This also makes agents easier to trust. Trust does not come from believing the model is careful. It comes from knowing the run has rails, evidence, and stopping conditions. A lead can review the output with a clearer question: did the agent satisfy the harness, and do I agree with the tradeoffs? That is a better review posture than scanning every terminal line for signs of trouble.

The practical starting point is small. Write down the commands a good agent should run before proposing a patch. Make the local setup reproducible. Add a short policy file for sensitive areas. Turn recurring review comments into checks or task criteria. Require the agent to report evidence, not confidence.
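"Evidence, not confidence" can be made concrete with a small structured report. The sketch below assumes a hypothetical `CheckResult` record and report shape; the point is only that the summary is derived from recorded commands and exit codes, not from the agent's own description of how the run went.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical record of one check the agent ran."""
    name: str
    command: str
    exit_code: int


def evidence_report(results, required):
    """Summarize which required checks ran, which failed, and whether review can start."""
    ran = {r.name for r in results}
    missing = [name for name in required if name not in ran]
    failed = [r.name for r in results if r.exit_code != 0]
    lines = [f"{r.name}: `{r.command}` exited {r.exit_code}" for r in results]
    return {
        "ready_for_review": not missing and not failed,
        "missing": missing,
        "failed": failed,
        "lines": lines,
    }
```

A reviewer reading this report starts from what was run and what it returned, and a required check that never ran shows up as missing instead of being quietly omitted.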

Agents do not need a manager hovering over every keystroke. They need the same thing good engineering teams have always needed: fast feedback, clear contracts, reliable tools, and constraints that show up before the mistake. Build that harness, and supervision becomes the exception instead of the workflow.