field notes

The Spec Is Where Agent Cost Starts

Agent spend usually starts before the first tool call, in unclear acceptance criteria.

May 20264 min read

Agent cost has a first-order cause that teams keep treating as a footnote: the task was not shaped enough to be testable. The invoice shows up as model tokens, tool calls, CI minutes, and review time, but the run usually became expensive before the agent opened a terminal. It became expensive when a ticket said "clean this up" and nobody could say what done meant.

You can see it in a normal sprint. A technical lead drops a ticket into the queue after a messy planning call: "clean this up before launch." The agent checks out the branch, opens the same three files twice, searches for similar components, edits a helper, backs out the edit, and asks whether the behavior should stay compatible with the old path. CI fails on a snapshot that no one mentioned. A human reads the diff and realizes the issue was never about cleanup. It was about one broken state transition and a missing acceptance criterion.

That is not an agent failure. It is a spec failure with an agent-shaped receipt.

Good specs reduce search space. Bad specs push search into the most expensive part of the system: the autonomous run. When the task says "make onboarding better," the agent has to infer product intent, inspect surrounding code, guess the desired state, decide which tests matter, and discover constraints by hitting walls. Each guess has a cost. Each wall creates another tool call, another re-read, another patch, another failed test.

The better unit is not a longer ticket. The better unit is a smaller contract. A useful agent task tells the system what observable change should exist after the run, how to verify it, and what territory is out of bounds. "Clean this up" becomes "move duplicate invite validation into the shared schema, preserve the existing error strings, and prove it with the invite form tests." That sentence is not bureaucratic. It turns a vague desire into a bounded job.

Acceptance criteria are where the budget starts. If the criteria are missing, the agent spends money discovering them. It reads unrelated files because it cannot tell what matters. It retries commands because it does not know which command is authoritative. It over-edits because there is no explicit blast radius. It under-edits because it sees risk but not priority. Then a reviewer asks for the same correction the ticket could have stated.

The most expensive tickets often look harmless. They are small enough that nobody pauses to write them properly. "Fix flaky auth test." "Tighten this flow." "Make the table match the new design." The problem is not size; it is ambiguity. A narrow task with unclear criteria can waste more agent time than a larger task with clear verification. The larger task lets the agent plan. The vague task makes it perform archaeology.

CI failures are a good diagnostic. Some red builds are legitimate discoveries. Many are avoidable invoices for missing instructions. The agent updates a component but does not update the fixture. It changes a message but misses the snapshot. It runs the wrong test command because the repo has three similar scripts. The terminal log scrolls past errors that a better task could have prevented with one line: "run bunx tsc and the invite form test before proposing the diff."

The same pattern shows up in repeated file opens. When an agent re-opens the same files again and again, it is often not being careless. It is trying to reconstruct intent from code because the ticket did not carry enough intent. It is comparing names, imports, test coverage, and old commits to answer a question the spec should have answered: what behavior are we changing, and how will we know it worked?

This is why agent cost work belongs upstream. Rate limits and cheaper models matter, but they do not fix undefined work. A cheaper run that retries three times is still waste. A faster agent that ships the wrong patch is still a human review burden. The leverage is in making the task cheaper to reason about before the first tool call.

A practical spec for agent work needs four parts. First, the observable outcome: what the user, API, CLI, or reviewer should see. Second, the constraints: files, interfaces, copy, security rules, compatibility requirements, or ownership boundaries that must hold. Third, the verification path: commands, tests, screenshots, logs, or manual checks that count as evidence. Fourth, the stopping rule: what should make the agent ask for clarification instead of inventing product policy.

This does not require a heavyweight process. It requires a habit. Before assigning the task, ask whether a new engineer could tell when the work is done without reading your mind. If the answer is no, the agent will pay for that gap with exploration. Then the reviewer will pay for it again with comments. Then CI may pay for it a third time with red runs that teach nothing.

The best agent teams treat specs as cost controls. They do not write perfect documents. They write enough context to keep autonomy pointed at the work. They give the agent a harness, a boundary, and a proof target. The result is not just lower spend. It is cleaner diffs, fewer retries, and less human time spent discovering that the task was never testable.

If your agent bill feels high, start by reading the tickets attached to the expensive runs. Look for vague verbs, missing commands, unclear acceptance criteria, and review comments that repeat what should have been in the task. The savings usually begin there: not in making the agent cheaper, but in making the work legible before the agent starts.