
Your Coding Agent Doesn't Have a Model Problem — It Has a Harness Problem

10 Jun 2026

Developers keep swapping models chasing reliability gains, but the dominant signal from engineering communities in 2026 is that coding agent performance is a harness-design problem — and AgentRail is the structured harness that closes the loop.

In April 2026, Martin Fowler published a piece on harness engineering, arguing that your coding agent's reliability has less to do with which model you're running and everything to do with the constraints, feedback loops, and quality gates you wrap around it. He's right. And most teams are still running their agents without any of that.

If you've spent the past year switching between Claude, Codex, Gemini, and whatever came out this week trying to find the one that finally works reliably, this post is about why that strategy has a ceiling. The model is not the bottleneck. The scaffolding is.

The model-switching trap

A viral thread on r/AI_Agents put it bluntly: "Stop celebrating agentic workflows until you fix the 60% failure rate." That 60% figure is not about weak models. It's about absent structure. The thread links failure directly to missing structured state management and feedback loops, noting that the same models that fail in naive agentic setups perform reliably when wrapped in proper scaffolding.

This matches what engineers are reporting from production. The DEV Community's May 2026 meta-analysis of developer conversations found that the dominant theme had shifted from "which model?" to "what scaffolding?" Reliability had overtaken novelty as the primary concern. On r/LocalLLaMA, a post with around 487 upvotes documented a dramatic improvement in agent reliability attributed entirely to a "plan-first skill file forcing structured execution," with the author concluding: "Agent performance is increasingly a harness-design problem, not a weights problem."

Three harness gaps that compound

The failures tend to cluster around the same structural absences. The first is no structured issue intake. When an agent receives a vague task description, it guesses at scope. It over-reaches on some files and misses others. Tim Sylvester, writing after six months of daily agentic coding, documented this as one of six core failure modes: scope drift and file-reading avoidance, where the agent guesses rather than reads because the task description didn't constrain it to the right files.
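As a rough illustration of what structured issue intake could mean in practice, the sketch below rejects a task before it reaches the agent unless it names the files in scope and the acceptance criteria. The `StructuredIssue` type, the `intake` function, and all field names are hypothetical, invented for this example; they are not AgentRail's API or Sylvester's checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredIssue:
    """Hypothetical task record: everything the agent needs to avoid guessing."""
    title: str
    files_in_scope: tuple[str, ...]
    acceptance_criteria: tuple[str, ...]


def intake(title: str, files_in_scope: list[str], acceptance_criteria: list[str]) -> StructuredIssue:
    """Return a structured issue, or raise if the task does not constrain scope."""
    missing = []
    if not title.strip():
        missing.append("title")
    if not files_in_scope:
        missing.append("files_in_scope")
    if not acceptance_criteria:
        missing.append("acceptance_criteria")
    if missing:
        # A vague task is rejected here instead of being guessed at later.
        raise ValueError("issue is underspecified, missing: " + ", ".join(missing))
    return StructuredIssue(title.strip(), tuple(files_in_scope), tuple(acceptance_criteria))


issue = intake(
    "Fix token refresh returning 401",
    ["auth/session.py", "tests/test_session.py"],
    ["test_refresh passes", "no changes outside auth/"],
)
print(issue.files_in_scope)
```

The design choice is to fail at intake: an empty file list is treated as an error, not as permission to touch anything.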

The second gap is no CI feedback loop. Augment Code's technical analysis showed that naive agent loops have O(N^2) token costs because each call resends, and is billed for, the entire conversation history. A 20-step loop generating 1,000 tokens per step produces 210,000 cumulative input tokens, not 20,000: step k resends roughly k × 1,000 tokens, and 1,000 × (1 + 2 + ... + 20) = 210,000. SWE-bench analysis found that 39.9 to 59.7 percent of tool-result tokens are removable with zero performance loss. The agent is paying, literally, for context it doesn't need. Without a structured CI ingestion layer, that noise accumulates every turn.
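The arithmetic behind the 210,000 figure can be checked in a few lines. This is a minimal sketch under one simplifying assumption: every step adds the same number of tokens, and call k resends all k chunks accumulated so far. The function names are made up for the example.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Input tokens billed across a loop that resends the full history on every call."""
    # Call k carries k chunks of history, so the total is
    # tokens_per_step * (1 + 2 + ... + steps), which grows quadratically.
    return sum(k * tokens_per_step for k in range(1, steps + 1))


def linear_input_tokens(steps: int, tokens_per_step: int) -> int:
    """What the bill would be if each chunk were only ever sent once."""
    return steps * tokens_per_step


naive = cumulative_input_tokens(20, 1_000)
once = linear_input_tokens(20, 1_000)
print(naive, once)  # 210000 20000
```

Doubling the loop to 40 steps under the same assumption gives 820,000 tokens, roughly four times the cost for twice the work, which is the quadratic growth in action.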

The third gap is no PR guardrails. Without quality gates at submission time, errors that should have been caught at CI compound into review threads, rework cycles, and broken branches. Each gap by itself is expensive. Together they create the cascading failures that make people reach for a different model, when what they actually need is a different structure.

What a real harness looks like

The pattern that works is: structured issue intake that constrains scope before the agent starts, CI result ingestion that normalises test output into actionable signals before it re-enters the agent context, and PR submission with quality gates that catch problems at the boundary rather than after review. Sylvester's manual fix, a structured checklist prepended to every session, works because it enforces this pattern. It's also brittle, per-session, and entirely manual.
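The last two pieces of that pattern can be sketched in a few dozen lines. This is an illustration only: the `PASS`/`FAIL name: message` log format, the `CISignal` type, and the `normalise_ci` and `pr_gate` functions are all assumptions made for the example, not the output of any particular CI system and not AgentRail's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CISignal:
    """One actionable failure extracted from raw CI output."""
    test: str
    message: str


def normalise_ci(raw_log: str) -> list[CISignal]:
    """Keep only failing tests; passing lines and summary noise never reach the agent."""
    signals = []
    for line in raw_log.splitlines():
        if line.startswith("FAIL "):
            test, _, message = line[len("FAIL "):].partition(": ")
            signals.append(CISignal(test.strip(), message.strip()))
    return signals


def pr_gate(changed_files: list[str], files_in_scope: list[str], signals: list[CISignal]) -> list[str]:
    """Return the reasons a PR should be blocked; an empty list means it may be submitted."""
    problems = []
    out_of_scope = sorted(set(changed_files) - set(files_in_scope))
    if out_of_scope:
        problems.append("files outside issue scope: " + ", ".join(out_of_scope))
    if signals:
        failing = ", ".join(s.test for s in signals)
        problems.append(f"{len(signals)} failing test(s): {failing}")
    return problems


raw_log = """\
PASS test_login
PASS test_logout
FAIL test_refresh: expected 200, got 401
collected 3 items in 0.42s
"""
signals = normalise_ci(raw_log)
problems = pr_gate(["auth/session.py", "README.md"], ["auth/session.py"], signals)
for problem in problems:
    print(problem)
```

Here four log lines shrink to one signal before re-entering the agent's context, and the gate blocks the PR for two reasons: a file outside the issue's scope and a failing test.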

AgentRail automates the harness as a control plane. One structured API covers the full dev loop: issue intake, routing to the right agent (Claude Code, Codex, Cursor), PR submission, CI feedback, review, shipping. The 47% reduction in total tokens and 93% reduction in reasoning tokens are downstream effects of the harness, not of any change to the underlying model. The model runs cleaner when the structure around it is doing its job.

Local-first and source-available. You can be up in two commands: npm install -g @agentrail-core/cli && agentrail init. More at https://agentrail.app.