Run-set 5.
Three coding agents — Cursor Composer 2.5, Claude Opus 4.7 (via Claude Code), and GPT-5.5 (via Codex) — on five real software-engineering tasks, full issue-to-ship lifecycle, two lanes per task. All totals below are summed across the run-set.
Five real Ansible tasks.
Drawn from the SWE-bench-Pro hard set against the ansible/ansible repository. Each task runs the full 14-phase lifecycle, which spans issue intake → plan → implement → CI → review → ship.
Control vs AgentRail.
Both lanes share the same model, the same repository state, and the same acceptance criteria. They differ only in which phases the agent runs. This mirrors the deployment decision a team makes in production; it is not a controlled lab study.
Every metric, every cell.
3 runners × 2 lanes
Control vs AgentRail, per metric.
ink = control · green = agentrail · normalised per row
Cursor Composer 2.5
Cursor · native agent
−60.0% non-cache tokens
Claude Opus 4.7
Anthropic · via Claude Code
−21.3% dollar cost
GPT-5.5
OpenAI · via Codex
−21.0% dollar cost
Why the deltas differ.
What each number means.
- non-cache tokens: Tokens billed at the model's full input rate, i.e. not served from the provider's prompt cache. This is the cleanest signal of net new work the model had to do, and it is where AgentRail's lifecycle routing has the largest effect.
- cache-inclusive tokens: Every token the agent sent or received, including cache reads. Absolute numbers are larger and percentage deltas are smaller; the metric is context-window-bound, not cost-bound.
- cache creation: First-time write of a prompt prefix into the provider's cache. Priced higher than a cache read on Anthropic models; AgentRail's stable role briefs cut repeat creations.
- cache read: Reuse of a previously cached prefix. Discounted; counts toward cache-inclusive tokens but not non-cache tokens. Higher is generally better, since it means the cache is doing its job.
- output tokens: Tokens the model generated. Priced highest on most providers. A disproportionate driver of cost on GPT-5.5 and a tight proxy for agent verbosity.
- dollar cost: Provider-reported USD spend, summed across all five tasks in the run-set. For Claude Opus, computed from Anthropic's billed rates at the time of the run. For GPT-5.5, computed from OpenAI's published GPT-5-class rates applied to the captured token mix.
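The definitions above can be sketched as a small accounting model. This is a minimal illustration, not the benchmark's own code: the `Usage` and `Rates` schemas, the per-million rates, and the token counts are all hypothetical, and it assumes that "non-cache tokens" counts only full-rate input (the text does not say whether cache creation or output is included).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts for one lane, summed across a run-set (hypothetical schema)."""
    uncached_input: int  # input tokens billed at the full input rate
    cache_creation: int  # tokens written into the prompt cache for the first time
    cache_read: int      # tokens served from a previously cached prefix
    output: int          # tokens the model generated


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Placeholder values, not real provider prices."""
    uncached_input: float
    cache_creation: float
    cache_read: float
    output: float


def non_cache_tokens(u: Usage) -> int:
    # Assumption: only full-rate input counts toward this metric.
    return u.uncached_input


def cache_inclusive_tokens(u: Usage) -> int:
    # Every token sent or received, including cache reads.
    return u.uncached_input + u.cache_creation + u.cache_read + u.output


def dollar_cost(u: Usage, r: Rates) -> float:
    # Each token class is priced at its own rate, then scaled from per-million.
    total = (
        u.uncached_input * r.uncached_input
        + u.cache_creation * r.cache_creation
        + u.cache_read * r.cache_read
        + u.output * r.output
    )
    return total / 1_000_000


def pct_delta(control: float, agentrail: float) -> float:
    # Negative means the AgentRail lane used less than the control lane.
    return (agentrail - control) * 100.0 / control


# Made-up numbers, for illustration only.
rates = Rates(uncached_input=1.0, cache_creation=2.0, cache_read=0.5, output=4.0)
control = Usage(uncached_input=1000, cache_creation=200, cache_read=5000, output=300)
agentrail = Usage(uncached_input=750, cache_creation=100, cache_read=5200, output=250)

for name, metric in [
    ("non-cache tokens", non_cache_tokens),
    ("cache-inclusive tokens", cache_inclusive_tokens),
    ("dollar cost", lambda u: dollar_cost(u, rates)),
]:
    print(f"{name}: {pct_delta(metric(control), metric(agentrail)):+.1f}%")
```

With these invented counts the non-cache delta is much larger in magnitude than the cache-inclusive delta, which is the pattern the definitions describe: cache reads dominate the cache-inclusive total and dilute the percentage change.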
Run it yourself.
Every prompt, every task harness, and every run id above is in the benchmark repo. Numbers should reproduce within ±2% on a fresh sample.
$ git clone https://github.com/oxnw/agentrail
$ cd agentrail/benchmarks/agentrail-swe-lifecycle
$ npm run benchmark:phase-attribution -- --runner claude-code --count 5
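One way to check the ±2% reproduction claim after a fresh run is a relative-tolerance comparison against the published total. The sketch below assumes the tolerance is relative to the published value and applies per metric total; the function name and the sample figures are hypothetical and not part of the benchmark repo.

```python
def reproduces(published: float, reproduced: float, rel_tol: float = 0.02) -> bool:
    """Return True if `reproduced` is within rel_tol of `published`.

    Assumption: the tolerance is measured relative to the published value.
    """
    if published == 0:
        # A relative tolerance is undefined at zero; require an exact match.
        return reproduced == 0
    return abs(reproduced - published) <= rel_tol * abs(published)


# Made-up totals, for illustration only.
print(reproduces(100.0, 101.5))  # inside the 2% band
print(reproduces(100.0, 103.0))  # outside the 2% band
```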