AgentRail v0.1.0
agentrail / benchmarks / run-set 5 · 2026-05-26

Run-set 5.

Three coding agents — Cursor Composer 2.5, Claude Opus 4.7 (via Claude Code), and GPT-5.5 (via Codex) — on five real software-engineering tasks, full issue-to-ship lifecycle, two lanes per task. All totals below are summed across the run-set.

tl;dr
cursor 2.5 · −60.0% non-cache tokens · 8.80M → 3.52M
opus 4.7 · −21.3% cost · $48.93 → $38.53
gpt-5.5 · −21.0% cost · $8.02 → $6.34
01 / tasks

Five real Ansible tasks.

Drawn from the SWE-bench-Pro hard set against the ansible/ansible repository. Each task runs the full 14-phase lifecycle: issue intake → plan → implement → CI → review → ship.

| id | task | files | loc |
| --- | --- | --- | --- |
| T-01 | Password lookup + encrypt: fix subjects_alt_name handling | 5 | +143 / −89 |
| T-02 | YAML dumper: propagate filter trust through templating | 6 | +129 / −19 |
| T-03 | Linux facts: correct nproc/cpuinfo on container hosts | 3 | +73 / −1 |
| T-04 | mount_facts module: introduce structured mount inventory | 7 | +1622 / −0 |
| T-05 | Module respawn + SELinux compat shim (libselinux removal) | 13 | +649 / −304 |
02 / lanes

Control vs AgentRail.

control

The agent owns every phase: intake, triage, plan, implement, CI watch, fix CI, review, revise, ship. Tokens are charged for every prompt the agent issues across the loop.

agentrail

The agent is invoked only for the phases AgentRail's lifecycle leaves to it (predominantly implement and revise). Other phases are handled by AgentRail and incur no agent-side spend.

Both lanes share the same model, same repository state, and same acceptance criteria. They differ only in which phases the agent runs. The comparison mirrors the deployment decision a team faces in production; it is not a controlled lab study.
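The lane split can be pictured as a mapping from phase to owner. The sketch below uses the nine phases named in the control description. Treating implement and revise as the only agent-run phases in the agentrail lane is a simplification of "predominantly", and every name here is illustrative rather than AgentRail's actual configuration.

```python
PHASES = ["intake", "triage", "plan", "implement", "ci_watch",
          "fix_ci", "review", "revise", "ship"]

# Assumption: in the agentrail lane the agent runs only these two phases.
AGENT_PHASES_AGENTRAIL = {"implement", "revise"}

def owners(lane: str) -> dict[str, str]:
    """Map each phase to 'agent' or 'agentrail' for the given lane."""
    if lane == "control":
        return {p: "agent" for p in PHASES}
    if lane == "agentrail":
        return {p: "agent" if p in AGENT_PHASES_AGENTRAIL else "agentrail"
                for p in PHASES}
    raise ValueError(f"unknown lane: {lane}")

for lane in ("control", "agentrail"):
    agent_side = [p for p, o in owners(lane).items() if o == "agent"]
    print(lane, "->", agent_side)
```

Only phases mapped to "agent" incur agent-side token spend, which is the quantity the results compare.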

03 / results

Every metric, every cell.

3 runners × 2 lanes

| runner | metric | control | agentrail | Δ | Δ% |
| --- | --- | --- | --- | --- | --- |
| Cursor Composer 2.5 | non-cache tokens | 8,800,719 | 3,520,245 | −5,280,474 | −60.0% |
| Cursor Composer 2.5 | cache-inclusive tokens | 100,969,521 | 77,663,576 | −23,305,945 | −23.1% |
| Cursor Composer 2.5 | input tokens | 8,524,884 | 3,312,805 | −5,212,079 | −61.1% |
| Cursor Composer 2.5 | output tokens | 275,835 | 207,440 | −68,395 | −24.8% |
| Claude Opus 4.7 | dollar cost (USD) | $48.93 | $38.53 | −$10.40 | −21.3% |
| Claude Opus 4.7 | cache-inclusive tokens | 49,367,189 | 45,202,488 | −4,164,701 | −8.4% |
| Claude Opus 4.7 | cache creation | 2,729,630 | 1,388,016 | −1,341,614 | −49.2% |
| Claude Opus 4.7 | cache read | 46,374,169 | 43,575,956 | −2,798,213 | −6.0% |
| Claude Opus 4.7 | non-cache tokens | 263,390 | 238,516 | −24,874 | −9.4% |
| GPT-5.5 | dollar cost (USD) | ~$8.02 | ~$6.34 | −$1.68 | −21.0% |
| GPT-5.5 | total tokens | 27,392,589 | 24,228,912 | −3,163,677 | −11.5% |
| GPT-5.5 | input tokens | 27,131,547 | 24,003,102 | −3,128,445 | −11.5% |
| GPT-5.5 | cached input | 24,862,976 | 22,605,952 | −2,257,024 | −9.1% |
| GPT-5.5 | output tokens | 207,314 | 176,400 | −30,914 | −14.9% |
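The Δ and Δ% columns follow directly from the control and agentrail cells. A minimal Python sketch, using the Cursor non-cache row as input; the function name `delta` is illustrative and not part of the benchmark harness.

```python
def delta(control: float, agentrail: float) -> tuple[float, float]:
    """Return (absolute change, percentage change) relative to control."""
    diff = agentrail - control
    pct = 100.0 * diff / control
    return diff, pct

# Cursor Composer 2.5, non-cache tokens (values from the results above).
diff, pct = delta(8_800_719, 3_520_245)
print(f"{diff:+,.0f} tokens, {pct:+.1f}%")  # -5,280,474 tokens, -60.0%
```

The same function applies to the dollar rows; percentages are always taken relative to the control lane.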
03b / paired bars

Control vs AgentRail, per metric.

ink = control · green = agentrail · normalised per row

[Paired bar charts, one panel per runner, plotting the control and agentrail values listed in 03 / results.]

01  Cursor Composer 2.5 (Cursor · native agent): −60.0% non-cache tokens, run_2026-05-25T18-15-35-314Z
02  Claude Opus 4.7 (Anthropic · via Claude Code): −21.3% dollar cost, run_phase_opus47_5_20260525_213304
03  GPT-5.5 (OpenAI · via Codex): −21.0% dollar cost, run_phase_gpt55_5_20260525_212153
04 / per-runner notes

Why the deltas differ.

Cursor
Cursor Composer 2.5

Non-cache tokens are the headline figure for Cursor — they correspond to billable activity outside the prompt-cache hit path. Composer 2.5 is plan-billed; per-token dollar cost is not directly comparable to API providers.

run_2026-05-25T18-15-35-314Z
Anthropic
Claude Opus 4.7

Cost falls faster than raw token count (−21.3% cost against −8.4% cache-inclusive tokens) because the largest cut is in cache creation (−49.2%), which Anthropic prices higher than cache reads (−6.0%). Direct dollar figures come from Anthropic's billed rates at run time.

run_phase_opus47_5_20260525_213304
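One way to see why the Claude Opus cost falls faster than its token count is to weight each token category by a relative price. The sketch below assumes Anthropic's documented prompt-caching multipliers (cache writes at 1.25× the base input rate for the default 5-minute cache, cache reads at 0.1×) and, as a simplification, treats all non-cache tokens as input-priced. It illustrates the mechanism only and does not reconstruct the billed $48.93 and $38.53.

```python
# Relative price weights. ASSUMPTIONS, not figures from this report:
# cache write = 1.25x base input, cache read = 0.1x, non-cache = 1.0x.
WEIGHTS = {"cache_creation": 1.25, "cache_read": 0.10, "non_cache": 1.00}

# Token counts for Claude Opus 4.7 from the results.
CONTROL = {"cache_creation": 2_729_630, "cache_read": 46_374_169, "non_cache": 263_390}
AGENTRAIL = {"cache_creation": 1_388_016, "cache_read": 43_575_956, "non_cache": 238_516}

def weighted(tokens: dict[str, int]) -> float:
    """Sum of token counts scaled by their relative price weight."""
    return sum(WEIGHTS[k] * n for k, n in tokens.items())

def pct_change(before: float, after: float) -> float:
    """Percentage change from before to after."""
    return 100.0 * (after - before) / before

raw = pct_change(sum(CONTROL.values()), sum(AGENTRAIL.values()))
priced = pct_change(weighted(CONTROL), weighted(AGENTRAIL))
print(f"raw tokens: {raw:.1f}%, price-weighted: {priced:.1f}%")
```

Under these assumed weights the price-weighted drop is well over twice the raw token drop, the same direction as the reported cost and token deltas.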
OpenAI
GPT-5.5

Output-token reductions matter most here — output is the priciest unit on GPT-5.5 and the one most sensitive to scaffolding overhead. Dollar cost computed at OpenAI's published GPT-5-class rates.

run_phase_gpt55_5_20260525_212153
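The GPT-5.5 dollar figures are described as published rates applied to the captured token mix. A sketch of that calculation follows, assuming cached input is a subset of input tokens (as in OpenAI's usage reporting) and taking the three per-million-token rates as parameters. The rates in the example call are placeholders chosen for round arithmetic, not OpenAI's prices, so the result is not expected to match the ~$8.02 reported for the control lane.

```python
def estimate_cost_usd(input_tokens: int, cached_input: int, output_tokens: int,
                      rate_input: float, rate_cached: float,
                      rate_output: float) -> float:
    """Estimate spend from a token mix; rates are USD per million tokens."""
    uncached = input_tokens - cached_input  # assumes cached is a subset of input
    total = (uncached * rate_input
             + cached_input * rate_cached
             + output_tokens * rate_output)
    return total / 1_000_000

# Control-lane token mix for GPT-5.5 from the results.
# The three rates are PLACEHOLDERS for illustration only.
cost = estimate_cost_usd(27_131_547, 24_862_976, 207_314,
                         rate_input=1.0, rate_cached=0.1, rate_output=10.0)
print(f"${cost:.2f} at placeholder rates")
```

Substituting the actual published rates for the placeholders is what would turn this into the report's estimate.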
05 / metric definitions

What each number means.

non-cache tokens
Tokens processed at the model's full rate, i.e. not served from the provider's prompt cache. In the Cursor rows this equals input plus output tokens. It is the cleanest signal of net new work the model had to do, and it is where AgentRail's lifecycle routing has the largest effect.
cache-inclusive tokens
Every token the agent sent or received, including cache reads. Absolute numbers are larger and percentage deltas smaller because cache reads dominate the total; the figure tracks context volume rather than cost.
cache creation
First-time write of a prompt prefix into the provider's cache. Priced higher than cache-read for Anthropic models; AgentRail's stable role briefs cut repeat creations.
cache read
Reuse of a previously cached prefix. Discounted; counts toward cache-inclusive but not non-cache. Higher is generally better — it means the cache is doing its job.
output tokens
Tokens the model generated. Priced highest on most providers. Disproportionate driver of cost on GPT-5.5 and a tight proxy for agent verbosity.
dollar cost
USD spend, summed across all five tasks in the run-set. For Claude Opus, computed from Anthropic's billed rates at the time of the run. For GPT-5.5, estimated by applying OpenAI's published GPT-5-class rates to the captured token mix, which is why those figures carry a ~ in the results.
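The definitions above imply two identities that the published totals satisfy: for Claude Opus, cache-inclusive tokens equal cache creation plus cache read plus non-cache tokens, and for Cursor, non-cache tokens equal input plus output tokens. A short Python check against the control-lane values from the results (the dictionary layout is illustrative):

```python
# Control-lane values copied from the results.
opus = {"cache_inclusive": 49_367_189, "cache_creation": 2_729_630,
        "cache_read": 46_374_169, "non_cache": 263_390}
cursor = {"non_cache": 8_800_719, "input": 8_524_884, "output": 275_835}

# Cache-inclusive = cache creation + cache read + non-cache.
opus_ok = (opus["cache_creation"] + opus["cache_read"] + opus["non_cache"]
           == opus["cache_inclusive"])

# Non-cache = input + output (as observed in the Cursor rows).
cursor_ok = cursor["input"] + cursor["output"] == cursor["non_cache"]

print(opus_ok, cursor_ok)  # True True
```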
06 / reproduce

Run it yourself.

Every prompt, every task harness, and every run id above is in the benchmark repo. Numbers should reproduce within ±2% on a fresh sample.

$ git clone https://github.com/oxnw/agentrail
$ cd agentrail/benchmarks/agentrail-swe-lifecycle
$ npm run benchmark:phase-attribution -- --runner claude-code --count 5
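To compare a fresh run against the published totals, a relative-tolerance check is enough. The sketch below assumes the ±2% is relative to the published value; the `reproduced` figure is a made-up example, not a measured result.

```python
def within_tolerance(published: float, reproduced: float, tol: float = 0.02) -> bool:
    """True if reproduced lies within ±tol (relative) of the published value."""
    return abs(reproduced - published) <= tol * abs(published)

published = 238_516   # Claude Opus 4.7 agentrail non-cache tokens (from the results)
reproduced = 241_000  # hypothetical value from a fresh run
print(within_tolerance(published, reproduced))
```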
byok · BUSL-1.1