agentrail / benchmarks / run-set 5 · 2026-05-26

Run-set 5.

Three coding agents — Cursor Composer 2.5, Claude Opus 4.7 (via Claude Code), and GPT-5.5 (via Codex) — on five real software-engineering tasks, full issue-to-ship lifecycle, two lanes per task. All totals below are summed across the run-set.

tl;dr

cursor 2.5: −60.0% non-cache tokens (8.80M → 3.52M)

opus 4.7: −21.3% cost ($48.93 → $38.53)

gpt-5.5: −21.0% cost ($8.02 → $6.34)

01 / tasks

Five real Ansible tasks.

Drawn from the SWE-bench-Pro hard set against the ansible/ansible repository. Each task runs the full 14-phase lifecycle, from issue intake through plan, implement, CI, and review to ship.

| task | description | loc (added / removed) |
|---|---|---|
| T-01 | Password lookup + encrypt: fix subjects_alt_name handling | +143 / −89 |
| T-02 | YAML dumper: propagate filter trust through templating | +129 / −19 |
| T-03 | Linux facts: correct nproc/cpuinfo on container hosts | +73 / −1 |
| T-04 | mount_facts module: introduce structured mount inventory | +1622 / −0 |
| T-05 | Module respawn + SELinux compat shim (libselinux removal) | +649 / −304 |

02 / lanes

Control vs AgentRail.

control

The agent owns every phase: intake, triage, plan, implement, CI watch, fix CI, review, revise, ship. Tokens are charged for every prompt the agent issues across the loop.

agentrail

The agent is invoked only for the phases AgentRail's lifecycle leaves to it (predominantly implement and revise). Other phases are handled by AgentRail and incur no agent-side spend.

Both lanes share the same model, same repository state, and same acceptance criteria. They differ only in which phases the agent runs — mirroring the deployment decision in production, not a controlled lab study.
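The lane difference can be pictured as two sets of agent-owned phases. The sketch below is illustrative only: the phase names come from the control-lane description above, the identifiers (`ci_watch`, `fix_ci`, `AGENT_PHASES`, `agent_side_tokens`) are hypothetical, and the agentrail lane is simplified to exactly implement and revise even though the text says "predominantly". The per-phase token counts are made up purely to exercise the function.

```python
# Phase names taken from the control-lane description; the identifiers are
# hypothetical and are not AgentRail's own.
ALL_PHASES = (
    "intake", "triage", "plan", "implement", "ci_watch",
    "fix_ci", "review", "revise", "ship",
)

# ASSUMPTION: the agentrail lane is simplified to exactly implement + revise.
AGENT_PHASES = {
    "control": frozenset(ALL_PHASES),
    "agentrail": frozenset({"implement", "revise"}),
}


def agent_side_tokens(lane: str, tokens_by_phase: dict[str, int]) -> int:
    """Sum tokens only over the phases the agent itself runs in this lane."""
    return sum(
        count
        for phase, count in tokens_by_phase.items()
        if phase in AGENT_PHASES[lane]
    )


# Made-up, uniform per-phase usage, only to show how the lanes differ.
example_usage = {phase: 1_000 for phase in ALL_PHASES}
print(agent_side_tokens("control", example_usage))    # 9000
print(agent_side_tokens("agentrail", example_usage))  # 2000
```

Real per-phase usage is not uniform, so the ratio between these two outputs says nothing about the measured deltas; the sketch only shows where agent-side spend is counted.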

03 / results

Every metric, every cell.

3 runners × 2 lanes

| runner | metric | control | agentrail | Δ | Δ% |
|---|---|---|---|---|---|
| Cursor Composer 2.5 | non-cache tokens | 8,800,719 | 3,520,245 | −5,280,474 | −60.0% |
| Cursor Composer 2.5 | cache-inclusive tokens | 100,969,521 | 77,663,576 | −23,305,945 | −23.1% |
| Cursor Composer 2.5 | input tokens | 8,524,884 | 3,312,805 | −5,212,079 | −61.1% |
| Cursor Composer 2.5 | output tokens | 275,835 | 207,440 | −68,395 | −24.8% |
| Claude Opus 4.7 | dollar cost (USD) | $48.93 | $38.53 | −$10.40 | −21.3% |
| Claude Opus 4.7 | cache-inclusive tokens | 49,367,189 | 45,202,488 | −4,164,701 | −8.4% |
| Claude Opus 4.7 | cache creation | 2,729,630 | 1,388,016 | −1,341,614 | −49.2% |
| Claude Opus 4.7 | cache read | 46,374,169 | 43,575,956 | −2,798,213 | −6.0% |
| Claude Opus 4.7 | non-cache tokens | 263,390 | 238,516 | −24,874 | −9.4% |
| GPT-5.5 | dollar cost (USD) | ~$8.02 | ~$6.34 | −$1.68 | −21.0% |
| GPT-5.5 | total tokens | 27,392,589 | 24,228,912 | −3,163,677 | −11.5% |
| GPT-5.5 | input tokens | 27,131,547 | 24,003,102 | −3,128,445 | −11.5% |
| GPT-5.5 | cached input | 24,862,976 | 22,605,952 | −2,257,024 | −9.1% |
| GPT-5.5 | output tokens | 207,314 | 176,400 | −30,914 | −14.9% |
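Each row's change columns follow from its control and agentrail values. A minimal sketch of that arithmetic, assuming the percentage is taken relative to the control lane (the function name `delta` is illustrative, not part of the benchmark harness):

```python
def delta(control: float, agentrail: float) -> tuple[float, float]:
    """Return (absolute change, percent change) of agentrail relative to control."""
    change = agentrail - control
    return change, 100.0 * change / control


# Cursor Composer 2.5, non-cache tokens, as reported in the results.
abs_change, pct_change = delta(8_800_719, 3_520_245)
print(f"{abs_change:+,.0f} tokens ({pct_change:+.1f}%)")  # -5,280,474 tokens (-60.0%)
```

The same function applied to the Claude Opus 4.7 dollar figures (48.93 and 38.53) gives the reported −$10.40 and −21.3% after rounding.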

03b / paired bars

Control vs AgentRail, per metric.

ink = control · green = agentrail · normalised per row

[Paired bar chart: one panel per runner, one control/agentrail bar pair per metric. Values and Δ% match the results in 03.]

Cursor Composer 2.5 (Cursor · native agent), headline −60.0% non-cache tokens. Metrics: non-cache tokens, cache-inclusive tokens, input tokens, output tokens. Run id: run_2026-05-25T18-15-35-314Z

Claude Opus 4.7 (Anthropic · via Claude Code), headline −21.3% dollar cost. Metrics: dollar cost (USD), cache-inclusive tokens, cache creation, cache read, non-cache tokens. Run id: run_phase_opus47_5_20260525_213304

GPT-5.5 (OpenAI · via Codex), headline −21.0% dollar cost. Metrics: dollar cost (USD), total tokens, input tokens, cached input, output tokens. Run id: run_phase_gpt55_5_20260525_212153

04 / per-runner notes

Why the deltas differ.

Cursor

Cursor Composer 2.5

Non-cache tokens are the headline figure for Cursor — they correspond to billable activity outside the prompt-cache hit path. Composer 2.5 is plan-billed; per-token dollar cost is not directly comparable to API providers.

run_2026-05-25T18-15-35-314Z

Anthropic

Claude Opus 4.7

Cost falls faster than raw token count (−21.3% against −8.4% cache-inclusive) because the largest cut is in cache creation (−49.2%), which is priced higher than cache read (−6.0%). Direct dollar figures come from Anthropic's billed rates at run time.

run_phase_opus47_5_20260525_213304
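To see why a price-weighted total can drop more than the raw token total, the sketch below weights each Opus token class by a relative price. The weights in `RELATIVE_PRICE` are assumptions chosen for illustration (cache writes a little above plain input, cache reads heavily discounted); they are not the billed rates, and the sketch ignores the higher price of output tokens inside the non-cache figure, so it will not reproduce the reported dollar amounts.

```python
# Token counts from the Claude Opus 4.7 rows of the results.
control = {"cache_creation": 2_729_630, "cache_read": 46_374_169, "non_cache": 263_390}
agentrail = {"cache_creation": 1_388_016, "cache_read": 43_575_956, "non_cache": 238_516}

# ASSUMPTION: illustrative relative prices per token, not the billed rates.
RELATIVE_PRICE = {"cache_creation": 1.25, "cache_read": 0.10, "non_cache": 1.00}


def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


def weighted_total(tokens: dict[str, int]) -> float:
    """Sum of token counts, each scaled by its assumed relative price."""
    return sum(count * RELATIVE_PRICE[kind] for kind, count in tokens.items())


token_cut = reduction_pct(sum(control.values()), sum(agentrail.values()))
weighted_cut = reduction_pct(weighted_total(control), weighted_total(agentrail))
print(f"raw token cut: {token_cut:.1f}%")
print(f"price-weighted cut: {weighted_cut:.1f}%")
```

Under these assumed weights the price-weighted cut comes out larger than the raw token cut, which is the direction of the effect described in the note; the exact size depends on the real rates.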

OpenAI

GPT-5.5

Output-token reductions matter most here — output is the priciest unit on GPT-5.5 and the one most sensitive to scaffolding overhead. Dollar cost computed at OpenAI's published GPT-5-class rates.

run_phase_gpt55_5_20260525_212153
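The gap between the −11.5% token change and the −21.0% cost change is easier to read once uncached input is separated out. The sketch below assumes that the cached-input figure is a subset of the input-token figure, as is usual in OpenAI usage reporting; the report itself does not state this, and `uncached_input` is an illustrative helper name.

```python
def uncached_input(input_tokens: int, cached_input: int) -> int:
    """Input tokens not served from cache (assumes cached is a subset of input)."""
    return input_tokens - cached_input


# GPT-5.5 figures from the results.
control = uncached_input(27_131_547, 24_862_976)
agentrail = uncached_input(24_003_102, 22_605_952)
reduction = 100.0 * (control - agentrail) / control

print(f"{control:,} -> {agentrail:,} uncached input tokens (-{reduction:.1f}%)")
# 2,268,571 -> 1,397,150 uncached input tokens (-38.4%)
```

If the subset assumption holds, full-price input shrinks far more than total input does, which works alongside the output-token reduction in pulling cost down faster than token volume.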

05 / metric definitions

What each number means.

non-cache tokens: Tokens billed at the model's full, uncached rate, i.e. not served from the provider's prompt cache. In the Cursor rows this figure equals input tokens plus output tokens. It is the cleanest signal of net new work the model had to do, and it is where AgentRail's lifecycle routing has the largest effect.
cache-inclusive tokens: Every token the agent sent or received, including cache reads. Absolute numbers are larger and percentage deltas smaller than for non-cache tokens; the figure tracks context-window usage rather than cost.
cache creation: First-time write of a prompt prefix into the provider's cache. Priced higher than cache-read for Anthropic models; AgentRail's stable role briefs cut repeat creations.
cache read: Reuse of a previously cached prefix. Discounted; counts toward cache-inclusive but not non-cache. Higher is generally better — it means the cache is doing its job.
output tokens: Tokens the model generated. Priced highest on most providers. Disproportionate driver of cost on GPT-5.5 and a tight proxy for agent verbosity.
dollar cost: Provider-reported USD spend, summed across all five tasks in the run-set. For Claude Opus, computed from Anthropic's billed rates at the time of the run. For GPT-5.5, computed from OpenAI's published GPT-5-class rates applied to the captured token mix.
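The definitions above imply two sums that can be checked against the reported figures. The sketch below is a consistency check on the published numbers only; whether the harness computes the metrics this way is an assumption, and the function names are illustrative.

```python
def cache_inclusive(cache_creation: int, cache_read: int, non_cache: int) -> int:
    """Cache-inclusive total as the sum of its three token classes."""
    return cache_creation + cache_read + non_cache


def non_cache(input_tokens: int, output_tokens: int) -> int:
    """Non-cache total as uncached input plus output."""
    return input_tokens + output_tokens


# Claude Opus 4.7: creation + read + non-cache matches cache-inclusive in both lanes.
assert cache_inclusive(2_729_630, 46_374_169, 263_390) == 49_367_189  # control
assert cache_inclusive(1_388_016, 43_575_956, 238_516) == 45_202_488  # agentrail

# Cursor Composer 2.5: input + output matches non-cache in both lanes.
assert non_cache(8_524_884, 275_835) == 8_800_719  # control
assert non_cache(3_312_805, 207_440) == 3_520_245  # agentrail

print("reported totals are internally consistent")
```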

06 / reproduce

Run it yourself.

Every prompt, every task harness, and every run id above is in the benchmark repo. Numbers should reproduce within ±2% on a fresh sample.

$ git clone https://github.com/oxnw/agentrail
$ cd agentrail/benchmarks/agentrail-swe-lifecycle
$ npm run benchmark:phase-attribution -- --runner claude-code --count 5
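A reproduced figure can be compared with a published one using a small tolerance check. This is a sketch that reads "±2%" as relative to the published value; the helper name `within_tolerance` and the reproduced values in the example are hypothetical, not output from the harness.

```python
def within_tolerance(published: float, reproduced: float, tolerance: float = 0.02) -> bool:
    """True if `reproduced` is within ±tolerance (relative) of `published`."""
    return abs(reproduced - published) <= tolerance * abs(published)


# Published Claude Opus 4.7 agentrail cost; the reproduced values are made up.
print(within_tolerance(38.53, 39.10))  # True: off by 0.57, limit is about 0.77
print(within_tolerance(38.53, 40.00))  # False: off by 1.47
```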

byok · BUSL-1.1
