Skip to main content
This page documents the methodology and results for the --low-token cost benchmark — a controlled comparison between default and low-token modes on ZO’s canonical MNIST end-to-end reference run.
Status: measured on 2026-04-27 against the canonical MNIST plan. Result: low-token mode landed at $7.75 end-to-end (Sonnet lead $4.48 + Sonnet sub-agents $3.27, captured via npx ccusage --instances) vs. the historical default-mode reference of ~$11. That’s a ~30% reduction, not the 70-80% projected before measurement. The structural reason is below — see Why the savings ceiling is structural. The earlier 70-80% projection assumed every agent would swap from Opus to Sonnet; in practice only the lead is Opus by default (sub-agents already run on Sonnet via their .md frontmatter), so --low-token only affects the lead’s ~30-40% cost share.

Why benchmark MNIST

MNIST is ZO’s canonical reference because:
  • Stable oracle — the must-pass tier (95% test accuracy) is well within the architecture’s capability
  • Six full phases — exercises Phase 1 (data) through Phase 6 (packaging), the full lifecycle
  • Phase-4 iterations matter — non-trivially, the autonomous loop matters for cost
  • Reproducible — the plan, data, and oracle don’t drift between runs
  • Documented baseline — session-005 gave us a precise cost ($11) and wall-time (~50min) anchor point

Methodology

The benchmark runs zo build against the MNIST plan twice in identical environments:
  1. Default runzo build plans/mnist-digit-classifier.md — Opus lead, max-iterations 10, supervised gates auto-PROCEED-ed via --gate-mode full-auto
  2. Low-token runzo build plans/mnist-digit-classifier.md --low-token — Sonnet lead, 2 iterations, full-auto gates, no headlines
For fair comparison, both runs use --gate-mode full-auto so the only differences are the low-token knobs themselves. Results are logged to ~/.claude/projects/*.jsonl; per-session totals are extracted via npx ccusage --json (one of ZO’s planned optional integrations).

Measured dimensions

MetricHow
Input tokens (lead)Sum across all turns of the lead orchestrator
Output tokens (lead)Same
Input/output tokens (sub-agents)Sum across all spawned agents
Headline tokensSum of Haiku ticker calls
Wall timeFrom zo build start to session end
Oracle tier reachedmust_pass, should_pass, or could_pass — ensures quality didn’t regress unacceptably
Phase 4 iterations completedDiagnostic — how often does each mode hit its iteration cap
Total cost (USD)Computed from token totals × Anthropic published rates at benchmark time

Controlled variables

  • Same machine (Mac M-series, 32GB RAM)
  • Same Claude Code version
  • Same MNIST plan, unmodified between runs
  • Same delivery scaffold (fresh zo init)
  • Both runs --gate-mode full-auto (default would have been supervised for the default-mode run; we override to keep gate behaviour identical)

Run frequency

The benchmark is single-shot per release. Re-running on every PR would cost ~$13 each cycle and provide minimal signal — variance is dominated by stochastic agent decisions, not measurement noise.

Measured results (2026-04-27)

First completed end-to-end bench against the MNIST plan, captured via npx ccusage --instances --since "$(date -u +%Y%m%d)":
MetricDefault (historical)Low-token (measured)Reduction
Lead orchestrator~$X.XX (Opus, 10 iter cap)$4.48 (Sonnet 4.6)Lead model swap is the dominant lever
Sub-agents (data-engineer, code-reviewer, test-engineer)Sonnet (already)$3.27 (Sonnet 4.6)Same model — no token-rate saving here
Phase-4 iterations completed1 (MNIST converges fast)1Iteration cap didn’t bite (MNIST converges in 1)
Headline tokens~30 calls × ~1.5 KB0~$0.01 saved
Total cost~$11$7.75~30% reduction
Wall time~50 min~75 min (incl. one stuck-gate cycle)Mixed — see Caveats
Oracle tier reachedcould_pass (99.66%)(training completed at 98.83% — full pipeline cycle did not complete; see Findings)Same model quality at the lead step
The headline number — 30% — is materially smaller than the 70-80% projected before measurement. The next two sections explain why.

Why the savings ceiling is structural

The earlier 70-80% estimate assumed --low-token would swap every agent from Opus to Sonnet (~5× cheaper per token), so the savings would be ~80% across the board. That assumption was wrong. Default mode never had every agent on Opus:
RoleDefault mode--low-token modeToken spend share
Lead orchestratorOpus 4.6Sonnet 4.6~30-40% of total
Data engineerSonnet (per .md frontmatter)Sonnet~20-25%
Code reviewerSonnetSonnet~10-15%
Test engineerSonnetSonnet~10-15%
Oracle / QASonnetSonnet~5-10%
Other sub-agents (xai, domain-eval, etc.)SonnetSonnetremainder
So --low-token only affects the lead’s cost share (~30-40% of total). At ~5× cheaper-per-token on Sonnet, that’s a ceiling of ~25-30% savings on the total — which is exactly what we measured. The original projection extrapolated the per-token ratio (5×) to the whole run. The whole run was already mostly Sonnet. There was no 5× savings hiding in the sub-agents because their cost rate wasn’t changing.

What --low-token actually moves

The 30% breaks down across the preset’s seven knobs:
KnobMechanismApproximate contribution to the 30%
Lead Opus → Sonnet~5× per-token rate, ~30-40% cost share~20-25 pp
Phase-4 max_iterations 10 → 2Caps iteration count~2-5 pp (MNIST converges in 1, so cap doesn’t bite)
stop_on_tier must_passcould_passEarlier termination~1-3 pp (same as above for MNIST)
Drop research-scout cross-cutting~6 spawns × small contracts~1-2 pp
No Haiku headline ticker~60 small calls/hourunder 1 pp (~$0.01)
Default gate-mode full-autoNo human-loop wall-time overheadwall-time only, not token
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=60Earlier auto-compactionindirect (cleaner reasoning, no direct $ saving)
For plans that hit the iteration cap on default mode (harder problems), the iteration knob would matter more. MNIST is too easy — it converges in iteration 1, so the cap → 2 saving is essentially zero.

What would push savings higher

The first bench measured ~30% with the lead-only swap. Two additional levers shipped post-bench target a ~50-60% ceiling without an SDK refactor; three further architectural levers target ~70-80%.

Shipped post-first-bench (target: ~50-60%)

LeverMechanismEstimated additional savingsStatus
Sub-agent model right-sizingTwo-tier routing: code-reviewer, test-engineer, oracle-qa → Haiku 4.5 (pattern-matching tasks); reasoning agents stay on Sonnet. Lead instructed via _prompt_low_token_overrides()~10-15% (Haiku is ~3× cheaper than Sonnet)ShippedLOW_TOKEN_HAIKU_AGENTS in src/zo/_orchestrator_phases.py
Phase-1 trimPhase 1 was ~45% of first bench cost ($3.47 of $7.75). Drop code-reviewer, test-engineer, domain-evaluator from Phase 1 in low-token mode; defer to Gate 5 final pass. Just data-engineer runs~10-15%ShippedLOW_TOKEN_PHASE_DROPS["phase_1"]
Phase-5 trimDrop xai-agent and domain-evaluator from Phase 5; lead writes single-shot analysis summary instead of dedicated explainability + domain-validation pass~3-5%ShippedLOW_TOKEN_PHASE_DROPS["phase_5"]
A second bench post-PR-C is required to confirm whether the 50-60% target is hit. The headline number on this page updates when that bench lands.

Architectural — not yet shipped (target: ~70-80%)

To break past the ~50-60% ceiling, the path forward requires moving ZO from claude CLI subprocess to direct Anthropic SDK so ZO can control caching, batching, and file uploads programmatically:
LeverMechanismPotential additional savingsLibrary / API
Prompt cachingcache_control markers on plan/specs/agent roster (5-min TTL). About half the bench’s tokens were cache reads at ~$0.30/M; explicit cache_control should bring that to ~$0.30 of every $3 of input~50-60% on the cache-read portion (~half of total)Anthropic Python SDK direct
Batch API50% discount for async jobs. Suitable for the Haiku ticker, end-of-session summary, and any non-time-critical evaluation rounds50% off batched callsAnthropic Batch API
Files APIUpload plan + specs once, reference by ID. Eliminates repeated upload of the same large context blocks~10-15% on input tokens for large plansAnthropic Files API
The 70-80% figure was always achievable — just not via the v1 preset alone. The SDK refactor is multi-week and lands in v1.1.

Findings from the first measured run (2026-04-27)

Beyond the headline cost number, the first bench surfaced material findings: Confirmed working:
  • Lead orchestrator runs on Sonnet under --low-token. ps aux during the run showed claude --model sonnet --max-turns 200 for the lead.
  • Sub-agent model override works. All three spawned sub-agents (data-engineer, code-reviewer, test-engineer) ran on claude-sonnet-4-6 per ps aux. The orchestrator’s _prompt_low_token_overrides() instruction to pass model="claude-sonnet-4-6" to every Agent() call is honoured by Claude Code 2.1.107. The earlier “Claude Code 2.1.92 ignores the param” hypothesis is no longer load-bearing — either the bug was version-specific or the prompt-level override always was sufficient. No SDK refactor needed for this piece.
Contract violation discovered (now fixed):
  • zo watch-training rendered “Waiting for training to start…” for the entire run despite training completing at 98.83% test accuracy. Three stacked contract violations: (a) model-builder.md had two contradictory paths for training metrics (.zo/experiments/<exp_id>/ vs logs/training/); (b) wrapper.py and cli.py:watch_training hardcoded the wrong path; (c) the Phase 4 gate was aspirational — only checked result.md, not the actual ZOTrainingCallback artifacts. Sonnet (low-token) ignored both contradictory instructions and wrote a vanilla PyTorch JSON dump to a third invented path.
  • Fix shipped: new resolve_active_experiment_dir() helper as canonical resolver; cli.py and wrapper.py consume it; orchestrator’s gate now hard-fails when metrics.jsonl and training_status.json are missing alongside result.md. Captured in PRIORS.md as PR-035: aspirational agent contracts get ignored under sub-optimal models — hard gate enforcement is mandatory.

Caveats and known limitations

  1. MNIST is an easy benchmark. The convergence-iteration cap (10 → 2) doesn’t bite on MNIST because the model converges in iteration 1. On a harder problem, the iteration cap matters more — and --low-token may fail to converge where default would have succeeded. Re-run without --low-token if zo status shows BUDGET_EXHAUSTED.
  2. Lead model swap doesn’t show MNIST quality regression. MNIST is well within Sonnet’s capability for plan decomposition. On novel research-grade plans, Opus may catch nuances Sonnet misses.
  3. Wall-time savings are not all from tokens. ~30% of the wall-time saving comes from skipped human-loop overhead (--gate-mode full-auto default in low-token), not from compute savings. If you keep --gate-mode supervised, wall time savings shrink even though token savings remain.
  4. Cost depends on the Anthropic rate card at benchmark time. Pricing changes ripple through the total. The methodology measures token counts primarily; cost in USD is derived.
  5. Pro-plan caps measure messages, not tokens. Anthropic Pro subscribers hit a daily message cap, not a token cap. Low-token mode reduces both per-message tokens AND total messages (no Haiku ticker, no end-of-session summary). The cap savings are roughly proportional but not identical to token savings.

Reproducing the benchmark

The harness lives at scripts/benchmark_low_token.sh. Usage:
# From the ZO repo root, with ccusage installed (`npm install -g ccusage`):
./scripts/benchmark_low_token.sh

# Or specify a delivery prefix:
./scripts/benchmark_low_token.sh --delivery-prefix /tmp/zo-bench
The script:
  1. Runs zo init mnist-bench-default --no-tmux ... to scaffold a fresh delivery repo
  2. Captures pre-build ccusage snapshot
  3. Runs zo build in default mode + waits for completion
  4. Captures post-build ccusage snapshot — diff = default-mode tokens
  5. Runs zo init mnist-bench-low --no-tmux ... for the low-token delivery
  6. Pre-build ccusage snapshot
  7. Runs zo build --low-token + waits
  8. Post-build ccusage snapshot — diff = low-token tokens
  9. Writes benchmark-results-{timestamp}.json with the comparison
  10. Prints a summary table
Total wall time: ~75 minutes (default ~50 + low-token ~25). Total cost: $13-14 ($11 + ~$2-3).
The benchmark spends real tokens. Don’t run it on a free tier. The methodology assumes API access (or a Max plan with comfortable headroom).

Updates

This page is updated when a new ZO version changes anything that materially affects the cost profile (preset values, agent roster, prompt structure). The history of measurements over time gives a longitudinal view of how cost-per-build evolves with the platform.
ZO versionDateDefault costLow-token costReductionSource
1.0.2 + low-token2026-04-27~$11 (historical)$7.75 (measured)~30%First measured bench, MNIST plan, ccusage --instances

See also