This page documents the methodology and results for the --low-token cost benchmark — a controlled comparison between default and low-token modes on ZO’s canonical MNIST end-to-end reference run.
Status: measured on 2026-04-27 against the canonical MNIST plan. Result: low-token mode landed at $7.75 end-to-end (Sonnet lead $4.48 + Sonnet sub-agents $3.27, captured via npx ccusage --instances) vs. the historical default-mode reference of ~$11. That’s a ~30% reduction, not the 70-80% projected before measurement. The structural reason is below — see Why the savings ceiling is structural. The earlier 70-80% projection assumed every agent would swap from Opus to Sonnet; in practice only the lead is Opus by default (sub-agents already run on Sonnet via their .md frontmatter), so --low-token only affects the lead’s ~30-40% cost share.
Why benchmark MNIST
MNIST is ZO’s canonical reference because:
- Stable oracle — the must-pass tier (95% test accuracy) is well within the architecture’s capability
- Six full phases — exercises Phase 1 (data) through Phase 6 (packaging), the full lifecycle
- Phase-4 iterations matter — non-trivially, the autonomous loop matters for cost
- Reproducible — the plan, data, and oracle don’t drift between runs
- Documented baseline — session-005 gave us a precise cost ($11) and wall-time (~50min) anchor point
Methodology
The benchmark runs zo build against the MNIST plan twice in identical environments:
- Default run —
zo build plans/mnist-digit-classifier.md — Opus lead, max-iterations 10, supervised gates auto-PROCEED-ed via --gate-mode full-auto
- Low-token run —
zo build plans/mnist-digit-classifier.md --low-token — Sonnet lead, 2 iterations, full-auto gates, no headlines
For fair comparison, both runs use --gate-mode full-auto so the only differences are the low-token knobs themselves. Results are logged to ~/.claude/projects/*.jsonl; per-session totals are extracted via npx ccusage --json (one of ZO’s planned optional integrations).
Measured dimensions
| Metric | How |
|---|
| Input tokens (lead) | Sum across all turns of the lead orchestrator |
| Output tokens (lead) | Same |
| Input/output tokens (sub-agents) | Sum across all spawned agents |
| Headline tokens | Sum of Haiku ticker calls |
| Wall time | From zo build start to session end |
| Oracle tier reached | must_pass, should_pass, or could_pass — ensures quality didn’t regress unacceptably |
| Phase 4 iterations completed | Diagnostic — how often does each mode hit its iteration cap |
| Total cost (USD) | Computed from token totals × Anthropic published rates at benchmark time |
Controlled variables
- Same machine (Mac M-series, 32GB RAM)
- Same Claude Code version
- Same MNIST plan, unmodified between runs
- Same delivery scaffold (fresh
zo init)
- Both runs
--gate-mode full-auto (default would have been supervised for the default-mode run; we override to keep gate behaviour identical)
Run frequency
The benchmark is single-shot per release. Re-running on every PR would cost ~$13 each cycle and provide minimal signal — variance is dominated by stochastic agent decisions, not measurement noise.
Measured results (2026-04-27)
First completed end-to-end bench against the MNIST plan, captured via npx ccusage --instances --since "$(date -u +%Y%m%d)":
| Metric | Default (historical) | Low-token (measured) | Reduction |
|---|
| Lead orchestrator | ~$X.XX (Opus, 10 iter cap) | $4.48 (Sonnet 4.6) | Lead model swap is the dominant lever |
| Sub-agents (data-engineer, code-reviewer, test-engineer) | Sonnet (already) | $3.27 (Sonnet 4.6) | Same model — no token-rate saving here |
| Phase-4 iterations completed | 1 (MNIST converges fast) | 1 | Iteration cap didn’t bite (MNIST converges in 1) |
| Headline tokens | ~30 calls × ~1.5 KB | 0 | ~$0.01 saved |
| Total cost | ~$11 | $7.75 | ~30% reduction |
| Wall time | ~50 min | ~75 min (incl. one stuck-gate cycle) | Mixed — see Caveats |
| Oracle tier reached | could_pass (99.66%) | (training completed at 98.83% — full pipeline cycle did not complete; see Findings) | Same model quality at the lead step |
The headline number — 30% — is materially smaller than the 70-80% projected before measurement. The next two sections explain why.
Why the savings ceiling is structural
The earlier 70-80% estimate assumed --low-token would swap every agent from Opus to Sonnet (~5× cheaper per token), so the savings would be ~80% across the board.
That assumption was wrong. Default mode never had every agent on Opus:
| Role | Default mode | --low-token mode | Token spend share |
|---|
| Lead orchestrator | Opus 4.6 | Sonnet 4.6 | ~30-40% of total |
| Data engineer | Sonnet (per .md frontmatter) | Sonnet | ~20-25% |
| Code reviewer | Sonnet | Sonnet | ~10-15% |
| Test engineer | Sonnet | Sonnet | ~10-15% |
| Oracle / QA | Sonnet | Sonnet | ~5-10% |
| Other sub-agents (xai, domain-eval, etc.) | Sonnet | Sonnet | remainder |
So --low-token only affects the lead’s cost share (~30-40% of total). At ~5× cheaper-per-token on Sonnet, that’s a ceiling of ~25-30% savings on the total — which is exactly what we measured.
The original projection extrapolated the per-token ratio (5×) to the whole run. The whole run was already mostly Sonnet. There was no 5× savings hiding in the sub-agents because their cost rate wasn’t changing.
What --low-token actually moves
The 30% breaks down across the preset’s seven knobs:
| Knob | Mechanism | Approximate contribution to the 30% |
|---|
| Lead Opus → Sonnet | ~5× per-token rate, ~30-40% cost share | ~20-25 pp |
Phase-4 max_iterations 10 → 2 | Caps iteration count | ~2-5 pp (MNIST converges in 1, so cap doesn’t bite) |
stop_on_tier must_pass → could_pass | Earlier termination | ~1-3 pp (same as above for MNIST) |
Drop research-scout cross-cutting | ~6 spawns × small contracts | ~1-2 pp |
| No Haiku headline ticker | ~60 small calls/hour | under 1 pp (~$0.01) |
Default gate-mode full-auto | No human-loop wall-time overhead | wall-time only, not token |
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=60 | Earlier auto-compaction | indirect (cleaner reasoning, no direct $ saving) |
For plans that hit the iteration cap on default mode (harder problems), the iteration knob would matter more. MNIST is too easy — it converges in iteration 1, so the cap → 2 saving is essentially zero.
What would push savings higher
The first bench measured ~30% with the lead-only swap. Two additional levers shipped post-bench target a ~50-60% ceiling without an SDK refactor; three further architectural levers target ~70-80%.
Shipped post-first-bench (target: ~50-60%)
| Lever | Mechanism | Estimated additional savings | Status |
|---|
| Sub-agent model right-sizing | Two-tier routing: code-reviewer, test-engineer, oracle-qa → Haiku 4.5 (pattern-matching tasks); reasoning agents stay on Sonnet. Lead instructed via _prompt_low_token_overrides() | ~10-15% (Haiku is ~3× cheaper than Sonnet) | Shipped — LOW_TOKEN_HAIKU_AGENTS in src/zo/_orchestrator_phases.py |
| Phase-1 trim | Phase 1 was ~45% of first bench cost ($3.47 of $7.75). Drop code-reviewer, test-engineer, domain-evaluator from Phase 1 in low-token mode; defer to Gate 5 final pass. Just data-engineer runs | ~10-15% | Shipped — LOW_TOKEN_PHASE_DROPS["phase_1"] |
| Phase-5 trim | Drop xai-agent and domain-evaluator from Phase 5; lead writes single-shot analysis summary instead of dedicated explainability + domain-validation pass | ~3-5% | Shipped — LOW_TOKEN_PHASE_DROPS["phase_5"] |
A second bench post-PR-C is required to confirm whether the 50-60% target is hit. The headline number on this page updates when that bench lands.
Architectural — not yet shipped (target: ~70-80%)
To break past the ~50-60% ceiling, the path forward requires moving ZO from claude CLI subprocess to direct Anthropic SDK so ZO can control caching, batching, and file uploads programmatically:
| Lever | Mechanism | Potential additional savings | Library / API |
|---|
| Prompt caching | cache_control markers on plan/specs/agent roster (5-min TTL). About half the bench’s tokens were cache reads at ~$0.30/M; explicit cache_control should bring that to ~$0.30 of every $3 of input | ~50-60% on the cache-read portion (~half of total) | Anthropic Python SDK direct |
| Batch API | 50% discount for async jobs. Suitable for the Haiku ticker, end-of-session summary, and any non-time-critical evaluation rounds | 50% off batched calls | Anthropic Batch API |
| Files API | Upload plan + specs once, reference by ID. Eliminates repeated upload of the same large context blocks | ~10-15% on input tokens for large plans | Anthropic Files API |
The 70-80% figure was always achievable — just not via the v1 preset alone. The SDK refactor is multi-week and lands in v1.1.
Findings from the first measured run (2026-04-27)
Beyond the headline cost number, the first bench surfaced material findings:
Confirmed working:
- Lead orchestrator runs on Sonnet under
--low-token. ps aux during the run showed claude --model sonnet --max-turns 200 for the lead.
- Sub-agent model override works. All three spawned sub-agents (data-engineer, code-reviewer, test-engineer) ran on
claude-sonnet-4-6 per ps aux. The orchestrator’s _prompt_low_token_overrides() instruction to pass model="claude-sonnet-4-6" to every Agent() call is honoured by Claude Code 2.1.107. The earlier “Claude Code 2.1.92 ignores the param” hypothesis is no longer load-bearing — either the bug was version-specific or the prompt-level override always was sufficient. No SDK refactor needed for this piece.
Contract violation discovered (now fixed):
zo watch-training rendered “Waiting for training to start…” for the entire run despite training completing at 98.83% test accuracy. Three stacked contract violations: (a) model-builder.md had two contradictory paths for training metrics (.zo/experiments/<exp_id>/ vs logs/training/); (b) wrapper.py and cli.py:watch_training hardcoded the wrong path; (c) the Phase 4 gate was aspirational — only checked result.md, not the actual ZOTrainingCallback artifacts. Sonnet (low-token) ignored both contradictory instructions and wrote a vanilla PyTorch JSON dump to a third invented path.
- Fix shipped: new
resolve_active_experiment_dir() helper as canonical resolver; cli.py and wrapper.py consume it; orchestrator’s gate now hard-fails when metrics.jsonl and training_status.json are missing alongside result.md. Captured in PRIORS.md as PR-035: aspirational agent contracts get ignored under sub-optimal models — hard gate enforcement is mandatory.
Caveats and known limitations
-
MNIST is an easy benchmark. The convergence-iteration cap (10 → 2) doesn’t bite on MNIST because the model converges in iteration 1. On a harder problem, the iteration cap matters more — and
--low-token may fail to converge where default would have succeeded. Re-run without --low-token if zo status shows BUDGET_EXHAUSTED.
-
Lead model swap doesn’t show MNIST quality regression. MNIST is well within Sonnet’s capability for plan decomposition. On novel research-grade plans, Opus may catch nuances Sonnet misses.
-
Wall-time savings are not all from tokens. ~30% of the wall-time saving comes from skipped human-loop overhead (
--gate-mode full-auto default in low-token), not from compute savings. If you keep --gate-mode supervised, wall time savings shrink even though token savings remain.
-
Cost depends on the Anthropic rate card at benchmark time. Pricing changes ripple through the total. The methodology measures token counts primarily; cost in USD is derived.
-
Pro-plan caps measure messages, not tokens. Anthropic Pro subscribers hit a daily message cap, not a token cap. Low-token mode reduces both per-message tokens AND total messages (no Haiku ticker, no end-of-session summary). The cap savings are roughly proportional but not identical to token savings.
Reproducing the benchmark
The harness lives at scripts/benchmark_low_token.sh. Usage:
# From the ZO repo root, with ccusage installed (`npm install -g ccusage`):
./scripts/benchmark_low_token.sh
# Or specify a delivery prefix:
./scripts/benchmark_low_token.sh --delivery-prefix /tmp/zo-bench
The script:
- Runs
zo init mnist-bench-default --no-tmux ... to scaffold a fresh delivery repo
- Captures pre-build ccusage snapshot
- Runs
zo build in default mode + waits for completion
- Captures post-build ccusage snapshot — diff = default-mode tokens
- Runs
zo init mnist-bench-low --no-tmux ... for the low-token delivery
- Pre-build ccusage snapshot
- Runs
zo build --low-token + waits
- Post-build ccusage snapshot — diff = low-token tokens
- Writes
benchmark-results-{timestamp}.json with the comparison
- Prints a summary table
Total wall time: ~75 minutes (default ~50 + low-token ~25). Total cost: $13-14 ($11 + ~$2-3).
The benchmark spends real tokens. Don’t run it on a free tier. The methodology assumes API access (or a Max plan with comfortable headroom).
Updates
This page is updated when a new ZO version changes anything that materially affects the cost profile (preset values, agent roster, prompt structure). The history of measurements over time gives a longitudinal view of how cost-per-build evolves with the platform.
| ZO version | Date | Default cost | Low-token cost | Reduction | Source |
|---|
| 1.0.2 + low-token | 2026-04-27 | ~$11 (historical) | $7.75 (measured) | ~30% | First measured bench, MNIST plan, ccusage --instances |
See also