ZO is built around a Lead Orchestrator (Opus by default) plus an autonomous experiment loop that can iterate up to ten times in Phase 4. That’s the right shape for an API user with budget headroom, and the wrong shape for a Pro-plan subscriber whose daily message cap is finite. Low-token mode swaps several knobs in unison so ZO fits inside a Pro budget without forcing you to learn five flags.

Activate it

zo build plans/my-project.md --low-token
Per-invocation. Today’s run only.
The banner shows a [low-token] badge whenever the preset is active so you have constant visual confirmation.

What the preset flips

| Knob | Default | Low-token | Why |
|---|---|---|---|
| Lead model | opus | sonnet | ~5× cheaper input/output. Biggest single line item. |
| Sub-agent routing | per .md (mostly Sonnet) | two-tier: Haiku for code-reviewer / test-engineer / oracle-qa; Sonnet for reasoning (data-engineer, model-builder, xai, domain-eval, ml-engineer, customs) | Pattern-matching tasks don’t need Sonnet’s lift. Haiku is ~3× cheaper and SWE-bench-competitive on code. |
| Phase 1 agents | full reviewer set | data-engineer only | Reviews + tests deferred to Gate 5 final pass. Phase 1 was ~45% of the first bench’s cost. |
| Phase 5 agents | model-builder + oracle-qa + xai-agent + domain-evaluator | model-builder + oracle-qa only | Lead writes single-shot analysis summary instead of dedicated explainability/domain pass. |
| Phase-4 max_iterations | 10 | 2 | Phase 4 is the dominant cost; iteration count is the multiplier. |
| Phase-4 stop_on_tier | must_pass | could_pass | Stops at the weakest acceptable oracle tier instead of pushing for the strongest. |
| Cross-cutting research-scout | on every phase | dropped | Saves ~6 spawns and their contracts. |
| End-of-session Haiku summary | 1 call per run | disabled | ~$0.0002 per run. The previous per-60-second ticker (~60 small calls/hour) was removed unconditionally; only the one-shot wrap-up remains, and --low-token skips even that. |
| Default gate mode | supervised | full-auto | No human-loop overhead. You can still pass --gate-mode supervised to override. |
| CLAUDE_AUTOCOMPACT_PCT_OVERRIDE | unset (Claude Code default ~83%) | 60 | Auto-compacts conversation context earlier, which prevents performance degradation near the window limit. |
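The two-tier sub-agent routing can be pictured as a small lookup. This is a minimal sketch, not ZO's actual code: the agent names come from the table, while the function name `route_model`, the `declared_model` parameter, and the short model aliases are assumptions made for illustration.

```python
# Hypothetical sketch of two-tier routing. Agent names are taken from the
# preset table; everything else (function name, aliases) is illustrative.
HAIKU_AGENTS = {"code-reviewer", "test-engineer", "oracle-qa"}


def route_model(agent: str, low_token: bool, declared_model: str = "sonnet") -> str:
    """Pick a model alias for a sub-agent.

    Default mode: keep whatever the agent's .md frontmatter declares.
    Low-token mode: pattern-matching agents go to Haiku, the rest to Sonnet.
    """
    if not low_token:
        return declared_model
    return "haiku" if agent in HAIKU_AGENTS else "sonnet"


for name in ("code-reviewer", "data-engineer", "oracle-qa", "model-builder"):
    print(f"{name:15s} -> {route_model(name, low_token=True)}")
```

Keeping the Haiku set as an explicit allow-list means an unknown or newly added agent falls back to Sonnet, which is the more conservative choice for quality.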

Measured savings

First measured bench (MNIST, 2026-04-27) landed at $7.75 vs. the historical default-mode reference of ~$11, a ~30% reduction. That bench used the lead-only swap. The preset has since added two structural levers, Haiku routing for review/test/oracle agents and per-phase agent trims (Phase 1 + Phase 5), that target a ~50-60% ceiling. A second bench post-update is needed to confirm the new measured number.
The honest reason the original 70-80% projection was off: in default mode, sub-agents are already on Sonnet via their .md frontmatter. Only the Lead Orchestrator is Opus. So a lead-only swap only affects the lead’s ~30-40% cost share. At ~5× cheaper per token, that’s a ceiling of ~25-30% on the total, exactly what the first bench measured. The path past that ceiling is right-sizing pattern-matching agents (Haiku) and trimming non-essential reviewers per phase, both of which the preset now does.
For a path past 50-60%, see Cost benchmark → What would push savings higher: prompt caching via SDK refactor, Batch API for parallel evaluation, Files API for repeated context.
The 30% (or post-update 50-60%) headline depends on plan shape:
  • Plans that hit the iteration cap on default mode (harder problems, multiple Phase 4 retries) see more: max_iterations 10 → 2 starts to bite.
  • Plans with no Phase 4 (data-only, feature-engineering-only) see less: only the lead-model swap and the disabled headlines matter (~15-20%).
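The lead-only ceiling follows from simple arithmetic, sketched below. The function name `total_saving` is hypothetical, and the model assumes token counts stay the same when the lead model changes, which a real run will not match exactly.

```python
def total_saving(lead_share: float, price_ratio: float) -> float:
    """Fraction of total cost saved when only the lead model gets cheaper.

    lead_share:  the lead's share of total cost in default mode (0..1)
    price_ratio: how many times cheaper the replacement lead is per token
    Assumes token usage is unchanged by the swap.
    """
    return lead_share * (1.0 - 1.0 / price_ratio)


# Lead share of 30-40% and a ~5x cheaper lead, as quoted above.
for share in (0.30, 0.40):
    print(f"lead share {share:.0%} -> total saving {total_saving(share, 5.0):.0%}")
```

With these inputs the result is roughly 24% to 32%, which brackets the ~25-30% ceiling quoted above.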

Override individual knobs

The preset is a starting point, not a ceiling. Override flags compose with --low-token:
# Keep Opus lead but everything else low-token
zo build plans/my-project.md --low-token --lead-model opus

# Allow 5 iterations instead of 2 (still better than the default 10)
zo build plans/my-project.md --low-token --max-iterations 5

# Low-token but keep human-in-loop gates
zo build plans/my-project.md --low-token --gate-mode supervised

Precedence

Highest first:
  1. CLI flag: --lead-model, --max-iterations, --gate-mode
  2. Plan field: lead_model: in YAML frontmatter, ## Experiment Loop section
  3. Low-token preset: applied when --low-token or low_token: true
  4. Base default: Opus, 10 iterations, supervised, etc.
This means a plan can opt back into research-grade settings (high iteration count, Opus lead) even with low_token: true set globally. The preset is a “sensible defaults” layer, not a hard clamp.
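A minimal sketch of the four-layer lookup, assuming each layer is a flat mapping. The names `resolve_settings`, `BASE_DEFAULTS`, and `LOW_TOKEN_PRESET` are hypothetical, and only three of the preset's knobs are modelled; in a real plan, the values would come from the YAML frontmatter and the ## Experiment Loop section rather than a single dict.

```python
from typing import Any, Dict, Mapping

# Values from the preset table; the dict layout itself is illustrative.
BASE_DEFAULTS: Dict[str, Any] = {
    "lead_model": "opus",
    "max_iterations": 10,
    "gate_mode": "supervised",
}
LOW_TOKEN_PRESET: Dict[str, Any] = {
    "lead_model": "sonnet",
    "max_iterations": 2,
    "gate_mode": "full-auto",
}


def resolve_settings(cli: Mapping[str, Any], plan: Mapping[str, Any]) -> Dict[str, Any]:
    """Resolve each knob: CLI flag > plan field > low-token preset > base default."""
    low_token = bool(cli.get("low_token") or plan.get("low_token"))
    resolved: Dict[str, Any] = {}
    for key, base_value in BASE_DEFAULTS.items():
        if cli.get(key) is not None:
            resolved[key] = cli[key]
        elif plan.get(key) is not None:
            resolved[key] = plan[key]
        elif low_token:
            resolved[key] = LOW_TOKEN_PRESET[key]
        else:
            resolved[key] = base_value
    return resolved


# --low-token --lead-model opus: Opus lead, everything else from the preset.
print(resolve_settings({"low_token": True, "lead_model": "opus"}, {}))
# low_token: true in the plan, with the plan opting back into 8 iterations.
print(resolve_settings({}, {"low_token": True, "max_iterations": 8}))
```

Resolving each knob independently is what lets a single override, such as the lead model, coexist with the rest of the preset.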

Trade-offs

Low-token mode is a quality-for-cost trade. Be honest with yourself about which side you’re paying.
  • Lead nuance. Sonnet is excellent at execution and decomposition for well-defined plans. For research with shifting goalposts or projects whose plan needs creative interpretation, Opus catches things Sonnet misses. If you’re spinning up a new domain for the first time, run the first plan on Opus to validate, then switch to low-token for replays.
  • Iteration depth. With max_iterations=2, hard problems may not converge before the cap. ZO surfaces this in zo status so you can re-run without --low-token if the oracle wasn’t met.
  • Cross-cutting research. Without research-scout, agents don’t get a baseline literature review at every phase. For projects in a familiar domain this is fine; for genuinely novel problems, list research-scout explicitly under **Active agents:** in the plan to override.
  • No headlines. You lose the periodic 1-line “what’s happening” feed. Raw events are still in logs/comms/<date>.jsonl if you want to tail -f.

What’s NOT in this mode (yet)

The features below would help but require a larger architectural change: switching ZO from the claude CLI launcher to direct Anthropic SDK calls. They’re tracked as future work and are out of scope for low-token mode v1:
  • Prompt caching: Anthropic’s 5-minute TTL cache. Would save ~90% on cached input tokens.
  • Batch API: 50% discount for async jobs. Applicable to other Haiku/Sonnet calls ZO might add in future (e.g., end-of-phase reports) but incompatible with interactive tmux. The per-60-second Haiku ticker that originally motivated this note has since been removed.
  • Files API: upload static artifacts (plans, specs) once, reference by ID.
  • Extended thinking budget tuning: cap thinking tokens explicitly.

Side-by-side comparison

What changes, end-to-end, between a default run and a low-token run on the same plan:
| Lifecycle stage | Default | Low-token | Saving driver |
|---|---|---|---|
| Lead orchestrator session | Opus, 200 turns | Sonnet, 200 turns | Per-turn cost ~5× cheaper |
| Lead-prompt build | Full roster + dedicated adaptations section + per-agent contracts | Compact roster + inline adaptations only + per-agent contracts | ~2-5 KB removed per phase |
| Phase 1-3 gates | Pause for human (supervised) | Auto-PROCEED (full-auto) | No human-loop overhead |
| Phase 4 first iteration | Same | Same | (none) |
| Phase 4 iteration cap | 10 attempts | 2 attempts | The big multiplier |
| Phase 4 stop condition | must_pass tier | could_pass tier | Stops earlier on weakest acceptable result |
| Cross-cutting research-scout | Spawned per phase | Skipped | ~6 spawns × ~1 KB contract |
| Cross-cutting code-reviewer | Spawned per phase | Spawned per phase | (kept, quality safety net) |
| End-of-session Haiku summary | Generated | Skipped | 1 small call per session. (The previous per-60-second headline ticker, ~60 small calls/hour, was removed unconditionally.) |
| Auto-compaction trigger | ~83% of context window | 60% of context window | Earlier compaction → less degraded reasoning at the tail |
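The interaction between the Phase 4 iteration cap and the stop condition can be sketched as a bounded loop. This is an illustrative model only: `run_phase4`, the `run_experiment` callback, and the `CONVERGED` status are hypothetical names, while `BUDGET_EXHAUSTED` is the status the loop reports when the cap is reached.

```python
from typing import Callable, Set


def run_phase4(
    run_experiment: Callable[[int], Set[str]],
    max_iterations: int,
    stop_on_tier: str,
) -> str:
    """Run experiments until the stop tier passes or the cap is reached.

    run_experiment(i) returns the set of oracle tiers that iteration i passed.
    """
    for iteration in range(1, max_iterations + 1):
        tiers_passed = run_experiment(iteration)
        if stop_on_tier in tiers_passed:
            return "CONVERGED"  # hypothetical status name
    return "BUDGET_EXHAUSTED"


def fake_experiment(iteration: int) -> Set[str]:
    # Stand-in for a hard problem that only reaches the tier on iteration 3.
    return {"could_pass"} if iteration >= 3 else set()


print(run_phase4(fake_experiment, max_iterations=2, stop_on_tier="could_pass"))
print(run_phase4(fake_experiment, max_iterations=10, stop_on_tier="could_pass"))
```

The first call stops at the cap without converging; the second has enough headroom to reach the tier. That is the case where re-running with a higher --max-iterations helps.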

Worked example: MNIST

The MNIST end-to-end demo (Phase 1 → 6, oracle threshold 95% must_pass / 99% could_pass) is the canonical reference run. Default-mode cost was ~$11, dominated by the Lead Orchestrator on Opus.
# Default run (historical reference)
zo build plans/mnist-digit-classifier.md
# Lead: Opus
# Phase-4 iterations: up to 10 (MNIST converges in 1)
# Headlines: ~60 per hour
# Cost: ~$11
# Wall time: ~50 min

# Low-token run (measured 2026-04-27)
zo build plans/mnist-digit-classifier.md --low-token
# Lead: Sonnet 4.6 (~5x cheaper per token)
# Sub-agents: Sonnet 4.6 (already, no change)
# Phase-4 iterations: capped at 2 (demo hits target on iteration 1, cap doesn't bite)
# Headlines: disabled
# Cost: $7.75 (~30% reduction)
# Wall time: ~75 min (one stuck-gate cycle in this run)
Cost breakdown of the measured run: Lead orchestrator $4.48 + sub-agents $3.27 = $7.75 total (captured via npx ccusage --instances). Phase 1 alone was $3.47, the largest single phase; Phase 4 ran only one iteration. Full bench writeup, including the why-30%-not-70% breakdown: reference/cost-benchmark.
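The percentages quoted for this run can be re-derived from the dollar figures. The sketch below uses only the numbers stated above; the helper name `summarize` is hypothetical, and the ~$11 default-mode figure is an approximate historical reference, so the reduction is approximate too.

```python
from typing import Dict


def summarize(lead: float, sub_agents: float, phase1: float, reference: float) -> Dict[str, float]:
    """Derive total cost and the headline ratios from the measured figures."""
    total = lead + sub_agents
    return {
        "total": total,
        "reduction": 1.0 - total / reference,  # vs. default-mode reference
        "phase1_share": phase1 / total,        # Phase 1 as a share of the run
        "lead_share": lead / total,            # lead as a share of the run
    }


stats = summarize(lead=4.48, sub_agents=3.27, phase1=3.47, reference=11.00)
print(f"total        ${stats['total']:.2f}")
print(f"reduction    {stats['reduction']:.1%}")
print(f"phase1 share {stats['phase1_share']:.1%}")
print(f"lead share   {stats['lead_share']:.1%}")
```

The reduction comes out just under 30%, and Phase 1 accounts for about 45% of the measured run, matching the figures quoted elsewhere on this page.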

Worked example: tiny plan with no Phase 4

If your plan finishes inside Phase 3 (data-only or feature-engineering-only project), the iteration-cap savings disappear and only the lead-model swap and headline disable matter. Expected reduction: ~15-20% vs. the ~30% for a full lifecycle plan.

When to use low-token mode

Replays

A plan that has already converged on the default settings. Replay on low-token; the second run rarely needs ten iterations to find the same answer.

Pro plan / student account

Daily message caps make Opus runs untenable. Low-token’s Sonnet lead + 2-iteration cap fits inside most cap budgets.

Demos

A live demo of ZO doesn’t need ten Phase-4 iterations to make the point. Low-token’s faster wall time also makes for a better demo.

Ablations

When you’re varying a single parameter across many runs, low-token cuts the marginal cost so you can run more variants for the same spend.

When NOT to use low-token mode

  • The first end-to-end run of a research-grade project where you don’t yet know what the right plan looks like. Default mode’s higher iteration count gives you more attempts to discover what works.
  • A production launch where the cost of a poorly-converged model is far higher than the token cost.
  • Time-sensitive demos: no headlines means less live signal to your stakeholders. (Consider --low-token plus --gate-mode supervised to keep the human-in-loop signal at gates.)
In each case: run on default settings, validate the result, then switch on low_token: true for subsequent replays and ablations.

FAQ

Does low-token mode reduce quality?
Yes, that’s the trade. The lead orchestrator (Opus → Sonnet) is the biggest quality reduction, but Sonnet 4.6 is excellent at execution and decomposition for well-defined plans. Where it falls down is open-ended creative interpretation of an under-specified plan. The Phase-4 iteration cap (10 → 2) means hard problems get fewer attempts to converge; zo status surfaces this so you can re-run without --low-token if the oracle wasn’t met.

Can I keep Opus as the lead and still use the rest of the preset?
Yes. The override flags compose: zo build plans/x.md --low-token --lead-model opus keeps Opus lead and applies the rest of the preset (max iterations, no headlines, full-auto, earlier compaction). Useful when the lead model matters for plan decomposition but you want the iteration savings.

What happens when Phase 4 hits the iteration cap?
The autonomous loop stops with BUDGET_EXHAUSTED. The phase remains in a state where you can re-run without --low-token (or with a higher --max-iterations), and the loop picks up from where it left off. Child experiments inherit parent_id, so the lineage is preserved. You don’t lose the work; you just unlock more iterations.

Is low_token: true in the plan the same as passing --low-token?
Almost. The plan field activates the same preset. The only difference: CLI flags ALWAYS win over plan fields. So a plan with low_token: true AND a --lead-model opus flag runs Opus lead but everything else low-token.

Why is research-scout dropped?
The research-scout agent provides cross-cutting baseline literature review. It’s useful but not essential: its absence rarely changes the outcome of a specific phase, and dropping it saves ~6 spawns and their contracts per build. Code-reviewer is kept for quality safety; research-scout is the safer drop. You can list research-scout explicitly in your plan’s **Active agents:** block, but note that the orchestrator currently filters it out regardless of the active list when low_token=True. This is documented as a known limitation; an opt-out mechanism may land in v2 if requested.

Does --low-token work with zo continue and zo draft?
zo continue accepts the same flags as zo build and forwards them. zo draft does NOT yet support --low-token; drafting uses Opus + 100 max-turns by default. Adding low-token support to draft is tracked as a follow-up; the savings are smaller anyway because draft is shorter than build.

Were sub-agents already on Sonnet in default mode?
Yes, sub-agents are on Sonnet 4.6 in both default mode and low-token mode. Default mode never had sub-agents on Opus; their .md frontmatter declares model: claude-sonnet-4-6. So switching to --low-token doesn’t change the sub-agent cost rate. This is also the reason the savings ceiling is ~30% rather than the 70-80% an “everyone-was-Opus” projection would imply.
The lead orchestrator’s prompt in low-token mode does include a Sub-Agent Model Override section that explicitly passes model="claude-sonnet-4-6" to every Agent() call as a defence-in-depth measure (in case some future Claude Code version changes the default). Verified working end-to-end on Claude Code 2.1.107: ps aux during the measured bench showed all four processes (lead + 3 sub-agents) on claude-sonnet-4-6.

Can I see how many tokens a run used?
Not yet. ZO doesn’t ship a built-in token meter (the claude CLI logs tokens to ~/.claude/projects/*.jsonl but ZO doesn’t surface them). The Optional integrations (planned) section in the README lists ccusage, a Claude Code token usage monitor, as a near-term opt-in for zo usage. For now: run npx ccusage --json after a build to see per-session totals.

Why not use Haiku for the lead?
Haiku 4.5 is excellent on coding (SWE-bench 73.3%, rivalling Sonnet 4) and very cheap. But the Lead Orchestrator’s job is multi-step orchestration, contract reasoning, gate evaluation, and team coordination, and Haiku struggles with multi-step planning under uncertainty. Sonnet is the safer default for v1. You can manually opt into Haiku lead with --lead-model haiku if you want to push the savings further (probably 10-15× cheaper than Opus, but with material quality risk).

How do gates behave in low-token mode?
By default, low-token sets --gate-mode full-auto. All gates auto-PROCEED when artifact contracts and oracle thresholds pass. To keep human-in-loop gates while still saving on lead/iterations, pass --gate-mode supervised explicitly: zo build plans/x.md --low-token --gate-mode supervised. The CLI flag wins over the preset’s full-auto default.

What other savings are possible?
Several Anthropic API features could provide additional savings but require switching ZO from the claude CLI launcher to direct Anthropic SDK calls, which is out of scope for low-token v1:
  • Prompt caching: 5-minute TTL cache. Would save ~90% on cached input tokens.
  • Batch API: 50% discount for async jobs. Applicable to other Haiku/Sonnet calls ZO might add in future (e.g., end-of-phase reports) but incompatible with interactive tmux. The per-60-second Haiku ticker that originally motivated this note has since been removed.
  • Files API: upload static artifacts (plans, specs) once, reference by ID.
  • Extended thinking budget tuning: cap thinking tokens explicitly.
Tracked as future work; would be a separate, larger architectural change.

See also