ZO is built around a Lead Orchestrator (Opus by default) plus an autonomous experiment loop that can iterate up to ten times in Phase 4. That’s the right shape for an API user with budget headroom — and the wrong shape for a Pro-plan subscriber whose daily message cap is finite. Low-token mode swaps several knobs in unison so ZO fits inside a Pro budget without forcing you to learn five flags.

Activate it

zo build plans/my-project.md --low-token
The flag is per-invocation: it applies to today's run only.
The banner shows a [low-token] badge whenever the preset is active so you have constant visual confirmation.

What the preset flips

| Knob | Default | Low-token | Why |
|---|---|---|---|
| Lead model | opus | sonnet | ~5× cheaper input/output. Biggest single line item. |
| Phase-4 `max_iterations` | 10 | 2 | Phase 4 is the dominant cost; iteration count is the multiplier. |
| Phase-4 `stop_on_tier` | must_pass | could_pass | Stops at the weakest acceptable oracle tier instead of pushing for the strongest. |
| Cross-cutting research-scout | on every phase | dropped | Saves ~6 spawns and their contracts. code-reviewer is kept — silent quality drift is worse than the saved tokens. |
| Haiku headline ticker | every 60s | disabled | ~60 small calls/hour. Small individually, cumulative across long runs. |
| Default gate mode | supervised | full-auto | No human-loop overhead. You can still pass `--gate-mode supervised` to override. |
| `CLAUDE_AUTOCOMPACT_PCT_OVERRIDE` | unset (Claude Code default ~83%) | 60 | Auto-compacts conversation context earlier, preventing performance degradation near the window limit. |
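Conceptually, the preset is just an overlay applied on top of the base defaults before plan fields and CLI flags are considered. A minimal bash sketch under that assumption (knob names and values mirror the table above; the real resolution happens inside ZO and may differ):

```shell
#!/usr/bin/env bash
# Base defaults and the low-token overlay, as associative arrays
declare -A knobs=(
  [lead_model]=opus [max_iterations]=10
  [stop_on_tier]=must_pass [gate_mode]=supervised
)
declare -A low_token=(
  [lead_model]=sonnet [max_iterations]=2
  [stop_on_tier]=could_pass [gate_mode]=full-auto
)
# Applying the preset = overwriting exactly the knobs it names
for k in "${!low_token[@]}"; do knobs[$k]=${low_token[$k]}; done
echo "${knobs[lead_model]} ${knobs[max_iterations]}"   # sonnet 2
```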

Estimated savings

The MNIST end-to-end demo cost ~$11 on default settings. With `--low-token` the same run lands at **~$2-3** — a 5-8× reduction. Most of the saving comes from the lead model swap and the iteration cap; the rest is incremental. The numbers depend heavily on plan complexity. A plan with no Phase-4 experiments will see a smaller relative saving (closer to 2-3×, driven entirely by the lead model swap). A plan that hits the iteration cap will see the largest saving.
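The 5-8× range decomposes into roughly the ~5× lead-model price ratio times whatever the iteration cap saves. Illustrative arithmetic only, using the ~$11 figure quoted above:

```shell
default_cost=11    # ~$11 MNIST run on default settings
model_factor=5     # Opus -> Sonnet, ~5x cheaper per token
# The lead swap alone: ~$11 / 5, which is the low end of the ~$2-3 estimate.
# The iteration cap pushes the effective factor toward 8x on runs that
# would otherwise burn all ten Phase-4 attempts.
low_end=$(awk -v c="$default_cost" -v f="$model_factor" 'BEGIN { printf "%.2f", c/f }')
echo "$low_end"    # 2.20
```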

Override individual knobs

The preset is a starting point, not a ceiling. Override flags compose with --low-token:
# Keep Opus lead but everything else low-token
zo build plans/my-project.md --low-token --lead-model opus

# Allow 5 iterations instead of 2 (still better than the default 10)
zo build plans/my-project.md --low-token --max-iterations 5

# Low-token but keep human-in-loop gates
zo build plans/my-project.md --low-token --gate-mode supervised

Precedence

Highest first:
  1. CLI flag — `--lead-model`, `--max-iterations`, `--gate-mode`
  2. Plan field — `lead_model:` in YAML frontmatter, `## Experiment Loop` section
  3. Low-token preset — applied when `--low-token` or `low_token: true`
  4. Base default — Opus, 10 iterations, supervised, etc.
This means a plan can opt back into research-grade settings (high iteration count, Opus lead) even with low_token: true set globally. The preset is a “sensible defaults” layer, not a hard clamp.
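The four levels behave like ordinary "first set value wins" fallback, scanning from the highest precedence down. A hypothetical bash sketch of resolving one knob (variable names are illustrative; ZO's internal resolution may be structured differently):

```shell
# Lowest to highest precedence for the lead-model knob
base_default="opus"        # 4. base default
preset="sonnet"            # 3. --low-token preset value
plan_field=""              # 2. plan frontmatter (empty = unset)
cli_flag=""                # 1. CLI flag (empty = unset)

# Highest non-empty level wins
lead_model="${cli_flag:-${plan_field:-${preset:-$base_default}}}"
echo "$lead_model"         # sonnet

# A plan field opts back in over the preset, as described above
plan_field="opus"
lead_model="${cli_flag:-${plan_field:-${preset:-$base_default}}}"
echo "$lead_model"         # opus
```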

Trade-offs

Low-token mode is a quality-for-cost trade. Be honest with yourself about which side you’re paying.
  • Lead nuance. Sonnet is excellent at execution and decomposition for well-defined plans. For research with shifting goalposts or projects whose plan needs creative interpretation, Opus catches things Sonnet misses. If you’re spinning up a new domain for the first time, run the first plan on Opus to validate, then switch to low-token for replays.
  • Iteration depth. With max_iterations=2, hard problems may not converge before the cap. ZO surfaces this in zo status so you can re-run without --low-token if the oracle wasn’t met.
  • Cross-cutting research. Without research-scout, agents don’t get a baseline literature review at every phase. For projects in a familiar domain this is fine; for genuinely novel problems, list research-scout explicitly under **Active agents:** in the plan to override.
  • No headlines. You lose the periodic 1-line “what’s happening” feed. Raw events are still in logs/comms/<date>.jsonl if you want to tail -f.

What’s NOT in this mode (yet)

The features below would help but require a larger architectural change (switching ZO from the claude CLI launcher to direct Anthropic SDK calls). They’re tracked as future work, not in scope for low-token mode v1:
  • Prompt caching — Anthropic’s 5-minute TTL cache. Would save ~90% on cached input tokens.
  • Batch API — 50% discount for async jobs. Suitable for the Haiku ticker but incompatible with interactive tmux.
  • Files API — upload static artifacts (plans, specs) once, reference by ID.
  • Extended thinking budget tuning — cap thinking tokens explicitly.

Side-by-side comparison

What changes, end-to-end, between a default run and a low-token run on the same plan:
| Lifecycle stage | Default | Low-token | Saving driver |
|---|---|---|---|
| Lead orchestrator session | Opus, 200 turns | Sonnet, 200 turns | Per-turn cost ~5× cheaper |
| Lead-prompt build | Full roster + dedicated adaptations section + per-agent contracts | Compact roster + inline adaptations only + per-agent contracts | ~2-5 KB removed per phase |
| Phase 1-3 gates | Pause for human (supervised) | Auto-PROCEED (full-auto) | No human-loop overhead |
| Phase 4 first iteration | Same | Same | |
| Phase 4 iteration cap | 10 attempts | 2 attempts | The big multiplier |
| Phase 4 stop condition | must_pass tier | could_pass tier | Stops earlier on weakest acceptable result |
| Cross-cutting research-scout | Spawned per phase | Skipped | ~6 spawns × ~1 KB contract |
| Cross-cutting code-reviewer | Spawned per phase | Spawned per phase | (kept — quality safety net) |
| Haiku headline ticker | Every 60s | Disabled | ~60 small calls/hour |
| End-of-session Haiku summary | Generated | Skipped | 1 small call per session |
| Auto-compaction trigger | ~83% of context window | 60% of context window | Earlier compaction → less degraded reasoning at the tail |

Worked example: MNIST

The MNIST end-to-end demo (Phase 1 → 6, oracle threshold 95% must_pass / 99% could_pass) is the canonical reference run. Default-mode cost was ~$11, dominated by the Lead Orchestrator on Opus through ten Phase-4 iterations. Low-token expectations:
# Default run
zo build plans/mnist-digit-classifier.md
# Lead: Opus
# Phase-4 iterations: up to 10
# Headlines: ~60 per hour
# Estimated cost: ~$11
# Estimated wall time: ~50 min

# Low-token run
zo build plans/mnist-digit-classifier.md --low-token
# Lead: Sonnet (~5x cheaper)
# Phase-4 iterations: capped at 2 (the demo hits target on iteration 1)
# Headlines: disabled
# Estimated cost: ~$2-3
# Estimated wall time: ~25 min
Empirical numbers from a controlled benchmark will land in reference/cost-benchmark once the comparison run completes.

Worked example: tiny plan with no Phase 4

If your plan finishes inside Phase 3 (data-only or feature-engineering-only project), the iteration-cap savings disappear — only the lead-model swap and headline disable matter. Expected reduction: ~2-3× instead of 5-8×. Still useful, just smaller in absolute terms.

When to use low-token mode

Replays

A plan that has already converged on the default settings. Replay on low-token; the second run rarely needs ten iterations to find the same answer.

Pro plan / student account

Daily message caps make Opus runs untenable. Low-token’s Sonnet lead + 2-iteration cap fits inside most cap budgets.

Demos

A live demo of ZO doesn’t need ten Phase-4 iterations to make the point. Low-token’s faster wall time also makes for a better demo.

Ablations

When you’re varying a single parameter across many runs, low-token cuts the marginal cost so you can run more variants for the same spend.

When NOT to use low-token mode

  • The first end-to-end run of a research-grade project where you don’t yet know what the right plan looks like. Default mode’s higher iteration count gives you more attempts to discover what works.
  • A production launch where the cost of a poorly-converged model is far higher than the token cost.
  • Time-sensitive demos — no headlines means less live signal to your stakeholders. (Consider --low-token plus --gate-mode supervised to keep the human-in-loop signal at gates.)
In each case: run on default settings, validate the result, then switch on low_token: true for subsequent replays and ablations.

FAQ

**Does low-token mode reduce quality?**
Yes — that’s the trade. The lead orchestrator (Opus → Sonnet) is the biggest quality reduction, but Sonnet 4.6 is excellent at execution and decomposition for well-defined plans. Where it falls down is open-ended creative interpretation of an under-specified plan. The Phase-4 iteration cap (10 → 2) means hard problems get fewer attempts to converge — zo status surfaces this so you can re-run without --low-token if the oracle wasn’t met.

**Can I keep Opus as the lead but take the rest of the preset?**
Yes. The override flags compose: zo build plans/x.md --low-token --lead-model opus keeps Opus lead and applies the rest of the preset (max iterations, no headlines, full-auto, earlier compaction). Useful when the lead model matters for plan decomposition but you want the iteration savings.

**What happens if the iteration cap is hit before the oracle passes?**
The autonomous loop stops with BUDGET_EXHAUSTED. The phase remains in a state where you can re-run without --low-token (or with a higher --max-iterations) and the loop picks up from where it left off — child experiments inherit parent_id so the lineage is preserved. You don’t lose the work, you just unlock more iterations.

**Is low_token: true in the plan the same as passing --low-token?**
Almost. The plan field activates the same preset. The only difference: CLI flags ALWAYS win over plan fields. So a plan with low_token: true AND a --lead-model opus flag runs Opus lead but everything else low-token.

**Why is research-scout dropped but code-reviewer kept?**
The research-scout agent provides cross-cutting baseline literature review. It’s useful but not essential — its absence rarely changes the outcome of a specific phase, and dropping it saves ~6 spawns and their contracts per build. Code-reviewer is kept for quality safety; research-scout is the safer drop. If you genuinely need research-scout active in low-token mode, add it explicitly to your plan’s **Active agents:** block — but note that the orchestrator filters research-scout regardless of the active list when low_token=True (this is documented as a known limitation; an opt-out mechanism may land in v2 if requested).

**Do zo continue and zo draft support --low-token?**
zo continue accepts the same flags as zo build and forwards them. zo draft does NOT yet support --low-token — drafting uses Opus + 100 max-turns by default. Adding low-token support to draft is tracked as a follow-up; the savings are smaller anyway because draft is shorter than build.

**Can I measure actual token usage per run?**
Not yet. ZO doesn’t ship a built-in token meter (the claude CLI logs tokens to ~/.claude/projects/*.jsonl but ZO doesn’t surface them). The Optional integrations (planned) section in the README lists ccusage — a Claude Code token usage monitor — as a near-term opt-in for zo usage. For now: run npx ccusage --json after a build to see per-session totals.

**Why not use Haiku as the lead model?**
Haiku 4.5 is excellent on coding (SWE-bench 73.3%, rivalling Sonnet 4) and very cheap. But the Lead Orchestrator’s job is multi-step orchestration, contract reasoning, gate evaluation, and team coordination — Haiku struggles with multi-step planning under uncertainty. Sonnet is the safer default for v1. You can manually opt into Haiku lead with --lead-model haiku if you want to push the savings further (probably 10-15× cheaper than Opus, but with material quality risk).

**What happens to gates in low-token mode?**
By default, low-token sets --gate-mode full-auto. All gates auto-PROCEED when artifact contracts and oracle thresholds pass. To keep human-in-loop gates while still saving on lead/iterations, pass --gate-mode supervised explicitly: zo build plans/x.md --low-token --gate-mode supervised. The CLI flag wins over the preset’s full-auto default.

**What about prompt caching, the Batch API, and other API-level savings?**
Several Anthropic API features could provide additional savings but require switching ZO from the claude CLI launcher to direct Anthropic SDK calls — out of scope for low-token v1:
  • Prompt caching — 5-minute TTL cache. Would save ~90% on cached input tokens.
  • Batch API — 50% discount for async jobs. Suitable for the Haiku ticker but incompatible with interactive tmux.
  • Files API — upload static artifacts (plans, specs) once, reference by ID.
  • Extended thinking budget tuning — cap thinking tokens explicitly.
Tracked as future work; would be a separate, larger architectural change.

See also