Activate it
- CLI flag
- Plan field
[low-token] badge whenever the preset is active so you have constant visual confirmation.
What the preset flips
| Knob | Default | Low-token | Why |
|---|---|---|---|
| Lead model | opus | sonnet | ~5× cheaper input/output. Biggest single line item. |
| Sub-agent routing | per .md (mostly Sonnet) | two-tier: Haiku for code-reviewer / test-engineer / oracle-qa; Sonnet for reasoning (data-engineer, model-builder, xai, domain-eval, ml-engineer, customs) | Pattern-matching tasks don’t need Sonnet’s lift. Haiku is ~3× cheaper and SWE-bench-competitive on code. |
| Phase 1 agents | full reviewer set | data-engineer only | Reviews + tests deferred to Gate 5 final pass. Phase 1 was ~45% of the first bench’s cost. |
| Phase 5 agents | model-builder + oracle-qa + xai-agent + domain-evaluator | model-builder + oracle-qa only | Lead writes single-shot analysis summary instead of dedicated explainability/domain pass. |
Phase-4 max_iterations | 10 | 2 | Phase 4 is the dominant cost; iteration count is the multiplier. |
Phase-4 stop_on_tier | must_pass | could_pass | Stops at the weakest acceptable oracle tier instead of pushing for the strongest. |
Cross-cutting research-scout | on every phase | dropped | Saves ~6 spawns and their contracts. |
| End-of-session Haiku summary | 1 call per run | disabled | ~$0.0002 per run. The previous per-60-second ticker (~60 small calls/hour) was removed unconditionally; only the one-shot wrap-up remains, and --low-token skips even that. |
| Default gate mode | supervised | full-auto | No human-loop overhead. You can still pass --gate-mode supervised to override. |
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE | unset (Claude Code default ~83%) | 60 | Auto-compacts conversation context earlier, prevents performance degradation near the window limit. |
Measured savings
First measured bench (MNIST, 2026-04-27) landed at $7.75 vs. the historical default-mode reference of ~$11, a ~30% reduction. That bench used the lead-only swap. The preset has since added two structural levers, Haiku routing for review/test/oracle agents and per-phase agent trims (Phase 1 + Phase 5), that target a ~50-60% ceiling. A second bench post-update is needed to confirm the new measured number. The honest reason the original 70-80% projection was off: in default mode, sub-agents are already on Sonnet via their.md frontmatter. Only the Lead Orchestrator is Opus. So a lead-only swap only affects the lead’s ~30-40% cost share. At ~5× cheaper per token, that’s a ceiling of ~25-30% on the total, exactly what the first bench measured. The path past that ceiling is right-sizing pattern-matching agents (Haiku) and trimming non-essential reviewers per phase, both of which the preset now does.
For a path past 50-60%, see Cost benchmark → What would push savings higher: prompt caching via SDK refactor, Batch API for parallel evaluation, Files API for repeated context.
The 30% (or post-update 50-60%) headline depends on plan shape. Plans that hit the iteration cap on default mode (harder problems, multiple Phase 4 retries) see more, max_iterations 10 → 2 starts to bite. Plans with no Phase 4 (data-only, feature-engineering-only) see less, only the lead-model swap and headline disable matter (~15-20%).
Override individual knobs
The preset is a starting point, not a ceiling. Override flags compose with--low-token:
Precedence
Highest first:- CLI flag:
--lead-model,--max-iterations,--gate-mode - Plan field:
lead_model:in YAML frontmatter,## Experiment Loopsection - Low-token preset: applied when
--low-tokenorlow_token: true - Base default: Opus, 10 iterations, supervised, etc.
low_token: true set globally. The preset is a “sensible defaults” layer, not a hard clamp.
Trade-offs
- Lead nuance. Sonnet is excellent at execution and decomposition for well-defined plans. For research with shifting goalposts or projects whose plan needs creative interpretation, Opus catches things Sonnet misses. If you’re spinning up a new domain for the first time, run the first plan on Opus to validate, then switch to low-token for replays.
- Iteration depth. With
max_iterations=2, hard problems may not converge before the cap. ZO surfaces this inzo statusso you can re-run without--low-tokenif the oracle wasn’t met. - Cross-cutting research. Without
research-scout, agents don’t get a baseline literature review at every phase. For projects in a familiar domain this is fine; for genuinely novel problems, listresearch-scoutexplicitly under**Active agents:**in the plan to override. - No headlines. You lose the periodic 1-line “what’s happening” feed. Raw events are still in
logs/comms/<date>.jsonlif you want totail -f.
When NOT to use low-token mode
- The first end-to-end run of a research-grade project where you don’t yet know what the right plan looks like.
- A production launch where the cost of a poorly-converged model is far higher than the token cost of running on Opus.
- Time-sensitive demos: no headlines means less live signal to your stakeholders.
low_token: true for subsequent replays and ablations.
What’s NOT in this mode (yet)
The features below would help but require a larger architectural change (switching ZO fromclaude CLI launcher to direct Anthropic SDK). They’re tracked as future work, not in scope for low-token mode v1:
- Prompt caching: Anthropic’s 5-minute TTL cache. Would save ~90% on cached input tokens.
- Batch API: 50% discount for async jobs. Applicable to other Haiku/Sonnet calls ZO might add in future (e.g., end-of-phase reports) but incompatible with interactive tmux. The per-60-second Haiku ticker that originally motivated this note has since been removed.
- Files API: upload static artifacts (plans, specs) once, reference by ID.
- Extended thinking budget tuning: cap thinking tokens explicitly.
Side-by-side comparison
What changes, end-to-end, between a default run and a low-token run on the same plan:| Lifecycle stage | Default | Low-token | Saving driver |
|---|---|---|---|
| Lead orchestrator session | Opus, 200 turns | Sonnet, 200 turns | Per-turn cost ~5× cheaper |
| Lead-prompt build | Full roster + dedicated adaptations section + per-agent contracts | Compact roster + inline adaptations only + per-agent contracts | ~2-5 KB removed per phase |
| Phase 1-3 gates | Pause for human (supervised) | Auto-PROCEED (full-auto) | No human-loop overhead |
| Phase 4 first iteration | Same | Same | , |
| Phase 4 iteration cap | 10 attempts | 2 attempts | The big multiplier |
| Phase 4 stop condition | must_pass tier | could_pass tier | Stops earlier on weakest acceptable result |
Cross-cutting research-scout | Spawned per phase | Skipped | ~6 spawns × ~1 KB contract |
Cross-cutting code-reviewer | Spawned per phase | Spawned per phase | (kept, quality safety net) |
| End-of-session Haiku summary | Generated | Skipped | 1 small call per session. (The previous per-60-second headline ticker — ~60 small calls/hour — was removed unconditionally.) |
| Auto-compaction trigger | ~83% of context window | 60% of context window | Earlier compaction → less degraded reasoning at the tail |
Worked example: MNIST
The MNIST end-to-end demo (Phase 1 → 6, oracle threshold 95% must_pass / 99% could_pass) is the canonical reference run. Default-mode cost was ~$11, dominated by the Lead Orchestrator on Opus.npx ccusage --instances). Phase 1 alone was $3.47, the largest single phase, Phase 4 only ran one iteration.
Full bench writeup with the why-30%-not-70% breakdown: reference/cost-benchmark.
Worked example: tiny plan with no Phase 4
If your plan finishes inside Phase 3 (data-only or feature-engineering-only project), the iteration-cap savings disappear and only the lead-model swap and headline disable matter. Expected reduction: ~15-20% vs. the ~30% for a full lifecycle plan.When to use low-token mode
Replays
Pro plan / student account
Demos
Ablations
When NOT to use low-token mode
- The first end-to-end run of a research-grade project where you don’t yet know what the right plan looks like. Default mode’s higher iteration count gives you more attempts to discover what works.
- A production launch where the cost of a poorly-converged model is far higher than the token cost.
- Time-sensitive demos: no headlines means less live signal to your stakeholders. (Consider
--low-tokenplus--gate-mode supervisedto keep the human-in-loop signal at gates.)
low_token: true for subsequent replays and ablations.
FAQ
Does low-token mode reduce model quality?
Does low-token mode reduce model quality?
zo status surfaces this so you can re-run without --low-token if the oracle wasn’t met.Can I use low-token mode AND keep Opus for the lead?
Can I use low-token mode AND keep Opus for the lead?
zo build plans/x.md --low-token --lead-model opus keeps Opus lead and applies the rest of the preset (max iterations, no headlines, full-auto, earlier compaction). Useful when the lead model matters for plan decomposition but you want the iteration savings.What happens if Phase 4 doesn't converge in 2 iterations?
What happens if Phase 4 doesn't converge in 2 iterations?
BUDGET_EXHAUSTED. The phase remains in a state where you can re-run without --low-token (or with a higher --max-iterations) and the loop picks up from where it left off, child experiments inherit parent_id so the lineage is preserved. You don’t lose the work, you just unlock more iterations.Is `low_token: true` in the plan equivalent to `--low-token` on the CLI?
Is `low_token: true` in the plan equivalent to `--low-token` on the CLI?
low_token: true AND a --lead-model opus flag runs Opus lead but everything else low-token.Why isn't research-scout in low-token mode?
Why isn't research-scout in low-token mode?
**Active agents:** block, but note that the orchestrator filters research-scout regardless of the active list when low_token=True (this is documented as a known limitation; an opt-out mechanism may land in v2 if requested).Does this work with `zo continue` and `zo draft`?
Does this work with `zo continue` and `zo draft`?
zo continue accepts the same flags as zo build and forwards them. zo draft does NOT yet support --low-token, drafting uses Opus + 100 max-turns by default. Adding low-token support to draft is tracked as a follow-up; the savings are smaller anyway because draft is shorter than build.Do sub-agents (data-engineer, model-builder, etc.) also run on Sonnet in low-token mode?
Do sub-agents (data-engineer, model-builder, etc.) also run on Sonnet in low-token mode?
.md frontmatter declares model: claude-sonnet-4-6. So switching to --low-token doesn’t change the sub-agent cost rate. This is also the reason the savings ceiling is ~30% rather than the 70-80% an “everyone-was-Opus” projection would imply.The lead orchestrator’s prompt in low-token mode does include a Sub-Agent Model Override section that explicitly passes model="claude-sonnet-4-6" to every Agent() call as a defence-in-depth measure (in case some future Claude Code version changes the default). Verified working end-to-end on Claude Code 2.1.107: ps aux during the measured bench showed all four processes (lead + 3 sub-agents) on claude-sonnet-4-6.Can I see how much I'm saving in real time?
Can I see how much I'm saving in real time?
claude CLI logs tokens to ~/.claude/projects/*.jsonl but ZO doesn’t surface them). The Optional integrations (planned) section in the README lists ccusage, a Claude Code token usage monitor, as a near-term opt-in for zo usage. For now: run npx ccusage --json after a build to see per-session totals.Why not lower lead to Haiku?
Why not lower lead to Haiku?
--lead-model haiku if you want to push the savings further (probably 10-15× cheaper than Opus, but with material quality risk).What happens at gates in low-token mode?
What happens at gates in low-token mode?
--gate-mode full-auto. All gates auto-PROCEED when artifact contracts and oracle thresholds pass. To keep human-in-loop gates while still saving on lead/iterations, pass --gate-mode supervised explicitly: zo build plans/x.md --low-token --gate-mode supervised. The CLI flag wins over the preset’s full-auto default.What's NOT in this mode?
What's NOT in this mode?
claude CLI launcher to direct Anthropic SDK calls, out of scope for low-token v1:- Prompt caching: 5-minute TTL cache. Would save ~90% on cached input tokens.
- Batch API: 50% discount for async jobs. Applicable to other Haiku/Sonnet calls ZO might add in future (e.g., end-of-phase reports) but incompatible with interactive tmux. The per-60-second Haiku ticker that originally motivated this note has since been removed.
- Files API: upload static artifacts (plans, specs) once, reference by ID.
- Extended thinking budget tuning: cap thinking tokens explicitly.
See also
- Low-token preset reference: the compact reference card
- Cost benchmark: measured savings on the MNIST reference run
zo buildCLI reference: full options- The plan:
low_tokenandlead_modelfrontmatter fields - Phases and gates: how
max_iterationsandstop_on_tierinteract with Phase-4 gates