Why oracles matter
Autonomous agents without measurable criteria don’t stop. They produce more code, more reports, more iterations — none of which are verified. The oracle is the system’s source of truth: until the oracle says PASS, no deliverable is complete. Karpathy’s framing: autonomy scales through rigorous specification, not natural-language ambiguity. A vague oracle (“model should perform well”) gives agents nowhere to anchor. A hard oracle (“test accuracy ≥ 0.95 on a held-out 10K-sample MNIST test set, computed via correct / total over a forward pass”) gives them a contract.
Anatomy of an oracle
Primary metric
A single named scalar. Examples:
test_accuracy, mAP@0.5, BLEU, latency_ms_p95, feature_importance_stability. Picking the right metric is the hardest part of plan-writing.
Ground truth source
Where the labels come from, exactly.
torchvision.datasets.MNIST(train=False), or data/labels/holdout_2024Q3.csv, or human-annotated by a panel of 3 with majority vote.
Evaluation method
The deterministic procedure that produces the metric value. Code-level specificity. “Forward pass over the full test set, argmax over softmax, compute correct / total.”
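A minimal sketch of that procedure for the MNIST example, assuming a PyTorch model (the model, device, and batch size here are placeholders, not part of any ZO contract):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def evaluate_accuracy(model: torch.nn.Module, device: str = "cpu") -> float:
    """Oracle evaluation: forward pass over the full MNIST test set,
    argmax over class scores (argmax of softmax equals argmax of logits),
    accuracy = correct / total."""
    test_set = datasets.MNIST(root="data", train=False, download=True,
                              transform=transforms.ToTensor())
    loader = DataLoader(test_set, batch_size=256, shuffle=False)

    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total
```

Determinism is the point: shuffle=False, no dropout (model.eval()), and the full test set every time, so two runs of the oracle on the same checkpoint produce the same number.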
Target threshold (tiered)
Three tiers — must_pass (the floor; below this is failure), should_pass (the goal), could_pass (the stretch). Each is a number.
Evaluation frequency
When does the oracle run? After every training epoch, after each Phase 4 iteration, only at Gate 5, etc.
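Assembled, the fields above might look like this in a plan. A hedged sketch only: the dict layout and field names are illustrative, and only the 0.95 floor comes from the MNIST example earlier; the other thresholds are made up.

```python
# Illustrative oracle spec; field names are hypothetical, not ZO's schema.
mnist_oracle = {
    "primary_metric": "test_accuracy",
    "ground_truth": "torchvision.datasets.MNIST(train=False)",  # 10K held-out samples
    "evaluation_method": "forward pass over the full test set, "
                         "argmax over softmax, accuracy = correct / total",
    "thresholds": {
        "must_pass": 0.95,    # the floor; below this is failure
        "should_pass": 0.97,  # the goal (made-up value)
        "could_pass": 0.99,   # the stretch (made-up value)
    },
    "frequency": "after every training epoch",
    "stop_on_tier": "must_pass",
}
```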
Tiered success
ZO uses three tiers because real projects rarely have a single accept/reject line. Use stop_on_tier to decide when to stop iterating. The default is must_pass (stop at the floor); set it to could_pass to keep iterating until you hit research-grade results or run out of budget.
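In code, the tier check reduces to threshold comparisons. A minimal sketch, reusing the illustrative thresholds from the spec above (highest_tier and should_stop are hypothetical names, not ZO's API):

```python
# Illustrative tier logic; threshold values reuse the sketch above.
THRESHOLDS = {"must_pass": 0.95, "should_pass": 0.97, "could_pass": 0.99}

def highest_tier(metric: float) -> str | None:
    """Return the best tier this metric clears, or None below the floor."""
    for tier in ("could_pass", "should_pass", "must_pass"):
        if metric >= THRESHOLDS[tier]:
            return tier
    return None

def should_stop(metric: float, stop_on_tier: str = "must_pass") -> bool:
    """Keep iterating until the configured tier is reached."""
    return metric >= THRESHOLDS[stop_on_tier]
```

With stop_on_tier="must_pass", a run scoring 0.96 stops; with "could_pass", it keeps iterating until it reaches 0.99 or exhausts the budget.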
Examples
- Image classification (MNIST)
- Image classification (CIFAR-10)
- Time-series forecasting (sketched below)
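These are listed as pointers rather than expanded here. As one illustration of the third entry, a forecasting oracle might be pinned down like this (every field name and number below is hypothetical):

```python
# Hypothetical oracle for the time-series example; nothing here is
# taken from ZO's actual schema or defaults.
forecasting_oracle = {
    "primary_metric": "smape",  # symmetric MAPE over the holdout window
    "ground_truth": "data/labels/holdout_2024Q3.csv",
    "evaluation_method": "rolling-origin forecasts, horizon = 24 steps, "
                         "smape = mean(2 * |y - yhat| / (|y| + |yhat|))",
    "direction": "minimize",  # error metric: lower is better
    "thresholds": {"must_pass": 0.20, "should_pass": 0.12, "could_pass": 0.08},
    "frequency": "after each Phase 4 iteration",
}
```

Because sMAPE is minimized, must_pass here is a ceiling rather than a floor, so the comparison in the tier-logic sketch above flips from ≥ to ≤.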
What makes a bad oracle
The canonical bad oracle is the vague one from above: “model should perform well” names no metric, no ground-truth source, and no threshold, so agents have nothing to converge on. When the oracle is wrong (chosen poorly, or revealed wrong by Phase 4 results), the human updates plan.md and agents re-run. The orchestrator detects the diff and re-plans against the new oracle.
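One way that diff detection could work is by fingerprinting the plan file. This is a guess at the mechanism, not ZO's documented implementation, and replan_against_new_oracle is a hypothetical hook:

```python
import hashlib
import pathlib

def plan_fingerprint(path: str = "plan.md") -> str:
    """Hash the plan file so edits to the oracle are detectable."""
    return hashlib.sha256(pathlib.Path(path).read_text().encode()).hexdigest()

# Orchestrator loop, schematically: re-plan whenever the plan changes.
# last = plan_fingerprint()
# while not done:
#     run_iteration()
#     if plan_fingerprint() != last:
#         last = plan_fingerprint()
#         replan_against_new_oracle()  # hypothetical hook
```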
Statistical significance
For models trained on small or noisy data, raw metric values aren’t enough — the question is whether observed performance is reliably better than baseline. The oracle’s optional statistical-significance section defines:
- Confidence intervals on the primary metric (e.g. a bootstrap interval, or a 95% Wilson CI on classification accuracy)
- Paired tests against a baseline model
- Minimum effect-size thresholds
Agents report these alongside the raw metric in result.md.
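As a sketch of the first bullet, a percentile bootstrap over per-example correctness (the function name and defaults are illustrative; a plan may mandate a Wilson interval instead):

```python
import numpy as np

def bootstrap_accuracy_ci(correct: np.ndarray, n_boot: int = 10_000,
                          alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI on accuracy, given a 0/1 vector marking
    which test examples the model classified correctly."""
    rng = np.random.default_rng(seed)
    # Resample examples with replacement; recompute accuracy per resample.
    accs = rng.choice(correct, size=(n_boot, len(correct)), replace=True).mean(axis=1)
    lo, hi = np.quantile(accs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

At 10K test examples and 0.95 accuracy, the interval comes out to roughly ±0.004, comfortably narrow enough to separate a 0.95 floor from a 0.97 goal.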
Next
Phases & gates
How the oracle is checked against deliverables at each gate.
The plan
Where the oracle lives and how it composes with other plan sections.