Recording Goldens
Goldens are the reference points against which drift is measured. Each golden becomes a snapshot of what the code actually produced at a tagged commit.
The recording script
npx tsx packages/evals/scripts/record-goldens.ts --suite <suite> [flags]

| Flag | Default | What it does |
|---|---|---|
| --suite | orchestrator | Which suite to record |
| --model | claude-sonnet-4-20250514 | Recording model (orchestrator only) |
| --samples | 3 | Samples per trajectory for stability checking |
| --commit | off (dry run) | Actually overwrite the SQLite dataset |
| --plan-only | off | Print the routing table and exit; no SUT invocations |
| --output | golden/recording-diff-<suite>.json | Where to write the diff report |
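To make the defaults in the table concrete, here is a minimal sketch of how they could be resolved from an argument list. The `RecordOptions` type and `parseRecordArgs` function are hypothetical names for illustration and are not taken from the actual script; only the flag names and default values come from the table.

```typescript
// Hypothetical option shape; the fields mirror the flags in the table above.
interface RecordOptions {
  suite: string;
  model: string;
  samples: number;
  commit: boolean;
  planOnly: boolean;
  output: string;
}

// Parse `--flag value` pairs and boolean switches, applying the documented defaults.
// This is an illustrative parser, not the script's real argument handling.
function parseRecordArgs(argv: string[]): RecordOptions {
  const values = new Map<string, string>();
  const switches = new Set<string>();
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--commit" || arg === "--plan-only") {
      switches.add(arg);
    } else if (arg.startsWith("--") && i + 1 < argv.length) {
      values.set(arg, argv[i + 1]);
      i++; // skip the value that was just consumed
    }
  }
  const suite = values.get("--suite") ?? "orchestrator";
  return {
    suite,
    model: values.get("--model") ?? "claude-sonnet-4-20250514",
    samples: Number(values.get("--samples") ?? "3"),
    commit: switches.has("--commit"), // absent means dry run
    planOnly: switches.has("--plan-only"),
    // The default report path depends on the suite being recorded.
    output: values.get("--output") ?? `golden/recording-diff-${suite}.json`,
  };
}

const opts = parseRecordArgs(["--suite", "memory"]);
console.log(opts.output, opts.commit); // golden/recording-diff-memory.json false
```

Note that the default for `--output` is derived from `--suite`, so it has to be resolved after the suite is known.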
Preview routing without running anything
--plan-only shows which graph builder or handler each trajectory dispatches to. Useful when adding a new trajectory tag and verifying the planner picks it up:
$ npx tsx scripts/record-goldens.ts --suite orchestrator --plan-only
[record-goldens] suite=orchestrator model=claude-sonnet-4-20250514 ...
[1/18] PLAN e503a104 Single-node: research TypeScript history — graph=single-agent tool=web_search
[2/18] PLAN 0c9fbbc0 Single-node: summarize document — graph=single-agent tool=web_search
...
[10/18] PLAN 9b71dc96 Delegation: research and writing team — graph=supervisor tool=none
...
[record-goldens] plan totals: supported=18 skipped=0

Any unsupported trajectories show up as SKIP with a reason (e.g., “No reference graph for tags [some-future-tag] yet”).
Dry run (default)
Without --commit, the script samples each trajectory, builds the diff report, and writes it to disk, but does not overwrite the SQLite dataset.
$ npx tsx scripts/record-goldens.ts --suite memory
[record-goldens] suite=memory model=... samples=3 commit=false plan-only=false
[1/18] REC e759f3ad Segmentation: time-gap based episode splitting
...
[18/18] REC 5ed8519c Conflict: no false positive on unrelated facts
[record-goldens] Diff written to: golden/recording-diff-memory.json
[record-goldens] Totals: recorded=18 skipped=0 unstable=0 errored=0
[record-goldens] Dry run — pass --commit to overwrite the SQLite dataset.

The diff report contains, for each trajectory:
- The old hand-authored or recorded expectedOutput
- The new observed output
- All raw sample data (for unstable cases, you can see exactly which sample diverged)
Inspect this before committing. Look for:
- Tests that switched from passing-against-intent to failing-against-reality (good — finds wrong goldens)
- Tests that flipped meaning entirely (suspicious — investigate)
- Unstable tests where samples disagreed (judge or library is non-deterministic; investigate before committing)
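The inspection pass above can be partly mechanized. The following is a rough sketch that sorts diff entries into unstable, changed, and unchanged buckets. The `DiffEntry` field names are assumptions made for illustration; the real report schema in golden/recording-diff-<suite>.json may differ, so adapt the fields before using anything like this.

```typescript
// Hypothetical shape of one diff report entry (field names are assumptions).
interface DiffEntry {
  id: string;
  oldExpected: unknown; // previous hand-authored or recorded expectedOutput
  newObserved: unknown; // output observed during this recording
  samples: unknown[];   // raw per-sample data
  unstable: boolean;    // true when samples disagreed
}

interface Triage {
  unstable: string[];
  changed: string[];
  unchanged: string[];
}

// Bucket entries so the ones that need human attention are listed first.
function triage(entries: DiffEntry[]): Triage {
  const result: Triage = { unstable: [], changed: [], unchanged: [] };
  for (const entry of entries) {
    if (entry.unstable) {
      result.unstable.push(entry.id);
    } else if (
      // Structural comparison via JSON; sensitive to key order, which is
      // acceptable for a quick review aid but not for strict equality.
      JSON.stringify(entry.oldExpected) !== JSON.stringify(entry.newObserved)
    ) {
      result.changed.push(entry.id);
    } else {
      result.unchanged.push(entry.id);
    }
  }
  return result;
}

// Made-up entries purely for demonstration.
const report: DiffEntry[] = [
  { id: "t1", oldExpected: { episodes: 2 }, newObserved: { episodes: 2 }, samples: [], unstable: false },
  { id: "t2", oldExpected: { episodes: 2 }, newObserved: { episodes: 3 }, samples: [], unstable: false },
  { id: "t3", oldExpected: null, newObserved: null, samples: ["a", "b"], unstable: true },
];
console.log(triage(report));
```

Entries in the `changed` bucket still need a human decision: the sketch cannot tell a corrected golden from a regression.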
Commit
npx tsx scripts/record-goldens.ts --suite memory --commit

The script refuses to commit if any trajectory errored or was unstable across samples; its job is to lock in stable behavior, not to paper over fragility.
On commit, the script:
- Writes golden/data/<suite>-v1.sqlite.gz with the new trajectories
- Updates golden/manifest.json (sha256, count, schema version, timestamp)
- Tags each new trajectory with source: 'recorded' + recordedAt + recordedModel + recordedCommit
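The commit guard and the provenance tagging described above might look roughly like this. `Totals`, `canCommit`, and `tagRecorded` are hypothetical names; the totals fields follow the counters printed in the script's log line, and the four provenance fields follow the list above. The ISO 8601 timestamp format is an assumption.

```typescript
// Counters as printed in the "Totals:" log line.
interface Totals {
  recorded: number;
  skipped: number;
  unstable: number;
  errored: number;
}

// Commit is allowed only when nothing errored and nothing was unstable.
function canCommit(totals: Totals): boolean {
  return totals.errored === 0 && totals.unstable === 0;
}

// Provenance fields attached to each newly recorded trajectory.
interface Provenance {
  source: "recorded";
  recordedAt: string;     // assumed to be an ISO 8601 timestamp
  recordedModel: string;
  recordedCommit: string;
}

// Return a copy of the trajectory with provenance attached (hypothetical helper).
function tagRecorded<T extends object>(
  trajectory: T,
  model: string,
  commit: string,
  now: Date = new Date(),
): T & Provenance {
  const provenance: Provenance = {
    source: "recorded",
    recordedAt: now.toISOString(),
    recordedModel: model,
    recordedCommit: commit,
  };
  return { ...trajectory, ...provenance };
}

const totals: Totals = { recorded: 18, skipped: 0, unstable: 0, errored: 0 };
if (canCommit(totals)) {
  // "abc1234" is a placeholder commit hash.
  console.log(tagRecorded({ id: "t1" }, "claude-sonnet-4-20250514", "abc1234"));
}
```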
Per-suite specifics
Memory + context-engine (no LLM)
Recording is a deterministic library snapshot. Each trajectory’s input runs through the appropriate library API (segmenter, extractor, dedup, etc.) and the output is serialized as the new expected value.
npx tsx scripts/record-goldens.ts --suite memory
npx tsx scripts/record-goldens.ts --suite context-engine

No API keys required. Takes <2s.
Orchestrator (requires Anthropic key)
Each trajectory’s input is mapped to a reference graph by tag:
- linear/basic/no-tools → single-agent
- supervisor/multi-agent/delegation → supervisor
- branching/conditional → branching
- error/retry → retry (with mocked flaky tool fixtures)
- budget/state → single-agent (with appropriate mocked tools)
The graph runs through GraphRunner against the real LLM. Tool calls go through a mock resolver so recording is network-free apart from the LLM itself.
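The tag-to-graph mapping can be sketched as a lookup table with a SKIP fallback. `TAG_TO_GRAPH` and `planTrajectory` are hypothetical names, and the rule that the first matching tag wins is an assumption; the source does not say how a trajectory with tags from several families is resolved. The skip reason follows the wording of the example SKIP message shown for --plan-only.

```typescript
type GraphKind = "single-agent" | "supervisor" | "branching" | "retry";

// Tag families from the list above. A Map avoids accidental hits on
// inherited object properties such as "constructor".
const TAG_TO_GRAPH = new Map<string, GraphKind>([
  ["linear", "single-agent"],
  ["basic", "single-agent"],
  ["no-tools", "single-agent"],
  ["supervisor", "supervisor"],
  ["multi-agent", "supervisor"],
  ["delegation", "supervisor"],
  ["branching", "branching"],
  ["conditional", "branching"],
  ["error", "retry"],
  ["retry", "retry"],
  ["budget", "single-agent"],
  ["state", "single-agent"],
]);

type Plan =
  | { kind: "plan"; graph: GraphKind }
  | { kind: "skip"; reason: string };

// Assumption: the first tag with a known mapping decides the graph.
function planTrajectory(tags: string[]): Plan {
  for (const tag of tags) {
    const graph = TAG_TO_GRAPH.get(tag);
    if (graph !== undefined) {
      return { kind: "plan", graph };
    }
  }
  return {
    kind: "skip",
    reason: `No reference graph for tags [${tags.join(", ")}] yet`,
  };
}

console.log(planTrajectory(["delegation"]));      // { kind: 'plan', graph: 'supervisor' }
console.log(planTrajectory(["some-future-tag"])); // skip with a reason
```

Returning a skip value instead of throwing lets a planner report every unsupported trajectory in one pass, which matches the plan totals line.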
ANTHROPIC_API_KEY=sk-ant-... \
  npx tsx scripts/record-goldens.ts --suite orchestrator

Cost is bounded by the per-trajectory token estimate × 18 trajectories × 3 samples. For Sonnet, that’s roughly $0.50–$1.00 per recording session.
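As a quick check on that bound, the arithmetic can be worked through directly. The sketch below only rearranges the numbers stated above (18 trajectories, 3 samples, $0.50–$1.00 per session); it assumes cost scales linearly with the number of LLM invocations and makes no claim about actual token prices. The function names are illustrative.

```typescript
const TRAJECTORIES = 18;
const DEFAULT_SAMPLES = 3;

// Total LLM invocations in one recording session.
function invocations(trajectories: number, samples: number): number {
  return trajectories * samples;
}

// Upper bound on session cost given a per-invocation cost estimate in USD.
function sessionBoundUsd(perInvocationUsd: number, trajectories: number, samples: number): number {
  return perInvocationUsd * invocations(trajectories, samples);
}

const runs = invocations(TRAJECTORIES, DEFAULT_SAMPLES);
console.log(runs); // 54

// Per-invocation cost implied by the quoted session range.
console.log((0.5 / runs).toFixed(4)); // 0.0093
console.log((1.0 / runs).toFixed(4)); // 0.0185

// Raising --samples to 5 scales the bound by 5/3 under the linear assumption.
console.log(sessionBoundUsd(1.0 / runs, TRAJECTORIES, 5).toFixed(2)); // 1.67
```

So the quoted range corresponds to roughly one to two cents per sampled trajectory, and the bound grows in proportion to --samples.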
Stability checking
Every recording invocation samples each trajectory N times (default 3) and verifies that all samples produced the same tool-call sequence shape. If they didn’t:
- The trajectory is flagged unstable in the diff
- The script refuses to commit
- The unstable samples are included in the diff so you can see what diverged
This catches:
- Non-determinism in the LLM (which is expected sometimes)
- Race conditions in the graph runner
- Tool fixtures with state leaking across samples (this should be impossible — fixtures are constructed per-sample — but if it happens the stability check catches it)
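One way to picture the stability check is as a comparison of a per-sample "shape" fingerprint. The sketch below assumes that shape means the ordered list of tool names, ignoring arguments; the source does not define it precisely, so treat that as an assumption. `ToolCall`, `shapeOf`, and `checkStability` are hypothetical names.

```typescript
// Hypothetical record of one tool call made during a sample run.
interface ToolCall {
  tool: string;
  args: unknown;
}

type Sample = ToolCall[];

// Assumption: the "shape" of a sample is its ordered sequence of tool names.
// JSON encoding keeps names with unusual characters from colliding.
function shapeOf(sample: Sample): string {
  return JSON.stringify(sample.map((call) => call.tool));
}

// A trajectory is stable when every sample has the same shape as the first.
function checkStability(samples: Sample[]): { stable: boolean; shapes: string[] } {
  const shapes = samples.map(shapeOf);
  const stable = shapes.every((shape) => shape === shapes[0]);
  return { stable, shapes };
}

// Made-up samples: the third run calls web_search twice.
const samples: Sample[] = [
  [{ tool: "web_search", args: { q: "typescript history" } }],
  [{ tool: "web_search", args: { q: "history of typescript" } }],
  [
    { tool: "web_search", args: { q: "typescript" } },
    { tool: "web_search", args: { q: "typescript origins" } },
  ],
];

const result = checkStability(samples);
console.log(result.stable); // false
console.log(result.shapes); // per-sample shapes, so the divergent run is visible
```

Comparing shapes rather than full outputs tolerates differences in argument wording between runs while still catching a sample that takes a different path through its tools.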
After recording
# Confirm the new dataset round-trips through the schema
npm test --workspace=packages/evals

# Run the eval harness against the new goldens
npm run evals --workspace=packages/evals -- --deterministic-only

# If you want a fresh baseline against the new goldens
rm -f packages/evals/golden/baselines/main-latest.json
npm run evals --workspace=packages/evals -- --deterministic-only --baseline

Commit the .sqlite.gz files + the updated manifest together; reviewers can spot-check by re-running the recording locally.
Related
- Eval Harness: why goldens are recorded rather than authored
- Adding a SUT Handler: extend the recorder to cover a new tag family
- Adding an Eval Suite: add an entirely new suite