Skip to content

Running the Eval Harness

The eval harness has one entry point — npm run evals — and a handful of flags that compose. This guide walks through the modes you’ll actually use.

Terminal window
npm run evals --workspace=packages/evals -- --deterministic-only

Runs ~36 library-level tests in <1 second. Catches:

  • Compression ratio regressions
  • Dedup behavior changes
  • Budget enforcement breaks
  • Memory segmentation, temporal filtering, subgraph extraction
  • Conflict detection rules

Use as a pre-commit hook or a fast PR signal. Does not exercise any LLM-bound code path.

Terminal window
npm run evals --workspace=packages/evals

Runs both tracks against a local Ollama model. Free but slower (~30s) and the judge is weaker than a frontier model — useful for “does this even compile” but not for production gating.

3. Full CI evaluation (GPT-4o, multi-sample)

Section titled “3. Full CI evaluation (GPT-4o, multi-sample)”
Terminal window
OPENAI_API_KEY=sk-... npm run evals:ci --workspace=packages/evals

Same as local but:

  • Uses gpt-4o as the judge
  • 3 samples per semantic test (instead of 1)
  • Higher concurrency (8 vs 2)
  • Hides the progress bar (CI-friendly output)

The CI mode is what should run in scheduled regression checks. Cost is bounded by the per-test estimate; you’ll see a warning if the projection exceeds $5.

Terminal window
OPENAI_API_KEY=sk-... \
npm run evals:ci --workspace=packages/evals -- --baseline

Adds baseline comparison on top of mode 3:

  • Loads golden/baselines/main-latest.json
  • Compares current drift to the prior snapshot
  • Exits with code 2 if any suite regressed by more than 5pp (configurable)
  • Overwrites the baseline only if the run passed both the absolute gate and the relative comparison

See Drift & Baselines for what counts as a regression.

FlagTypeDefaultWhat it does
--modelocal | cilocalPicks provider, concurrency, sample defaults
--suitesuite name(all)Restricts to one suite — orchestrator, memory, context-engine, integration
--samplesint1 local / 3 ciNumber of independent semantic samples per test
--deterministic-onlyflagfalseSkip the semantic track entirely
--baselineflagfalseLoad + compare + persist golden/baselines/main-latest.json
--baseline-noise-floorfloat5.0Minimum pp delta to flag as a regression
--commitstring(auto)Short git SHA stamped onto a new baseline snapshot

These compose freely. Some useful combinations:

Terminal window
# One suite only
npm run evals -- --suite memory
# Multi-sample to detect flakiness, no baseline
npm run evals:ci -- --samples 5
# Tight baseline tolerance
npm run evals:ci -- --baseline --baseline-noise-floor 1.0
# CI mode but only library tests (fast PR gate without API costs)
npm run evals -- --deterministic-only
VariableRequiredDefaultPurpose
OPENAI_API_KEYCI semantic trackGPT-4o judge
OLLAMA_BASE_URLLocalhttp://localhost:11434Ollama endpoint
OLLAMA_MODELLocalllama3:8b-instruct-q4_K_MLocal judge model
EVAL_MAX_CONCURRENCYNo2 / 8Parallel evaluations
EVAL_DRIFT_CEILINGNo5.0Drift % gate threshold

CLI flags override env vars where both apply.

CodeMeaning
0Clean run
1Drift gate failed OR a suite failed to load
2Baseline regression detected but drift gate passed

For CI scripts, the canonical pattern is to hard-fail on 1 and warn-only (or open an issue) on 2.

By default the runner writes to stdout. CI mode additionally emits GitHub Actions annotations (::error, ::warning) so failures appear inline on PRs.

Future enhancements (deferred from Phase 1):

  • Structured report.json artifact
  • GitHub step summary
  • JUnit XML for test-result aggregators

“Ollama: connection refused” — Start the server with ollama serve. Confirm the model is pulled: ollama list.

“Cost warning: estimated $X exceeds threshold” — The CI mode estimates total token usage. The default warning fires at $5; raise it via the provider option in code or accept it and continue (the warning is non-blocking).

“Baseline schema version mismatch” — You’re loading an older snapshot than the current code understands. Delete golden/baselines/main-latest.json and re-run with --baseline to bootstrap a fresh one.

“No prior baseline” — Expected on the first run with --baseline. The current run becomes the baseline.