Running the Eval Harness
The eval harness has one entry point — npm run evals — and a handful of flags that compose. This guide walks through the modes you’ll actually use.
The four common modes
Section titled “The four common modes”1. Fast local sanity check (no LLM)
Section titled “1. Fast local sanity check (no LLM)”npm run evals --workspace=packages/evals -- --deterministic-onlyRuns ~36 library-level tests in <1 second. Catches:
- Compression ratio regressions
- Dedup behavior changes
- Budget enforcement breaks
- Memory segmentation, temporal filtering, subgraph extraction
- Conflict detection rules
Use as a pre-commit hook or a fast PR signal. Does not exercise any LLM-bound code path.
2. Full local evaluation (Ollama)
Section titled “2. Full local evaluation (Ollama)”npm run evals --workspace=packages/evalsRuns both tracks against a local Ollama model. Free but slower (~30s) and the judge is weaker than a frontier model — useful for “does this even compile” but not for production gating.
3. Full CI evaluation (GPT-4o, multi-sample)
Section titled “3. Full CI evaluation (GPT-4o, multi-sample)”OPENAI_API_KEY=sk-... npm run evals:ci --workspace=packages/evalsSame as local but:
- Uses
gpt-4oas the judge - 3 samples per semantic test (instead of 1)
- Higher concurrency (
8vs2) - Hides the progress bar (CI-friendly output)
The CI mode is what should run in scheduled regression checks. Cost is bounded by the per-test estimate; you’ll see a warning if the projection exceeds $5.
4. Baselined CI evaluation
Section titled “4. Baselined CI evaluation”OPENAI_API_KEY=sk-... \ npm run evals:ci --workspace=packages/evals -- --baselineAdds baseline comparison on top of mode 3:
- Loads
golden/baselines/main-latest.json - Compares current drift to the prior snapshot
- Exits with code
2if any suite regressed by more than 5pp (configurable) - Overwrites the baseline only if the run passed both the absolute gate and the relative comparison
See Drift & Baselines for what counts as a regression.
CLI reference
Section titled “CLI reference”| Flag | Type | Default | What it does |
|---|---|---|---|
--mode | local | ci | local | Picks provider, concurrency, sample defaults |
--suite | suite name | (all) | Restricts to one suite — orchestrator, memory, context-engine, integration |
--samples | int | 1 local / 3 ci | Number of independent semantic samples per test |
--deterministic-only | flag | false | Skip the semantic track entirely |
--baseline | flag | false | Load + compare + persist golden/baselines/main-latest.json |
--baseline-noise-floor | float | 5.0 | Minimum pp delta to flag as a regression |
--commit | string | (auto) | Short git SHA stamped onto a new baseline snapshot |
Combining flags
Section titled “Combining flags”These compose freely. Some useful combinations:
# One suite onlynpm run evals -- --suite memory
# Multi-sample to detect flakiness, no baselinenpm run evals:ci -- --samples 5
# Tight baseline tolerancenpm run evals:ci -- --baseline --baseline-noise-floor 1.0
# CI mode but only library tests (fast PR gate without API costs)npm run evals -- --deterministic-onlyEnvironment variables
Section titled “Environment variables”| Variable | Required | Default | Purpose |
|---|---|---|---|
OPENAI_API_KEY | CI semantic track | — | GPT-4o judge |
OLLAMA_BASE_URL | Local | http://localhost:11434 | Ollama endpoint |
OLLAMA_MODEL | Local | llama3:8b-instruct-q4_K_M | Local judge model |
EVAL_MAX_CONCURRENCY | No | 2 / 8 | Parallel evaluations |
EVAL_DRIFT_CEILING | No | 5.0 | Drift % gate threshold |
CLI flags override env vars where both apply.
Exit codes
Section titled “Exit codes”| Code | Meaning |
|---|---|
0 | Clean run |
1 | Drift gate failed OR a suite failed to load |
2 | Baseline regression detected but drift gate passed |
For CI scripts, the canonical pattern is to hard-fail on 1 and warn-only (or open an issue) on 2.
Output
Section titled “Output”By default the runner writes to stdout. CI mode additionally emits GitHub Actions annotations (::error, ::warning) so failures appear inline on PRs.
Future enhancements (deferred from Phase 1):
- Structured
report.jsonartifact - GitHub step summary
- JUnit XML for test-result aggregators
Troubleshooting
Section titled “Troubleshooting”“Ollama: connection refused” — Start the server with ollama serve. Confirm the model is pulled: ollama list.
“Cost warning: estimated $X exceeds threshold” — The CI mode estimates total token usage. The default warning fires at $5; raise it via the provider option in code or accept it and continue (the warning is non-blocking).
“Baseline schema version mismatch” — You’re loading an older snapshot than the current code understands. Delete golden/baselines/main-latest.json and re-run with --baseline to bootstrap a fresh one.
“No prior baseline” — Expected on the first run with --baseline. The current run becomes the baseline.
Related
Section titled “Related”- Eval Harness — what’s running under the hood
- Recording Goldens — refresh the trajectories the harness uses
- Drift & Baselines — what the numbers mean