Eval Harness

The Eval Harness (@cycgraph/evals) is the regression-detection layer for the project. It runs golden trajectories through the real @cycgraph/orchestrator, @cycgraph/memory, and @cycgraph/context-engine packages, captures observable behavior, and tells you whether a change in one package silently broke another.

It’s distinct from the orchestrator’s lightweight built-in runEval (which is a per-graph assertion framework). @cycgraph/evals operates one level up — it gates whole-package behavior across release cycles.

When to reach for this package

You shipped a change to one package and want to know if it altered observable behavior elsewhere
You want a stable signal for “did the LLM judge actually regress, or was it a flaky sample?”
You want to detect drift that hides under your absolute gate (e.g., a suite going from 0% to 4% drift while the ceiling is 5%)
You need a record of what the system actually does on a tagged commit so future regressions are attributable

Architecture

flowchart TD
  CLI["npm run evals --workspace=packages/evals"]
  CLI --> Runner["runEvals(config)"]
  Runner --> Det["Deterministic Track<br/>(in-process library calls)"]
  Runner --> Sem["SUT-driven Semantic Track<br/>(runSutDispatch + multi-sample)"]
  Det --> Drift["computeDrift()"]
  Sem --> Drift
  Drift --> Report["DriftReport"]
  Report --> Baseline["compareBaseline()"]
  Baseline --> Delta["BaselineDelta"]
  Report --> Out["stdout + GH annotations"]
  Delta --> Out

  Goldens[("golden/data/*.sqlite.gz")] -. drives .-> Det
  Goldens -. drives .-> Sem

Two tracks, one gate

Deterministic track — pure library calls in-process, no LLM. Fast (<1s), free, and produces sharp signal: segmentation, dedup, budget allocation, subgraph extraction, conflict detection, etc.
Semantic track — SUT-driven: each trajectory is dispatched through the real package code via runSutDispatch, and the LLM judge grades the observed output against the recorded golden. Configurable provider (Ollama locally, GPT-4o in CI) and configurable sample count for stability.

Both tracks feed into a single DriftReport aggregated by computeDrift(). The gate triggers when aggregate drift exceeds the configured ceiling.

Multi-sample stability

LLM judges are non-deterministic. A single low-scoring run can tank the gate; a single high-scoring run can mask a real regression. The semantic track defaults to 3 samples per metric in CI, using a majority-vote outcome and surfacing tests with inconsistent results as flaky — distinct from drift failure.

See Drift & Baselines for how flaky-vs-regressed is distinguished in the report.

For a runnable, adversarially-tested demonstration of eval-gated learning (poisoned lessons evicted on outcome evidence), see packages/evals/examples/eval-gated-learning/.

Eval Assertions — the four assertion families and when to use each
Drift & Baselines — what the drift metric means and how baselines extend it
Running Evals — CLI usage end to end
Recording Goldens — re-record from real SUT runs
Adding an Eval Suite — build a new suite from scratch
Adding a SUT Handler — extend the SUT to cover a new tag family

Eval Harness

When to reach for this package

Architecture

Two tracks, one gate

Multi-sample stability

Related