Adding an Eval Suite
A suite in @cycgraph/evals is a per-package collection of golden trajectories with two optional execution tracks:
- A deterministic track that runs library code in-process and emits assertion results
- An SUT-driven semantic track that dispatches trajectories through the real package code and grades the observed output with an LLM judge
This guide walks through adding a new suite — covering the directory layout, the runner registration, and the recording integration.
When to add a suite
Section titled “When to add a suite”- You’re adding a new sibling package (e.g.,
@cycgraph/audio-engine) that needs its own regression gate - An existing package has grown enough that its existing suite is unwieldy and warrants splitting
- You want a cross-cutting integration suite that exercises a multi-package flow (like the existing
integrationsuite)
Don’t add a suite to test a single feature — add a trajectory to an existing suite instead.
Directory layout
Section titled “Directory layout”packages/evals/src/suites/<your-suite>/└── suite.ts # runDeterministic + buildSutSuite exportsNo prompts.ts or assertions.ts files — the SUT-driven contract declares metrics and trajectory IDs directly. The rubric metric IS the prompt.
Step 1: declare the suite name in the schema
Section titled “Step 1: declare the suite name in the schema”Add your suite name to SuiteNameSchema in src/dataset/schema.ts:
export const SuiteNameSchema = z.enum([ 'context-engine', 'memory', 'orchestrator', 'integration', 'your-suite', // ← add it here]);Type narrowing through SuiteName will surface the rest of the changes you need.
Step 2: define the suite module
Section titled “Step 2: define the suite module”The suite module exports two functions (both optional, but most suites implement at least one):
import type { TestCaseResults } from '../../assertions/drift-calculator.js';import type { SutSuiteConfig } from '../sut-contract.js';import { loadGoldenTrajectories } from '../../dataset/loader.js';import { FAITHFULNESS } from '../../assertions/semantic-judge.js';
/** Deterministic track — no LLM needed. */export async function runDeterministic(): Promise<TestCaseResults[]> { // ... run library calls, collect DeterministicResult arrays, return as // TestCaseResults entries with suite: 'your-suite'}
/** SUT-driven semantic track — declares trajectories + metrics. */export async function buildSutSuite(): Promise<SutSuiteConfig> { const trajectories = loadGoldenTrajectories('your-suite'); return { name: 'your-suite', tests: trajectories.map(t => ({ trajectoryId: t.id, description: t.description, metrics: [{ metric: FAITHFULNESS }], structuralAssertions: false, })), };}A suite that only has a deterministic track returns an empty SUT config ({ name, tests: [] }); the runner accepts that and skips the semantic phase for it.
Deterministic-only example
Section titled “Deterministic-only example”import { assertGreaterThanOrEqual, assertEqual,} from '../../assertions/deterministic.js';
export async function runDeterministic(): Promise<TestCaseResults[]> { const results: TestCaseResults[] = [];
results.push({ suite: 'your-suite', zodResults: [], semanticResults: [], deterministicResults: [ assertGreaterThanOrEqual('feature_x_threshold', 0.87, 0.80, 'X must hit 0.80'), assertEqual('feature_y_count', 5, 5, 'Y returns exactly 5 items'), ], });
return results;}
export async function buildSutSuite(): Promise<SutSuiteConfig> { return { name: 'your-suite', tests: [] };}Semantic with multiple metrics
Section titled “Semantic with multiple metrics”import { ANSWER_RELEVANCY, FAITHFULNESS,} from '../../assertions/semantic-judge.js';
export async function buildSutSuite(): Promise<SutSuiteConfig> { const trajectories = loadGoldenTrajectories('your-suite'); return { name: 'your-suite', tests: trajectories.map(t => ({ trajectoryId: t.id, description: t.description, metrics: [ { metric: ANSWER_RELEVANCY, threshold: 0.85 }, { metric: FAITHFULNESS }, ], // Enable when the trajectory's golden declares expectedToolCalls structuralAssertions: t.expectedToolCalls !== undefined, })), };}Step 3: register in the runner’s static dispatch
Section titled “Step 3: register in the runner’s static dispatch”The runner uses a static import map (not a dynamic path) to load suites. Add a case in both dispatchers in src/runner/runner.ts:
async function getDeterministicEntrypoint( suiteName: SuiteName,): Promise<DeterministicSuiteModule | null> { switch (suiteName) { // ... existing cases case 'your-suite': return await import('../suites/your-suite/suite.js') as DeterministicSuiteModule; }}
async function loadSutSuiteConfig( suiteName: SuiteName,): Promise<SutSuiteConfig> { switch (suiteName) { // ... existing cases case 'your-suite': { const mod = await import('../suites/your-suite/suite.js'); return mod.buildSutSuite(); } }}Static imports are intentional — they make missing suites a compile error rather than a runtime mystery and they’re bundler-friendly.
Step 4: add the suite to default loading (optional)
Section titled “Step 4: add the suite to default loading (optional)”If your suite should run by default (no --suite filter), update the default in runEvals:
const suitesToLoad = config.suites ?? (['context-engine', 'memory', 'orchestrator', 'your-suite'] as SuiteName[]);Otherwise users need to opt in via --suite your-suite.
Step 5: seed golden trajectories
Section titled “Step 5: seed golden trajectories”Hand-author your first trajectories in scripts/seed-golden-v2.ts (or a new seed script for your suite):
const yourSuiteTrajectories: GoldenTrajectory[] = [ t('your-suite', 'Test: descriptive name', 'input content', 'expected output', ['tag1', 'tag2']), // ...];
writeGoldenDataset('your-suite', yourSuiteTrajectories, '3.0.0');Run the seed script:
npx tsx packages/evals/scripts/seed-golden-v2.tsThis creates golden/data/your-suite-v1.sqlite.gz and updates the manifest.
Step 6: extend the recording planner
Section titled “Step 6: extend the recording planner”Hand-authored expected outputs are fine to start. When you’re ready to ground them in real behavior, extend src/sut/recording-planner.ts to classify your suite’s trajectories:
function planYourSuiteTrajectory(trajectory: GoldenTrajectory): RecordingPlan { if (isYourSuiteTrajectorySupported(trajectory)) { return { trajectory, supported: true }; } return { trajectory, supported: false, skipReason: '...' };}
// In planForTrajectory:case 'your-suite': return planYourSuiteTrajectory(trajectory);Then build a src/sut/your-suite-sut.ts with handlers for each tag family — see Adding a SUT Handler.
Step 7: write tests
Section titled “Step 7: write tests”Every suite should have a test that confirms:
- The deterministic track passes on known-good fixtures
buildSutSuite()returns a config referencing valid trajectory IDs- Each handler dispatches correctly for its tag family
- The trajectories round-trip through the schema
Look at test/suites/memory.test.ts for the canonical pattern.
Step 8: verify end-to-end
Section titled “Step 8: verify end-to-end”# Type check + buildnpm run build --workspace=packages/evals
# Run only your suitenpm run evals --workspace=packages/evals -- --deterministic-only --suite your-suite
# Confirm trajectories loadnpx tsx packages/evals/scripts/record-goldens.ts --suite your-suite --plan-only
# Full semantic gate with stub or local LLMnpm run evals --workspace=packages/evals -- --suite your-suiteRelated
Section titled “Related”- Eval Harness — overall architecture
- Adding a SUT Handler — extending the recording layer
- Eval Assertions — choosing the right assertion family