Adding a SUT Handler
The SUT (System-Under-Test) layer in @cycgraph/evals is what makes recording possible — it wraps the real library APIs and returns observable output for each trajectory. When you add a new trajectory tag family that the existing handlers don’t cover, you extend the SUT to handle it.
This guide walks through both kinds of extension:
- Deterministic suites (memory, context-engine) — add a tag-routed handler function
- Orchestrator suite — add a new reference graph builder + planner case
Pattern 1: Deterministic SUT handler
Section titled “Pattern 1: Deterministic SUT handler”The memory and context-engine SUTs follow the same shape — a tag-matching dispatcher that resolves the trajectory’s input through one of several library entry points.
Anatomy of a handler
Section titled “Anatomy of a handler”// src/sut/memory-sut.ts (excerpt)
interface MemoryHandler { name: string; matches(tags: Set<string>): boolean; run(input: string): Promise<unknown>;}
const segmentationHandler: MemoryHandler = { name: 'segmentation', matches: (tags) => tags.has('segmentation') || tags.has('episodes'), async run(input) { const messages = parseMessages(input); const segmenter = new SimpleEpisodeSegmenter({ gap_threshold_ms: 30 * 60 * 1000, }); const episodes = await segmenter.segment(messages); return { episodes: episodes.length, topics: episodes.map(e => e.topic), message_counts: episodes.map(e => e.messages.length), }; },};Step 1: Pick a tag family
Section titled “Step 1: Pick a tag family”Look at the trajectory tags that have no handler today. The record-goldens.ts --plan-only script lists skipped trajectories with their tags:
[10/18] SKIP 0252e64a Subgraph: 1-hop returns direct neighbors — No memory handler for tags [subgraph, graph] yetThat tells you the family is subgraph/graph.
Step 2: Write the handler
Section titled “Step 2: Write the handler”Add a new Handler constant in the suite’s SUT file:
const subgraphHandler: MemoryHandler = { name: 'subgraph', matches: (tags) => tags.has('subgraph') || tags.has('graph'), async run(input) { const { seed_entities, max_hops, valid_at } = parseSubgraphInput(input); const { store } = await buildSeededMemoryGraph(); const result = await extractSubgraph(store, seed_entities, { max_hops, valid_at, });
return { entities: result.entities.map(e => e.id), relationships: result.relationships.map(r => ({ type: r.relation_type, source: r.source_id, target: r.target_id, })), }; },};Things to keep in mind:
- Output shape matters. This object becomes the new
expectedOutputfor every trajectory the handler runs. Keep it stable across versions — adding fields is safer than removing them. - Parsers should tolerate input shape variants. Trajectory inputs are JSON strings authored at different times; be permissive about what you accept (
input.max_hops ?? 1). - Side effects belong inside the handler, not at module scope. Build a fresh store per call so concurrent recordings don’t interfere.
Step 3: Register the handler
Section titled “Step 3: Register the handler”Add the handler to the HANDLERS array. Order matters when one trajectory’s tags could match multiple handlers — earlier entries win:
const HANDLERS: MemoryHandler[] = [ segmentationHandler, temporalHandler, extractionHandler, subgraphHandler, // ← new consolidationHandler, conflictHandler,];Step 4: Use shared fixtures where appropriate
Section titled “Step 4: Use shared fixtures where appropriate”If your handler needs a seeded store (entities, relationships, facts), put the seeding logic in src/sut/fixtures/. The memory suite uses buildSeededMemoryGraph() for subgraph / consolidation / conflict handlers so all three share one canonical world.
export async function buildSeededMemoryGraph(): Promise<{ store: InMemoryMemoryStore; index: InMemoryMemoryIndex;}> { // ... fresh store + index every call}Step 5: Write tests
Section titled “Step 5: Write tests”For each new handler, add a describe block to the suite’s test file:
describe('runMemorySut — subgraph', () => { it('extracts a 1-hop neighborhood from the seeded fixture', async () => { const input = JSON.stringify({ seed_entities: ['e-alice'], max_hops: 1 }); const result = await runMemorySut({ trajectory: makeTrajectory(['subgraph', 'graph'], input), }); expect(result.status).toBe('completed'); const parsed = JSON.parse(result.output); expect(parsed.entities).toContain('e-alice'); });});Run them:
npx vitest run test/sut/memory-sut.test.tsStep 6: Confirm planner coverage
Section titled “Step 6: Confirm planner coverage”npx tsx scripts/record-goldens.ts --suite memory --plan-onlyThe previously-skipped trajectories should now show as supported.
Pattern 2: Orchestrator reference graph
Section titled “Pattern 2: Orchestrator reference graph”The orchestrator suite uses reference graphs instead of handlers — a trajectory’s tags map to a graph builder (single-agent, supervisor, branching, retry), and the SUT runs that graph against a real LLM.
Step 1: Pick a graph shape
Section titled “Step 1: Pick a graph shape”If your new tag family fits an existing graph (most do), skip to the planner change. Add a new graph only when the topology genuinely differs — e.g., a swarm graph, an evaluator-optimizer loop, a HITL graph.
Most variations belong inside the agent’s prompt or tool fixtures rather than the graph topology. For example, branching reuses a single-agent graph with a structured-output prompt; retry reuses single-agent with a stateful flaky tool.
Step 2: Build the graph
Section titled “Step 2: Build the graph”Follow the pattern from src/sut/graphs/supervisor.ts:
import { InMemoryAgentRegistry, createGraph, createWorkflowState,} from '@cycgraph/orchestrator';import type { Graph, WorkflowState, AgentRegistry } from '@cycgraph/orchestrator';
export interface YourGraphOptions { input: string; model?: string; outputKey?: string;}
export interface YourGraphArtifacts { graph: Graph; initialState: WorkflowState; agentRegistry: AgentRegistry; outputKey: string;}
export function buildYourGraph(opts: YourGraphOptions): YourGraphArtifacts { // ... register agents, build graph, return artifacts}Every builder returns the same {graph, initialState, agentRegistry, outputKey} shape so the recording dispatcher can call them uniformly.
Step 3: Add a tool-fixture profile if needed
Section titled “Step 3: Add a tool-fixture profile if needed”If your graph uses MCP-typed tools, add a OrchestratorToolKind and a resolveToolFixtures case in scripts/record-goldens.ts. Stateful fixtures (closures over counters) MUST be re-created per sample — that’s why fixture resolution happens inside the per-sample dispatcher rather than in the planner.
case 'your_tool_kind': return { toolResponses: { your_tool: createYourTool(), }, toolDescriptions: { your_tool: 'Description shown to the LLM', }, };Step 4: Route trajectories to the new graph
Section titled “Step 4: Route trajectories to the new graph”Update src/sut/recording-planner.ts:
function planOrchestratorTrajectory(trajectory: GoldenTrajectory): RecordingPlan { const tags = new Set(trajectory.tags ?? []);
if (tags.has('your-new-tag')) { return { trajectory, supported: true, graphKind: 'your-graph', toolKind: 'your_tool_kind', }; }
// ... existing rules}Add 'your-graph' to the OrchestratorGraphKind union type at the top of the file.
Step 5: Dispatch in the recording script
Section titled “Step 5: Dispatch in the recording script”Update runOrchestratorSample in scripts/record-goldens.ts to handle plan.graphKind === 'your-graph':
if (plan.graphKind === 'your-graph') { const artifacts = buildYourGraph({ input: plan.trajectory.input, model, }); return runOrchestratorSut({ graph: artifacts.graph, initialState: artifacts.initialState, agentRegistry: artifacts.agentRegistry, toolResponses, toolDescriptions, outputKey: artifacts.outputKey, });}Step 6: Test the planner classification
Section titled “Step 6: Test the planner classification”Add a case to test/sut/recording-planner.test.ts:
it('routes your-new-tag trajectories to your-graph', () => { const plan = planForTrajectory( 'orchestrator', makeTrajectory('orchestrator', ['your-new-tag']), ); expect(plan.graphKind).toBe('your-graph');});Step 7: Smoke-test recording
Section titled “Step 7: Smoke-test recording”ANTHROPIC_API_KEY=sk-ant-... \ npx tsx scripts/record-goldens.ts --suite orchestratorIf everything’s wired correctly, the previously-skipped trajectory now shows REC and contributes to the diff report.
Anti-patterns
Section titled “Anti-patterns”- Sharing mutable state between handler calls. Build fresh stores/fixtures per
run()invocation. Test concurrent runs to confirm. - Hard-coding output shapes that include nondeterministic values. UUIDs and timestamps will diverge across samples and fail the stability check. Either strip them from the output or pin them at the input.
- Adding a “supervisor + tools” graph when “single-agent + tools” suffices. Topology variations are expensive to test; prefer reusing graphs with different prompt/tool configs.
- Putting LLM-bound code in a deterministic handler. Memory and context-engine handlers should never call an LLM, even indirectly. If you need LLM judgment, the trajectory belongs in the orchestrator suite.
Related
Section titled “Related”- Eval Harness — overall architecture
- Recording Goldens — how the SUT layer plugs into recording
- Adding an Eval Suite — when you need a whole new suite, not just a new handler