
Evaluations

Unit tests check code — does the function crash? Evals check behavior — did the workflow produce the right result? cycgraph includes a built-in eval framework for defining test cases, running workflows, and asserting on the final state.

Define a suite, run it, and inspect the report:

import { runEval, EvalSuite } from '@cycgraph/orchestrator';

const suite: EvalSuite = {
  name: 'My First Eval',
  cases: [
    {
      name: 'Research pipeline completes',
      graph: myGraph,
      input: { goal: 'Summarize recent AI news' },
      assertions: [
        { type: 'status_equals', expected: 'completed' },
        { type: 'node_visited', node_id: 'researcher' },
        { type: 'memory_contains', key: 'summary' },
      ],
    },
  ],
};

const report = await runEval(suite);
console.log(`Score: ${report.overall_score}`); // 0.0–1.0
console.log(`Passed: ${report.passed}/${report.total}`);

For each case in the suite:

  1. Build state — goal, constraints, and max_token_budget are extracted from input. The entire input object is seeded into memory.
  2. Run workflow — A GraphRunner executes the graph to completion (or failure/timeout).
  3. Assert — Each assertion is checked against the final WorkflowState.
  4. Score — Case score = passed assertions / total assertions. Overall score = mean of all case scores.

Cases run sequentially to avoid LLM provider contention. If a workflow crashes, the case gets a score of 0 and an error field — other cases continue unaffected.
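The sequencing and crash semantics above can be sketched as a simple loop. This is a hypothetical illustration, not cycgraph's actual internals; `run` here stands in for executing one case and returning its score:

```typescript
// Hypothetical sketch of the per-case loop: cases run one at a
// time, a crash scores 0 and records the error, and later cases
// still execute.
interface CaseResult {
  name: string;
  score: number;   // 0.0–1.0
  error?: string;  // set only if the workflow crashed
}

async function runSequentially(
  cases: Array<{ name: string; run: () => Promise<number> }>,
): Promise<CaseResult[]> {
  const results: CaseResult[] = [];
  for (const c of cases) {            // sequential: avoids provider contention
    try {
      results.push({ name: c.name, score: await c.run() });
    } catch (err) {
      // A crash does not abort the suite; score the case 0 and continue.
      results.push({ name: c.name, score: 0, error: String(err) });
    }
  }
  return results;
}
```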

Check the workflow’s final status:

{ type: 'status_equals', expected: 'completed' }
{ type: 'status_equals', expected: 'waiting' } // for HITL workflows

Verify a specific node executed:

{ type: 'node_visited', node_id: 'researcher' }

Check that a key exists in the final state memory:

{ type: 'memory_contains', key: 'summary' }

Inspect a memory value with three matching modes:

// Exact match (JSON equality)
{ type: 'memory_matches', key: 'count', mode: 'exact', expected: 42, pattern: '' }
// Substring match
{ type: 'memory_matches', key: 'output', mode: 'contains', expected: 'hello', pattern: '' }
// Regex match (against stringified value)
{ type: 'memory_matches', key: 'output', mode: 'regex', pattern: '^hello\\s\\w+$' }

Verify the workflow stayed within its token budget:

{ type: 'token_budget_respected' }
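Since step 1 of the lifecycle extracts max_token_budget from input, a case can seed a budget and then assert it was respected. A sketch (myGraph is assumed to be defined elsewhere):

```typescript
// Sketch: seed max_token_budget via input, then assert the
// workflow stayed within it.
const budgetCase = {
  name: 'Stays under budget',
  graph: myGraph, // assumed to exist in scope
  input: { goal: 'Summarize recent AI news', max_token_budget: 5000 },
  assertions: [
    { type: 'status_equals', expected: 'completed' },
    { type: 'token_budget_respected' },
  ],
};
```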

Use an LLM evaluator agent to score the output against criteria. This is the only probabilistic assertion — all others are deterministic.

{
  type: 'llm_judge',
  criteria: 'Is the summary accurate, well-structured, and under 300 words?',
  threshold: 0.75, // minimum passing score (0.0–1.0)
  evaluator_agent_id: EVALUATOR_ID, // UUID of a registered evaluator agent
}

The evaluator agent calls generateText() with a structured output schema and returns a score (0.0–1.0), reasoning, and optional suggestions. The assertion passes if score >= threshold.
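The judge's result shape and pass rule can be sketched as follows. Field names mirror the prose (score, reasoning, suggestions) but are an illustration, not cycgraph's exported types:

```typescript
// Sketch of the structured judge output and the pass rule
// (score >= threshold) described above.
interface JudgeResult {
  score: number;          // 0.0–1.0
  reasoning: string;
  suggestions?: string[];
}

function judgePasses(result: JudgeResult, threshold: number): boolean {
  return result.score >= threshold; // passes at or above the threshold
}
```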

interface EvalSuite {
  name: string;
  cases: EvalCase[];
}

interface EvalCase {
  name: string; // Human-readable case name
  graph: Graph; // The graph to execute
  input: Record<string, unknown>; // Initial memory (goal, constraints, etc.)
  assertions: EvalAssertion[]; // What to check
  timeout_ms?: number; // Workflow timeout (default: 60000ms)
}

runEval() returns a detailed report:

interface EvalReport {
  suite_name: string;
  cases: EvalCaseResult[];
  overall_score: number; // Mean of all case scores (0.0–1.0)
  total: number; // Total cases
  passed: number; // Cases where all assertions passed
  failed: number; // Cases with at least one failure
  duration_ms: number; // Wall-clock duration
}

interface EvalCaseResult {
  name: string;
  passed: boolean; // All assertions passed?
  score: number; // Fraction of assertions that passed
  duration_ms: number;
  assertions: AssertionResult[];
  error?: string; // Set if workflow crashed
}

interface AssertionResult {
  assertion: EvalAssertion;
  passed: boolean;
  actual?: unknown; // Observed value
  message?: string; // Failure explanation
}
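A report shaped like these interfaces can be triaged programmatically, e.g. to print one line per failure. A minimal sketch, with the types pared down to just the fields used:

```typescript
// Collect one human-readable line per failure in a report shaped
// like EvalReport above (crashes and failed assertions).
interface Failure { caseName: string; message: string }

function collectFailures(report: {
  cases: Array<{
    name: string;
    assertions: Array<{ passed: boolean; message?: string }>;
    error?: string;
  }>;
}): Failure[] {
  const failures: Failure[] = [];
  for (const c of report.cases) {
    if (c.error) failures.push({ caseName: c.name, message: `crashed: ${c.error}` });
    for (const a of c.assertions) {
      if (!a.passed) {
        failures.push({ caseName: c.name, message: a.message ?? 'assertion failed' });
      }
    }
  }
  return failures;
}
```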

cycgraph ships with three example suites that demonstrate common patterns.

Tests a 2-node tool pipeline (fetch → transform):

const suite: EvalSuite = {
  name: 'Linear Completion',
  cases: [
    {
      name: 'Two tool nodes complete successfully',
      graph: linearGraph,
      input: { goal: 'Fetch and transform data' },
      assertions: [
        { type: 'status_equals', expected: 'completed' },
        { type: 'node_visited', node_id: 'fetch' },
        { type: 'node_visited', node_id: 'transform' },
        { type: 'memory_contains', key: 'fetch_result' },
        { type: 'memory_contains', key: 'transform_result' },
      ],
    },
  ],
};

Tests a router dispatching to a worker:

assertions: [
  { type: 'status_equals', expected: 'completed' },
  { type: 'node_visited', node_id: 'router' },
  { type: 'node_visited', node_id: 'worker' },
  { type: 'memory_contains', key: 'worker_result' },
],

Tests that the workflow pauses at an approval gate (status is waiting, not completed):

assertions: [
  { type: 'status_equals', expected: 'waiting' },
  { type: 'node_visited', node_id: 'prepare' },
  { type: 'node_visited', node_id: 'review' },
  { type: 'memory_contains', key: 'prepare_result' },
],
Run the example suites from a terminal:
cd packages/orchestrator
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/linear-completion.ts
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/supervisor-routing.ts
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/hitl-approval.ts
  • A case with 3/5 passing assertions scores 0.6 and is marked passed: false.
  • A case with 0 assertions scores 1.0 (all assertions trivially pass).
  • The suite’s overall_score is the mean of all case scores.
  • A case that crashes before assertions are checked scores 0 with the error captured in error.
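The scoring rules above reduce to two small formulas; as a sketch:

```typescript
// Scoring rules from the bullets above: a case scores
// passed/total (1.0 when it has no assertions), and the suite's
// overall_score is the mean of its case scores.
function caseScore(passedAssertions: number, totalAssertions: number): number {
  return totalAssertions === 0 ? 1 : passedAssertions / totalAssertions;
}

function overallScore(caseScores: number[]): number {
  if (caseScores.length === 0) return 0; // assumed behavior for an empty suite
  return caseScores.reduce((sum, s) => sum + s, 0) / caseScores.length;
}
```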