Error Handling

The orchestrator has a structured error hierarchy so that every failure mode has a clear type, category, and recovery path. Errors are never swallowed — they either trigger a retry, trip a circuit breaker, or terminate the run with a precise reason.

Error class hierarchy

Class	Module	Key Properties	When Thrown
`BudgetExceededError`	`runner/errors`	`tokensUsed`, `budget`	Token budget exceeded during workflow
`WorkflowTimeoutError`	`runner/errors`	`workflowId`, `runId`, `elapsedMs`	Wall-clock time exceeded
`NodeConfigError`	`runner/errors`	`nodeId`, `nodeType`, `missingField`	Required config missing from a node
`CircuitBreakerOpenError`	`runner/errors`	`nodeId`	Node circuit breaker is open
`EventLogCorruptionError`	`runner/errors`	`runId`	Missing/corrupt events during recovery
`UnsupportedNodeTypeError`	`runner/errors`	`nodeType`	Unknown node type encountered
`NoMatchingEdgeError`	`runner/errors`	`nodeId`	A non-end node has no matching outgoing edge — a routing dead-end
`PermissionDeniedError`	`agent-executor/errors`	—	Agent writes to unauthorized keys
`AgentTimeoutError`	`agent-executor/errors`	`partialUsage`	Agent LLM call exceeds timeout
`AgentExecutionError`	`agent-executor/errors`	`cause`, `retryable`, `partialUsage`	Agent LLM call fails (non-timeout)
`AgentNotFoundError`	`agent-factory/errors`	—	Agent ID not found in a configured registry (fail closed)
`AgentLoadError`	`agent-factory/errors`	`cause`	Registry lookup fails (transient)
`SupervisorConfigError`	`supervisor-executor/errors`	`supervisorId`	Supervisor missing config
`SupervisorRoutingError`	`supervisor-executor/errors`	`chosenNode`, `allowedNodes`	Supervisor routes to invalid node
`ArchitectError`	`architect/errors`	—	Graph generation fails after retries
`MCPServerNotFoundError`	`mcp/errors`	`serverId`	MCP server registry has no entry for the requested ID
`MCPAccessDeniedError`	`mcp/errors`	`serverId`, `agentId`	Agent does not have permission to access the MCP server
`PersistenceUnavailableError`	`db/persistence-health`	—	Consecutive persistence failures exceed threshold
`EventSequenceConflictError`	`db/event-log`	`runId`, `sequenceId`	An event append collided with an existing `(run_id, sequence_id)` — two writers on one run
`StaleClaimError`	`persistence/errors`	`runId`, `staleEpoch`, `currentEpoch`	A fenced write carried an outdated claim epoch — another worker owns the run

All errors extend Error and set this.name to their class name, enabling reliable switch(error.name) handling across module boundaries.
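The naming convention can be illustrated with a minimal sketch. The constructor signatures and the describeError helper below are assumptions made for illustration; the real classes may take different arguments. Only the property names come from the table above.

```typescript
// Sketch only: constructor shapes are assumed, not the library's actual signatures.
class BudgetExceededError extends Error {
  readonly tokensUsed: number;
  readonly budget: number;

  constructor(tokensUsed: number, budget: number) {
    super(`Token budget exceeded: ${tokensUsed} > ${budget}`);
    this.name = 'BudgetExceededError'; // explicit name, as the convention requires
    this.tokensUsed = tokensUsed;
    this.budget = budget;
  }
}

class NodeConfigError extends Error {
  readonly nodeId: string;
  readonly nodeType: string;
  readonly missingField: string;

  constructor(nodeId: string, nodeType: string, missingField: string) {
    super(`Node ${nodeId} (${nodeType}) is missing "${missingField}"`);
    this.name = 'NodeConfigError';
    this.nodeId = nodeId;
    this.nodeType = nodeType;
    this.missingField = missingField;
  }
}

// Hypothetical helper: maps an error to a short operator-facing hint by name.
function describeError(error: Error): string {
  switch (error.name) {
    case 'BudgetExceededError':
      return 'budget exhausted';
    case 'NodeConfigError':
      return 'fix the graph definition';
    default:
      return 'unclassified';
  }
}
```

Matching on the name string rather than on instanceof keeps the check working when two copies of a module are loaded, which is one plausible reason for the convention.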

Retryable vs fatal

Error	Retryable?	Notes
`AgentTimeoutError`	Yes	Retried per `failure_policy.max_retries`
`AgentExecutionError`	Yes	With exponential backoff
`MCPServerNotFoundError`	No	Fix tool sources or register the server
`MCPAccessDeniedError`	No	Security violation — fix agent permissions
`CircuitBreakerOpenError`	Auto	Transitions to half-open after timeout
`NodeConfigError`	No	Fix the graph definition
`UnsupportedNodeTypeError`	No	Fix the graph definition
`BudgetExceededError`	No	Budget is exhausted for the run
`WorkflowTimeoutError`	No	Max execution time reached
`EventLogCorruptionError`	No	Manual intervention required
`PersistenceUnavailableError`	No	Halts to prevent data loss
`PermissionDeniedError`	No	Security violation — fix agent permissions
`SupervisorRoutingError`	No	Supervisor bug — fix agent prompt or managed_nodes

Recovery patterns

Node execution — retry with backoff

GraphRunner.executeNodeWithRetry() handles this automatically:

Catch error from node executor
Check retry count against failure_policy.max_retries
If retryable: backoff → retry
If exhausted or fatal: dispatch _fail action
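The steps above can be sketched as a small retry loop. This is not the implementation of GraphRunner.executeNodeWithRetry(); the function name, the base_delay_ms field, and the injectable sleep parameter are assumptions, and the retryable set is taken from the "Retryable vs fatal" table.

```typescript
// Sketch only: policy shape and helper names are assumptions for illustration.
interface FailurePolicy {
  max_retries: number;
  base_delay_ms: number; // hypothetical field, not confirmed by the source
}

// Error names the table lists as retryable.
const RETRYABLE_ERRORS = new Set(['AgentTimeoutError', 'AgentExecutionError']);

async function executeWithRetry<T>(
  execute: () => Promise<T>,
  policy: FailurePolicy,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await execute();
    } catch (err) {
      const name = err instanceof Error ? err.name : '';
      const exhausted = attempt >= policy.max_retries;
      // Fatal or exhausted: rethrow so the caller can dispatch _fail.
      if (!RETRYABLE_ERRORS.has(name) || exhausted) throw err;
      // Exponential backoff: base, 2x base, 4x base, ...
      await sleep(policy.base_delay_ms * 2 ** attempt);
    }
  }
}
```

With max_retries set to 3 and base_delay_ms set to 100, a node that fails twice with AgentTimeoutError waits 100 ms, then 200 ms, and succeeds on the third call. Injecting sleep keeps the loop testable without real timers.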

Circuit breaker — automatic recovery

CircuitBreakerManager handles the state machine:

stateDiagram-v2
    direction LR
    CLOSED --> OPEN : failures ≥ threshold
    OPEN --> HALF_OPEN : timeout
    HALF_OPEN --> CLOSED : success
    HALF_OPEN --> OPEN : failure

CircuitBreakerOpenError is thrown when the breaker is OPEN and the timeout has not yet elapsed. Once the timeout elapses, the breaker moves to HALF_OPEN and allows one probe attempt: a success closes the breaker, a failure reopens it.
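The state machine can be sketched in a few lines. The class below is a hypothetical stand-in, not CircuitBreakerManager itself: its name, constructor parameters, and methods are assumptions, and it does not limit concurrent probes in HALF_OPEN the way a production breaker would need to.

```typescript
// Sketch only: names and constructor parameters are assumptions.
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class SimpleBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private openedAt = 0;
  private readonly threshold: number;
  private readonly timeoutMs: number;
  private readonly now: () => number;

  constructor(threshold: number, timeoutMs: number, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Returns the state after applying the OPEN -> HALF_OPEN timeout transition.
  current(): BreakerState {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.timeoutMs) {
      this.state = 'HALF_OPEN'; // timeout elapsed: allow a probe
    }
    return this.state;
  }

  // HALF_OPEN -> CLOSED on a successful probe; also clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  // CLOSED -> OPEN at the threshold; HALF_OPEN -> OPEN on a failed probe.
  recordFailure(): void {
    this.failures++;
    if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}
```

A caller would check current() before executing the node and throw CircuitBreakerOpenError while it returns 'OPEN'.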

Persistence degradation — progressive failure

persistWorkflow() tracks consecutive failures:

1st failure: log warning, continue
2nd failure: log warning, continue
3rd failure (threshold): throw PersistenceUnavailableError → halt workflow

Any success resets the counter to 0.
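A minimal sketch of this counter follows. It wraps an arbitrary save function rather than reproducing persistWorkflow(); the createPersistenceGuard name, its parameters, and the error's constructor are assumptions. The threshold of 3 comes from the list above.

```typescript
// Sketch only: this wrapper and its names are assumptions, not persistWorkflow() itself.
class PersistenceUnavailableError extends Error {
  constructor(failures: number) {
    super(`Persistence failed ${failures} times in a row`);
    this.name = 'PersistenceUnavailableError';
  }
}

function createPersistenceGuard(
  save: () => Promise<void>,
  threshold: number = 3,
  warn: (message: string) => void = () => {},
): () => Promise<void> {
  let consecutiveFailures = 0;

  return async function persist(): Promise<void> {
    try {
      await save();
      consecutiveFailures = 0; // any success resets the counter
    } catch {
      consecutiveFailures++;
      if (consecutiveFailures >= threshold) {
        // Threshold reached: halt instead of running on without durable state.
        throw new PersistenceUnavailableError(consecutiveFailures);
      }
      warn(`persist failed (${consecutiveFailures}/${threshold}), continuing`);
    }
  };
}
```

Failures below the threshold are logged and absorbed so a brief storage blip does not kill the run; only a sustained outage halts the workflow.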

Compensation / Saga rollback

For workflows with side effects (e.g. API calls, database writes), nodes can declare compensating actions that undo their work on failure.

Nodes with requires_compensation: true push an entry onto the compensation_stack in state after successful execution. On failure, if autoRollback: true is set on the GraphRunner options, the engine executes compensation entries in LIFO order and transitions the workflow to cancelled status.

const graph = createGraph({
  name: 'Saga Example',
  nodes: [
    {
      id: 'charge_payment',
      type: 'tool',
      toolId: 'stripe_charge',
      readKeys: ['order'],
      writeKeys: ['payment_result'],
      requiresCompensation: true,
    },
    {
      id: 'reserve_inventory',
      type: 'tool',
      toolId: 'inventory_reserve',
      readKeys: ['order'],
      writeKeys: ['reservation'],
      requiresCompensation: true,
    },
    // ... more nodes ...
  ],
  edges: [
    { source: 'charge_payment', target: 'reserve_inventory' },
  ],
  startNode: 'charge_payment',
  endNodes: ['confirm_order'],
});

const runner = new GraphRunner(graph, state, {
  autoRollback: true, // execute compensation stack on failure
});

In the example above, charge_payment pushes its compensation entry onto the compensation_stack once it succeeds. The host application is responsible for registering the compensating tool calls; the orchestrator does not infer them from the forward action. If reserve_inventory then fails and autoRollback: true is set, the engine drains the stack in LIFO order (calling each registered compensator) and transitions the workflow to cancelled.

When autoRollback is false (the default), the compensation stack is preserved in state but not executed — the host application decides how to handle rollback.
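For hosts that handle rollback themselves, draining the stack might look like the sketch below. The entry shape (nodeId, compensatorId, args), the registry, and the function name are all assumptions; the source does not specify what a compensation entry contains.

```typescript
// Sketch only: entry shape and registry are assumptions for illustration.
interface CompensationEntry {
  nodeId: string;
  compensatorId: string; // hypothetical key into the host's compensator registry
  args: unknown;
}

type Compensator = (args: unknown) => Promise<void>;

async function drainCompensationStack(
  stack: CompensationEntry[],
  registry: Map<string, Compensator>,
): Promise<string[]> {
  const compensated: string[] = [];
  // Pop from the end so the most recently completed node is undone first (LIFO).
  for (let entry = stack.pop(); entry !== undefined; entry = stack.pop()) {
    const compensator = registry.get(entry.compensatorId);
    if (!compensator) {
      throw new Error(`No compensator registered for ${entry.compensatorId}`);
    }
    await compensator(entry.args);
    compensated.push(entry.nodeId);
  }
  return compensated;
}
```

Running compensators sequentially, newest first, mirrors the order in which side effects were applied. This sketch stops at the first failing compensator; a real host would need to decide whether to continue or escalate.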

Graceful shutdown

runner.shutdown() signals the engine to stop after the current node completes. The workflow remains in running status (resumable from the last persisted state) and emits a workflow:paused event:

const runner = new GraphRunner(graph, state, {
  persistStateFn: async (s) => persistence.saveWorkflowSnapshot(s),
});

// Start the workflow
const resultPromise = runner.run();

// Later, signal graceful stop
runner.shutdown();

// run() resolves after the current node finishes
const pausedState = await resultPromise;
// pausedState.status === 'running' — resumable

This is useful for deployments, scaling down, or pausing long-running workflows without losing progress.

Event log recovery

GraphRunner.recover(graph, runId, eventLog, options?) rebuilds a ready-to-continue runner:

If a checkpoint exists, load the latest one (fast path) and replay only the events after it; otherwise load all events.
If there are no events and no checkpoint, throw EventLogCorruptionError.
Without a checkpoint, require an _init event in the log (otherwise throw EventLogCorruptionError).
Verify that the events are gap-free: sequence_ids must be contiguous from the checkpoint anchor, and any gap (a lost append) throws EventLogCorruptionError.
Check the workflow_started event’s REPLAY_VERSION and warn on a mismatch (reducer-semantics drift).
Replay the events through the same pure reducers to reconstruct state. Replay is deterministic because reducers take time from each action’s metadata.
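The validation steps in this list (empty log, missing _init, sequence gaps) can be sketched as one function. The event shape, the function name, and the assumption that the first event after a checkpoint carries the checkpoint's sequence_id plus one are illustrative, not taken from the library.

```typescript
// Sketch only: event shape and function name are assumptions.
class EventLogCorruptionError extends Error {
  readonly runId: string;

  constructor(runId: string, detail: string) {
    super(`Event log for run ${runId} is corrupt: ${detail}`);
    this.name = 'EventLogCorruptionError';
    this.runId = runId;
  }
}

interface LoggedEvent {
  sequence_id: number;
  type: string;
}

// checkpointSeq is the sequence_id covered by the checkpoint, or null if none exists.
function verifyEventLog(
  runId: string,
  events: LoggedEvent[],
  checkpointSeq: number | null,
): void {
  if (events.length === 0 && checkpointSeq === null) {
    throw new EventLogCorruptionError(runId, 'no events and no checkpoint');
  }
  if (checkpointSeq === null && !events.some((e) => e.type === '_init')) {
    throw new EventLogCorruptionError(runId, 'missing _init event');
  }
  // With a checkpoint, the first event must follow it directly;
  // without one, the first event's own id anchors the sequence.
  let expected: number | null = checkpointSeq === null ? null : checkpointSeq + 1;
  for (const event of events) {
    if (expected !== null && event.sequence_id !== expected) {
      throw new EventLogCorruptionError(
        runId,
        `expected sequence_id ${expected}, got ${event.sequence_id}`,
      );
    }
    expected = event.sequence_id + 1;
  }
}
```

Checking contiguity before replay means a lost append is reported as corruption instead of silently producing a state that skips an action.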

At the worker level, recovery also reconciles this replayed state against the latest snapshot and resumes from whichever reflects more progress (see Distributed Execution → Crash recovery).

Error propagation flow

graph TD
    Throw[Node Executor throws] --> Type{Error Type}

    Type --> |Config / Unsupported| Fail[Dispatch _fail]
    Type --> |Permission Denied| Fail

    Type --> |Agent Timeout / Execution| Retries{Retries left?}
    Retries -->|Yes| Retry[Backoff & Retry node]
    Retries -->|No| Fail

    Type --> |Circuit Breaker Open| Fallback[Skip node & advance to fallback edge]

    Type --> |Budget Exceeded| Budget[Dispatch _budget_exceeded]

    Type --> |Persistence Unavailable| Worker[Bubbles up to Worker]

    Fail --> StatusFailed[status = 'failed']
    Budget --> StatusFailed
    Worker --> JobFailed[Worker marks job as failed]
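The flow in the diagram can be read as a single routing decision per thrown error. The routeError function and its outcome labels are hypothetical; the default branch is an assumption, since the diagram does not cover every class in the hierarchy, and the sketch ignores the retryable property on AgentExecutionError.

```typescript
// Sketch only: outcome labels mirror the diagram; the function itself is hypothetical.
type ErrorOutcome = 'retry' | 'fail' | 'fallback' | 'budget_exceeded' | 'bubble_to_worker';

function routeError(errorName: string, retriesLeft: number): ErrorOutcome {
  switch (errorName) {
    case 'AgentTimeoutError':
    case 'AgentExecutionError':
      // Retryable while attempts remain, otherwise dispatch _fail.
      return retriesLeft > 0 ? 'retry' : 'fail';
    case 'CircuitBreakerOpenError':
      return 'fallback'; // skip the node and advance to the fallback edge
    case 'BudgetExceededError':
      return 'budget_exceeded'; // dispatch _budget_exceeded
    case 'PersistenceUnavailableError':
      return 'bubble_to_worker'; // the worker marks the job as failed
    case 'NodeConfigError':
    case 'UnsupportedNodeTypeError':
    case 'PermissionDeniedError':
      return 'fail';
    default:
      return 'fail'; // assumption: error types not shown in the diagram are fatal
  }
}
```

For example, routeError('AgentTimeoutError', 0) yields 'fail', matching the "Retries left? No" branch.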

Dead-lettering (distributed execution)

When using the WorkflowWorker, jobs that fail more times than max_attempts are moved to a dead letter queue. Dead-lettered jobs are not retried automatically — they require manual investigation.

The worker emits a job:dead_letter event when this happens:

worker.on('job:dead_letter', ({ jobId, runId, error }) => {
  alertOps(`Job ${jobId} (run ${runId}) dead-lettered: ${error}`);
});

Monitor queue health via getQueueDepth():

const { waiting, active, paused, dead_letter } = await queue.getQueueDepth();

Next steps

Workflow State — the shared state that errors affect
Distributed Execution — worker crash recovery and dead-lettering
Security — how write_keys and taint tracking enforce zero trust
Tracing — correlating errors with distributed traces

Error categories

Config / wiring errors: fix graph definition or runner options
Routing errors: dead-end detection
Runtime errors: retry or degrade
Data integrity errors: halt execution
Split-brain errors: abort the local runner immediately
Agent permission errors: security boundary