Use case

Multi-Agent Orchestration

Your LangGraph pipeline has a planner, three sub-agents, and a tool-calling worker. One sub-agent returns a prose string where the orchestrator expected structured JSON. The parse fails silently. Every downstream agent receives garbage state. The failure surfaces at the final output node, stripped of the chain of custody you need to diagnose it.

Where things go wrong

Subagent schema drift

A planner agent calls a sub-agent expecting `{"status": "ok", "data": [...]}`. The sub-agent returns a prose explanation instead. The orchestrator’s JSON parse fails silently or falls back to a default that propagates garbage downstream. Three hops later, the output is coherent-looking but factually wrong.

The final pipeline output contains errors that originated at hop two, with no per-span scoring to identify where the drift started.
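
One way to make this drift visible at the hop where it happens is to validate every sub-agent reply against the expected envelope and return an explicit failure instead of a default. The sketch below is a minimal, dependency-free illustration. `SubAgentReply`, `parseSubAgentReply`, and the `hop` label are hypothetical names, not part of LangGraph or TruLayer, and treating `status` as any string is an assumption.

```typescript
// Hypothetical envelope the orchestrator expects from a sub-agent,
// modeled on the {"status": "ok", "data": [...]} shape described above.
interface SubAgentReply {
  status: string
  data: unknown[]
}

// Either a validated reply, or a failure tagged with the hop that produced it.
type ParseResult =
  | { ok: true; value: SubAgentReply }
  | { ok: false; hop: string; reason: string }

function parseSubAgentReply(hop: string, raw: string): ParseResult {
  let parsed: unknown
  try {
    parsed = JSON.parse(raw)
  } catch {
    // Prose replies land here instead of being swallowed by a default.
    return { ok: false, hop, reason: 'reply is not valid JSON' }
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    return { ok: false, hop, reason: 'reply is not a JSON object' }
  }
  const fields = parsed as Record<string, unknown>
  const status = fields['status']
  const data = fields['data']
  if (typeof status !== 'string') {
    return { ok: false, hop, reason: 'missing or non-string status' }
  }
  if (!Array.isArray(data)) {
    return { ok: false, hop, reason: 'data is not an array' }
  }
  return { ok: true, value: { status, data } }
}
```

Because the failure carries the hop label, the orchestrator can stop or retry at that hop rather than passing a default downstream.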

Tool-choice cascade

In a sequential crew, Agent B selects the wrong tool based on Agent A’s output, which was itself a low-confidence generation. Each agent’s error is individually small; compounded across the sequence, the final output is plausible-sounding but incorrect. No single hop fails hard enough to surface a visible error.

The pipeline produces wrong outputs confidently, with no per-hop signal to trace which agent in the sequence introduced the error.
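
To see why no single hop needs to fail hard, consider a simplified model in which each hop is independently correct with some probability. The chance that the whole sequence is correct is then the product of the per-hop values. The function names and numbers below are illustrative assumptions, not measurements, and real hops are rarely fully independent.

```typescript
// Probability that every hop in a sequence is correct, under the simplifying
// assumption that hops fail independently of one another.
function chainReliability(perHop: number[]): number {
  return perHop.reduce((acc, p) => acc * p, 1)
}

// Index of the first hop at which the cumulative reliability drops below
// `threshold`, or -1 if it never does.
function firstHopBelow(perHop: number[], threshold: number): number {
  let cumulative = 1
  return perHop.findIndex((p) => {
    cumulative *= p
    return cumulative < threshold
  })
}

// Five hops that are each 90% reliable in this model.
const hops = [0.9, 0.9, 0.9, 0.9, 0.9]
const endToEnd = chainReliability(hops)
const firstWeakHop = firstHopBelow(hops, 0.7)
```

In this model, five hops at 0.9 each give roughly 0.59 end to end, and the cumulative value first falls below 0.7 at the fourth hop (index 3), even though no individual hop looks alarming.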

State poisoning via corrupted checkpoint

LangGraph checkpointing saves a corrupted intermediate state from a partially-failed run. On retry, the graph resumes from the corrupt checkpoint rather than from a clean state. The same failure mode fires on every subsequent run against the same thread — the bug is now baked into the graph state, not just the current execution.

Every retry on the same graph thread fails identically; the corruption is permanent until the checkpoint is manually cleared, which requires knowing which checkpoint is bad.
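
A common mitigation is to validate state before it is persisted, so that a partially failed run never becomes the resume point. The sketch below uses an in-memory store as a stand-in for a checkpointer. `createValidatedStore`, `CheckpointStore`, and the validator contract are hypothetical and do not reflect LangGraph's checkpointer API.

```typescript
// A validator returns null when the state is safe to persist,
// or a short reason string when it is not.
type Validator<S> = (state: S) => string | null

interface CheckpointStore<S> {
  save(threadId: string, state: S): boolean
  latest(threadId: string): S | undefined
}

// In-memory stand-in for a checkpointer (hypothetical; not LangGraph's API).
function createValidatedStore<S>(validate: Validator<S>): CheckpointStore<S> {
  const byThread = new Map<string, S[]>()
  return {
    save(threadId: string, state: S): boolean {
      // Refuse to persist state that fails validation, so a partial failure
      // never becomes the resume point for the thread.
      if (validate(state) !== null) return false
      const history = byThread.get(threadId) ?? []
      // Deep-copy via JSON so later mutation cannot alter the saved snapshot
      // (assumes the state is JSON-serializable).
      history.push(JSON.parse(JSON.stringify(state)) as S)
      byThread.set(threadId, history)
      return true
    },
    latest(threadId: string): S | undefined {
      const history = byThread.get(threadId)
      return history === undefined ? undefined : history[history.length - 1]
    },
  }
}
```

With this shape, a retry on the same thread resumes from the last state that passed validation instead of the corrupt one.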

Undetected infinite loop

A cycle condition in the graph is met on unexpected input. The pipeline runs until timeout, consuming tokens and accruing cost at every hop, with no output. The failure mode is invisible until the bill arrives or a timeout error surfaces — there is no signal during the run that the loop is not converging.

Token spend and latency grow without bound, no output is delivered, and tracing which cycle condition triggered requires post-hoc log review.
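
A hop and token budget around the loop gives a signal during the run instead of at timeout. The following is a minimal synchronous sketch: `runWithBudget`, `Budget`, and `HopResult` are hypothetical names, and a real graph step would normally be asynchronous.

```typescript
interface HopResult<S> {
  state: S
  tokensUsed: number
  done: boolean
}

interface Budget {
  maxHops: number
  maxTokens: number
}

type StopReason = 'done' | 'max_hops' | 'max_tokens'

interface RunSummary<S> {
  state: S
  hops: number
  tokens: number
  stoppedBy: StopReason
}

// Runs `step` repeatedly until it reports completion or a budget is exhausted.
function runWithBudget<S>(
  initial: S,
  step: (state: S, hop: number) => HopResult<S>,
  budget: Budget,
): RunSummary<S> {
  let state = initial
  let tokens = 0
  for (let hop = 0; hop < budget.maxHops; hop++) {
    const result = step(state, hop)
    state = result.state
    tokens += result.tokensUsed
    if (result.done) return { state, hops: hop + 1, tokens, stoppedBy: 'done' }
    // Stop as soon as cumulative spend reaches the cap, not at timeout.
    if (tokens >= budget.maxTokens) {
      return { state, hops: hop + 1, tokens, stoppedBy: 'max_tokens' }
    }
  }
  return { state, hops: budget.maxHops, tokens, stoppedBy: 'max_hops' }
}
```

The `stoppedBy` field distinguishes a converged run from one that was cut off, which is the signal the failure mode above lacks.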

Eval + control loop

How TruLayer closes the loop

JSON Schema · Tool Choice · Function Call · Faithfulness

Multi-agent pipelines fail differently from single-agent pipelines: errors compound multiplicatively across hops, and the final output gives no indication of where in the sequence the failure originated. TruLayer instruments each hop as a separate span.

The JSON schema conformance evaluator scores whether each sub-agent’s output matches the schema the orchestrator declared it expects. Subagent schema drift is a JSON schema conformance failure at hop two, and it scores at hop two, not at the final node where the corrupted output becomes visible. The tool-choice correctness evaluator scores whether each agent in the crew selected the correct tool for its assigned task. Both run inline on every span across every hop.

When a rule fires at hop two, the control loop acts before hop three receives corrupted input. The next execution of the same graph path gets corrected input at the point where drift started, rather than a retry of the full pipeline from scratch.

The faithfulness evaluator adds a layer that JSON schema conformance alone misses: schema conformance checks structure, but faithfulness checks whether the sub-agent’s output is grounded in the context it was passed. A sub-agent can return a schema-valid response that fabricates its content — the JSON parses cleanly, but the data is invented. For multi-agent pipelines that move information across hops, faithfulness scoring at each span is the mechanism that ensures each agent’s output traces back to what it was actually given, not to free generation. The function-call correctness evaluator closes the last gap: at each hop, it scores whether the tool was called with valid arguments, catching the failure mode where Agent B calls the right tool but with malformed parameters derived from Agent A’s low-confidence output.
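
To make the "right tool, malformed parameters" case concrete, the sketch below checks a tool call against a declared parameter list. It is an illustration of the idea only, not TruLayer's evaluator. `ToolSpec`, `ToolCall`, and `checkToolCall` are hypothetical, and the check covers only presence and primitive types.

```typescript
type ParamType = 'string' | 'number' | 'boolean'

// Hypothetical declaration of a tool and the arguments it requires.
interface ToolSpec {
  name: string
  params: Record<string, ParamType>
}

interface ToolCall {
  tool: string
  args: Record<string, unknown>
}

// Returns a list of problems; an empty list means the call is well-formed.
function checkToolCall(specs: ToolSpec[], call: ToolCall): string[] {
  const spec = specs.find((s) => s.name === call.tool)
  if (spec === undefined) return ['unknown tool: ' + call.tool]
  const problems: string[] = []
  for (const [param, expected] of Object.entries(spec.params)) {
    const actual = call.args[param]
    if (actual === undefined) {
      problems.push('missing argument: ' + param)
    } else if (typeof actual !== expected) {
      problems.push('argument ' + param + ' should be ' + expected + ', got ' + typeof actual)
    }
  }
  for (const param of Object.keys(call.args)) {
    if (!(param in spec.params)) problems.push('unexpected argument: ' + param)
  }
  return problems
}
```

A call can pass a tool-choice check and still fail here, for example when Agent B passes a numeric limit as a string copied from Agent A's prose.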

The per-trace view surfaces every hop as a labeled span with its own eval scores. For an engineering team debugging a LangGraph pipeline, this is the difference between "the final output was wrong" and "hop two had a JSON schema conformance score of 0.0, the faithfulness evaluator at hop three fired at 0.28, and the retry at hop three resolved it." The chain of custody is in the trace. When a control-loop action fires, it targets the specific hop that failed — not a blanket pipeline retry that re-runs every agent from the start and obscures whether the fix worked at the right point.
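
As a rough illustration of reading per-hop scores, the sketch below finds the earliest hop whose score falls under a threshold. The `SpanScore` shape, the evaluator labels, and the 0.5 threshold are assumptions made for the example and do not describe TruLayer's trace format. The scores reuse the values quoted above.

```typescript
// Hypothetical per-span score record; not TruLayer's actual trace schema.
interface SpanScore {
  hop: number
  evaluator: string
  score: number
}

// Returns the failing score from the earliest hop, or undefined if none fail.
function firstFailingHop(scores: SpanScore[], threshold: number): SpanScore | undefined {
  return scores
    .filter((s) => s.score < threshold)
    .sort((a, b) => a.hop - b.hop)[0]
}

const trace: SpanScore[] = [
  { hop: 3, evaluator: 'faithfulness', score: 0.28 },
  { hop: 1, evaluator: 'json_schema', score: 1.0 },
  { hop: 2, evaluator: 'json_schema', score: 0.0 },
]
const origin = firstFailingHop(trace, 0.5)
```

Here `origin` points at hop two, where the drift started, rather than at hop three, where its effects were also scored.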

See it in practice

Instrument your multi-agent pipeline in two lines.

Wrap your LLM client. Every span from this client is captured and scored by every built-in evaluator. Eval rules and control-loop actions are configured in the dashboard.

agent.ts

import { TruLayer } from '@trulayer/sdk'
import OpenAI from 'openai'

const tl = new TruLayer({ apiKey: process.env.TRULAYER_API_KEY })
const openai = tl.instrument(new OpenAI())

// Every span from this client is captured, scored by all 25
// built-in evaluators, and surfaced in the orchestration project.
// Eval rules + control-loop actions are configured in the dashboard,
// not in your application code.

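// `task` is a placeholder prompt for illustration; in a real pipeline
// it would come from the planner.
const task = 'Summarize the open incidents for the on-call engineer.'

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: task }],
})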

Ship reliable multi-agent orchestration.

Free tier includes 1M spans / month · No credit card

Start free · Read the integration guide