Use case
Voice AI Agents
Your voice agent misclassified the caller’s first utterance and entered the upsell flow on a frustrated customer calling to cancel. The caller hung up. The call was marked completed. No eval fired. The same misclassification will run on the next caller in the same situation unless something closes the loop.
Where things go wrong
ASR-cascade misclassification
Background noise degrades the ASR transcript — "cancel" becomes "can sell." The downstream intent classifier picks the upsell flow. The caller, already frustrated, is offered a discount on the service they called to cancel. The call ends early; the agent reports a completed interaction.
A cancellation caller is pushed further toward churn by the wrong call flow; the misclassification is invisible in the call log unless the NLU output is scored.
Interruption misfire
The agent’s turn-detection model fires on a brief mid-sentence pause and cuts the caller off. The caller repeats themselves; the agent cuts them off again. Trust breaks in the first thirty seconds of the call. Conversion drops; the call ends without the agent completing its task.
Premature turn-taking destroys the conversation flow; the failure is logged as a short call, not as a model error.
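One way to make this failure visible is to count how often the agent starts speaking before the caller has finished. The sketch below assumes each turn is logged with start and end timestamps; the `Turn` shape and the `countAgentInterruptions` helper are hypothetical names for illustration, not part of any TruLayer schema.

```typescript
// Hypothetical turn record; the field names are assumptions for this sketch.
interface Turn {
  speaker: 'caller' | 'agent';
  startMs: number; // when the speaker started talking
  endMs: number;   // when the speaker stopped talking
}

// Count agent turns that begin while the preceding caller turn is still in progress.
function countAgentInterruptions(turns: Turn[]): number {
  let interruptions = 0;
  for (let i = 1; i < turns.length; i++) {
    const prev = turns[i - 1];
    const curr = turns[i];
    if (
      prev.speaker === 'caller' &&
      curr.speaker === 'agent' &&
      curr.startMs < prev.endMs
    ) {
      interruptions++;
    }
  }
  return interruptions;
}

const exampleCall: Turn[] = [
  { speaker: 'caller', startMs: 0, endMs: 4000 },
  { speaker: 'agent', startMs: 2500, endMs: 5000 },   // cuts in mid-sentence
  { speaker: 'caller', startMs: 5200, endMs: 9000 },  // caller repeats themselves
  { speaker: 'agent', startMs: 7000, endMs: 9500 },   // cuts in again
  { speaker: 'caller', startMs: 9800, endMs: 12000 },
  { speaker: 'agent', startMs: 12400, endMs: 14000 }, // waits for the end of the turn
];

const interruptionCount = countAgentInterruptions(exampleCall); // 2
```

A count like this, attached to the call record, would let a short call be read as a turn-detection problem rather than as an ordinary early hang-up.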
Multi-turn context collapse
In a long appointment-booking call, the agent loses track of a slot filled three turns ago — the caller’s preferred date — and asks them to repeat it. The caller’s frustration signal spikes. The agent does not detect the change in sentiment and continues in the same tone.
Caller drops off before booking completes; the partial session has no record of what information was already collected or where the conversation state broke down.
Wrong tool call on booking
The agent calls the `create_appointment` tool with a timezone offset error — the caller said "3pm" in their local time, the agent booked 3pm UTC. The downstream calendar invite fires before anyone reviews it. The patient or customer shows up at the wrong time.
A missed or wrong-time appointment; the error propagates silently through the calendar system with no eval score on the tool-call arguments to flag it.
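A check on the tool-call arguments for this case could look roughly like the sketch below. It assumes the `create_appointment` tool takes a UTC start instant and that the caller's UTC offset for the booking date is already known; the `startUtc` field, the `StatedLocalTime` shape, and the `checkBookedTime` helper are all hypothetical.

```typescript
// Hypothetical argument shape for the create_appointment tool.
interface CreateAppointmentArgs {
  startUtc: string; // ISO 8601 instant, e.g. '2025-01-15T20:00:00Z'
}

// The wall-clock time the caller actually said, in their own local time.
interface StatedLocalTime {
  year: number;
  month: number; // 1-12
  day: number;
  hour: number;  // 0-23
  minute: number;
}

// Compare the booked instant with the instant implied by the caller's local time.
// utcOffsetMinutes is the caller's offset from UTC on that date (assumed known),
// e.g. -300 for UTC-5.
function checkBookedTime(
  args: CreateAppointmentArgs,
  stated: StatedLocalTime,
  utcOffsetMinutes: number,
): { ok: boolean; expectedUtc: string } {
  const wallAsUtcMs = Date.UTC(
    stated.year, stated.month - 1, stated.day, stated.hour, stated.minute,
  );
  const expectedMs = wallAsUtcMs - utcOffsetMinutes * 60000;
  return {
    ok: Date.parse(args.startUtc) === expectedMs,
    expectedUtc: new Date(expectedMs).toISOString(),
  };
}

// Caller in a UTC-5 zone says "3pm"; the agent books 3pm UTC.
const statedTime: StatedLocalTime = { year: 2025, month: 1, day: 15, hour: 15, minute: 0 };
const wrongBooking = checkBookedTime({ startUtc: '2025-01-15T15:00:00Z' }, statedTime, -300);
const rightBooking = checkBookedTime({ startUtc: '2025-01-15T20:00:00Z' }, statedTime, -300);
// wrongBooking.ok is false; rightBooking.ok is true
```

The offset is passed in rather than computed because it depends on the caller's time zone and on daylight saving rules for the booking date; resolving it is a separate step this sketch does not cover.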
Eval + control loop
What happens when a rule fires
The response
How TruLayer closes the loop
- Tool Choice
- Function Call
- Multi-Turn Consistency
- Sentiment Match
Voice agent failures cluster around two surfaces: the tool call and the conversation state. The tool-choice correctness evaluator scores whether the agent selected the right call-flow action for the intent it detected — it catches the ASR-cascade misclassification at the point where the wrong flow was selected, not in the recording after the call ends. The function-call correctness evaluator scores whether the booking tool was invoked with the right arguments — the right date, the correct timezone resolution, the valid slot ID. Both run inline on every span. A low score on either evaluator fires the control loop before the next call on the same failure path reuses the same wrong intent mapping or the same malformed booking argument.
The multi-turn consistency evaluator closes the gap that tool-call scoring alone misses: it tracks whether the agent’s outputs stay internally consistent across turns — whether a slot filled three turns ago is still present in the agent’s state when it is needed. Conversation collapse in long calls is invisible to per-span scoring unless the evaluator looks across the turn sequence. For a booking agent handling four-to-six-turn conversations, multi-turn consistency scoring is the difference between knowing the agent dropped a slot value and only knowing the call ended without a booking. The sentiment-match evaluator runs alongside this: when a caller’s tone escalates from neutral to frustrated — a signal measurable from the ASR transcript — and the agent continues responding in a neutral or positive register, the mismatch scores low. The control loop can route the next call in the same sentiment-escalation pattern to a human operator before the conversation reaches the hang-up point.
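To make the slot-tracking idea concrete, here is a minimal sketch of a cross-turn check that reports a slot which was filled earlier and is later missing from the agent's state. It assumes one state snapshot is recorded per turn; `SlotState`, `DroppedSlot`, and `findDroppedSlots` are hypothetical names, and this is not a description of how the built-in evaluator is implemented.

```typescript
// Hypothetical snapshot of the agent's filled slots after a given turn.
type SlotState = Record<string, string>;

interface DroppedSlot {
  slot: string;
  filledAtTurn: number;  // first turn at which the slot had a value
  missingAtTurn: number; // first later turn at which it was gone
}

function hasSlot(state: Record<string, unknown>, slot: string): boolean {
  return Object.prototype.hasOwnProperty.call(state, slot);
}

// snapshots[i] is the agent's slot state after turn i.
function findDroppedSlots(snapshots: SlotState[]): DroppedSlot[] {
  const firstSeen: Record<string, number> = {};
  const reported: Record<string, boolean> = {};
  const dropped: DroppedSlot[] = [];

  snapshots.forEach((state, turn) => {
    // Any slot seen on an earlier turn that is absent now counts as dropped (reported once).
    Object.keys(firstSeen).forEach((slot) => {
      if (!hasSlot(state, slot) && !reported[slot]) {
        dropped.push({ slot, filledAtTurn: firstSeen[slot], missingAtTurn: turn });
        reported[slot] = true;
      }
    });
    // Record slots that appear for the first time on this turn.
    Object.keys(state).forEach((slot) => {
      if (!hasSlot(firstSeen, slot)) firstSeen[slot] = turn;
    });
  });

  return dropped;
}

const bookingCall: SlotState[] = [
  { service: 'cleaning' },
  { service: 'cleaning', preferredDate: '2025-03-14' },
  { service: 'cleaning', preferredDate: '2025-03-14', preferredTime: '15:00' },
  { service: 'cleaning', preferredDate: '2025-03-14', preferredTime: '15:00' },
  { service: 'cleaning', preferredTime: '15:00' }, // preferredDate is gone
];

const droppedSlots = findDroppedSlots(bookingCall);
// [{ slot: 'preferredDate', filledAtTurn: 1, missingAtTurn: 4 }]
```

In this example the date filled at turn 1 is missing at turn 4, three turns later, which is the kind of drop a per-span score would not surface on its own.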
When a rule fires, the control loop acts before the next call in the same failure class repeats. For a voice pipeline, "next call" means the next caller who triggers the same intent path or the same turn-depth where context collapse was detected. The action choices are the same three available across all TruLayer deployments: retry with a corrected prompt that explicitly re-anchors the agent’s state to the filled slots; route the call type to a human operator queue for review before the same scenario auto-routes again; or fall back to a more conservative model with lower creative latitude on intent inference. The per-trace before/after delta surfaces the original score and the post-remediation score side by side — for a voice AI engineering team, this is the tool that turns "calls are failing" into "the ASR-to-NLU handoff on this intent class has a 0.33 tool-choice score and every retry since the last rule update has not resolved it."
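The logic behind a figure like "0.33 on this intent class" can be sketched as a per-intent mean over scored spans, followed by a choice among the three actions. Everything here is illustrative: the `ScoredSpan` shape, the 0.5 threshold, and the escalation order are assumptions, and in TruLayer these rules are configured in the dashboard rather than written in application code.

```typescript
// Hypothetical scored span: the detected intent plus a tool-choice score in [0, 1].
interface ScoredSpan {
  intent: string;
  toolChoiceScore: number;
}

type ControlAction =
  | 'none'
  | 'retry_with_corrected_prompt'
  | 'fallback_model'
  | 'route_to_human';

// Mean tool-choice score per intent class.
function meanScoreByIntent(spans: ScoredSpan[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  spans.forEach((span) => {
    if (!Object.prototype.hasOwnProperty.call(sums, span.intent)) {
      sums[span.intent] = { total: 0, count: 0 };
    }
    sums[span.intent].total += span.toolChoiceScore;
    sums[span.intent].count += 1;
  });
  const means: Record<string, number> = {};
  Object.keys(sums).forEach((intent) => {
    means[intent] = sums[intent].total / sums[intent].count;
  });
  return means;
}

// Assumed escalation policy: retry first, then a more conservative model, then a human.
function chooseAction(meanScore: number, failedRetries: number, threshold = 0.5): ControlAction {
  if (meanScore >= threshold) return 'none';
  if (failedRetries === 0) return 'retry_with_corrected_prompt';
  if (failedRetries === 1) return 'fallback_model';
  return 'route_to_human';
}

const scoredSpans: ScoredSpan[] = [
  { intent: 'cancel', toolChoiceScore: 1 },
  { intent: 'cancel', toolChoiceScore: 0 },
  { intent: 'cancel', toolChoiceScore: 0 },
  { intent: 'billing', toolChoiceScore: 1 },
  { intent: 'billing', toolChoiceScore: 1 },
];

const intentMeans = meanScoreByIntent(scoredSpans); // cancel is 1/3, billing is 1
const cancelAction = chooseAction(intentMeans['cancel'], 2);   // 'route_to_human'
const billingAction = chooseAction(intentMeans['billing'], 0); // 'none'
```

Grouping by intent is what lets a low score point at a specific handoff instead of at the call volume as a whole.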
See it in practice
Instrument your voice AI agent in two lines.
Wrap your LLM client. Every span from the wrapped client is captured and scored by every built-in evaluator. Eval rules and control-loop actions are configured in the dashboard.
```typescript
import { TruLayer } from '@trulayer/sdk'
import OpenAI from 'openai'

const tl = new TruLayer({ apiKey: process.env.TRULAYER_API_KEY })
const openai = tl.instrument(new OpenAI())

// Placeholder input for illustration; in a voice pipeline this would be
// the ASR transcript of the caller's utterance.
const task = 'I want to cancel my subscription.'

// Every span from this client is captured, scored by all 25
// built-in evaluators, and surfaced in the voice project.
// Eval rules + control-loop actions are configured in the dashboard,
// not in your application code.
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: task }],
})
```

Ship reliable voice AI agents.
Free tier includes 1M spans / month · No credit card