Use case

AI Email Assistants

Your AI email assistant included negotiation terms from a different thread in a cold outreach reply — same sender domain, different contact. The draft went to the wrong person before anyone reviewed it. The agent’s retrieval window pulled the wrong thread. Nothing in the pipeline scored whether the draft was grounded in the right context.

Where things go wrong

Cross-thread context leak

The agent’s retrieval window includes a span from a different conversation — same sender domain, different contact. Confidential terms from a prior negotiation appear in a draft addressed to a cold prospect at the same company. Auto-send fires before the user reviews.

Confidential information disclosed to an unintended recipient; the exposure is attributable to the agent’s retrieval logic, not to any user action.
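One way to make this failure visible before a draft is written is to check which thread each retrieved span came from. The sketch below is a minimal illustration, not TruLayer's evaluator or any SDK API: the `RetrievedSpan` shape and the `partitionByThread` helper are hypothetical names, and it assumes your retrieval layer tags every span with a thread ID.

```typescript
// Hypothetical shape for a retrieved span; assumes the retrieval layer
// records the thread each span was pulled from.
interface RetrievedSpan {
  threadId: string
  text: string
}

// Splits retrieved spans into those from the thread being answered and
// those from any other thread. Out-of-scope spans can be dropped or flagged.
function partitionByThread(
  currentThreadId: string,
  spans: RetrievedSpan[],
): { inScope: RetrievedSpan[]; outOfScope: RetrievedSpan[] } {
  const inScope: RetrievedSpan[] = []
  const outOfScope: RetrievedSpan[] = []
  for (const span of spans) {
    if (span.threadId === currentThreadId) {
      inScope.push(span)
    } else {
      outOfScope.push(span)
    }
  }
  return { inScope, outOfScope }
}
```

A non-empty `outOfScope` list is a cheap signal that the draft may be informed by another conversation, even when both threads share a sender domain.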

Wrong-recipient draft

Two contacts share a similar name in a cluttered inbox. The agent addresses a sensitive reply to the wrong person — the name is plausible from the thread context, but it is the wrong contact. The draft is sent before the user confirms the recipient.

A sensitive reply reaching the wrong person; depending on the content, the consequence ranges from embarrassing to a compliance or legal exposure.
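A simple pre-send check for this case is to confirm that the drafted recipient actually took part in the thread being answered. The following is a sketch under stated assumptions: `recipientIsThreadParticipant` is a hypothetical helper, not part of any SDK, and it assumes you can list the participant addresses of the current thread.

```typescript
// Normalizes an address for comparison. Treating the whole address as
// case-insensitive is a simplifying assumption.
function normalizeEmail(address: string): string {
  return address.trim().toLowerCase()
}

// Returns true only if the drafted recipient already appears among the
// participants of the thread being answered.
function recipientIsThreadParticipant(
  recipient: string,
  participants: string[],
): boolean {
  const target = normalizeEmail(recipient)
  return participants.some((p) => normalizeEmail(p) === target)
}
```

A draft that fails this check can be held for confirmation rather than sent, which covers the similar-name case without relying on the model to notice it.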

Unsubscribe handling failure

The agent processes an "unsubscribe" reply as a request for more information: the intent classifier reads the word "unsubscribe" as engagement rather than opt-out. A follow-up sequence fires. The sender is now at risk of a CAN-SPAM or GDPR violation.

Regulatory exposure and burned sender reputation; the failure is traceable to an intent misclassification that a sentiment or faithfulness evaluator on the incoming message would have caught.
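As an illustration of catching this mismatch, the sketch below cross-checks the classifier's label against explicit opt-out phrases in the incoming message. This is a deterministic keyword check, not the evaluator described on this page; the `Intent` labels, the pattern list, and `shouldHoldFollowUp` are all hypothetical and would need tuning for real traffic.

```typescript
// Hypothetical intent labels produced by the agent's classifier.
type Intent = 'opt_out' | 'info_request' | 'other'

// Illustrative opt-out phrases; not an exhaustive or authoritative list.
const OPT_OUT_PATTERNS: RegExp[] = [
  /\bunsubscribe\b/i,
  /\bopt(?:\s|-)?out\b/i,
  /\bremove me\b/i,
  /\bstop (?:emailing|contacting|sending)\b/i,
]

// True if the message body contains any explicit opt-out phrase.
function hasOptOutSignal(body: string): boolean {
  return OPT_OUT_PATTERNS.some((pattern) => pattern.test(body))
}

// Holds the follow-up sequence when the message contains an opt-out phrase
// but the classifier labeled it as something other than opt_out.
function shouldHoldFollowUp(body: string, classified: Intent): boolean {
  return classified !== 'opt_out' && hasOptOutSignal(body)
}
```

The check errs on the side of holding a sequence: a false positive delays one follow-up, while a false negative sends mail to someone who asked to stop.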

Tone misclassification

The agent classifies an escalating complaint as a routine inquiry based on surface features: the sender’s writing is calm and formal, not overtly hostile, but the content is an escalation. The agent responds with a standard reply in the same register it uses for routine tickets, and the complaint is routed to the wrong handler.

A complaint that needed immediate escalation to a senior handler instead goes through a standard queue; the delay compounds the customer’s frustration and makes the resolution harder.

Eval + control loop

How TruLayer closes the loop

PII Leakage
Multi-Turn Consistency
Sentiment Match
Faithfulness

For AI email assistants, the failure modes that cause the most harm share a common structure: the wrong context informed the draft. The PII leakage evaluator catches cross-thread contamination — when a draft contains identifiable information about a contact or conversation that does not belong in the current thread, the evaluator scores the span and the control loop routes the draft to a human review queue before it is sent. The multi-turn consistency evaluator catches wrong-recipient and context-bleed failures at a different level: it scores whether the drafted reply stays consistent with the actual thread it is responding to, not with a retrieved thread from a different conversation. The faithfulness evaluator closes the remaining gap — it scores whether the draft is grounded only in the current thread’s context, flagging cases where retrieved content from an unrelated conversation bleeds into the reply. All three run inline on every drafted span as each trace arrives.

The sentiment-match evaluator addresses the tone misclassification and unsubscribe mishandling failure modes. It scores whether the tone of the agent’s draft matches the detected sender sentiment: a calm-but-escalating complaint should produce a different draft register than a routine inquiry, and the evaluator catches the mismatch. For unsubscribe handling, the faithfulness evaluator runs on the incoming-message intent classification and catches the case where the agent’s understanding of the message does not match what the message actually said. When any rule fires, the control loop acts before the next draft in the same failure class is routed automatically. It can retry with a prompt that re-anchors the agent to the specific thread context, route the draft type to a human review queue for send confirmation, or fall back to a more conservative drafting model with lower retrieval latitude. For an email AI with auto-send enabled, the human-in-the-loop (HITL) queue is the critical path: a human confirmation step before send is the appropriate default for any draft class that has produced a PII leakage or wrong-recipient failure.
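The three control-loop actions described above can be pictured as a small decision function. This is a sketch of one possible policy, not TruLayer's implementation or configuration format: the `Evaluator` and `Action` names, the single-retry limit, and `chooseAction` itself are assumptions made for illustration.

```typescript
// Hypothetical identifiers for the four evaluators named on this page.
type Evaluator =
  | 'pii_leakage'
  | 'multi_turn_consistency'
  | 'sentiment_match'
  | 'faithfulness'

// Hypothetical action labels mirroring the options in the text.
type Action =
  | 'send'
  | 'retry_with_thread_anchor'
  | 'human_review'
  | 'fallback_model'

// Picks an action from the list of evaluators that failed on a draft.
// Assumed policy: PII or wrong-thread failures always go to human review;
// other failures get one re-anchored retry, then the conservative model.
function chooseAction(failed: Evaluator[], retriesSoFar: number): Action {
  if (failed.length === 0) return 'send'
  const needsHuman = failed.some(
    (e) => e === 'pii_leakage' || e === 'multi_turn_consistency',
  )
  if (needsHuman) return 'human_review'
  if (retriesSoFar === 0) return 'retry_with_thread_anchor'
  return 'fallback_model'
}
```

Sending PII and wrong-thread failures straight to review, with no retry, reflects the point above that human confirmation is the appropriate default for those draft classes.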

See it in practice

Instrument your AI email agent in two lines.

Wrap your LLM client. Every span from this client is captured and scored by every built-in evaluator. Eval rules and control-loop actions are configured in the dashboard.

agent.ts

import { TruLayer } from '@trulayer/sdk'
import OpenAI from 'openai'

const tl = new TruLayer({ apiKey: process.env.TRULAYER_API_KEY })
const openai = tl.instrument(new OpenAI())

// Every span from this client is captured, scored by all 25
// built-in evaluators, and surfaced in the email project.
// Eval rules + control-loop actions are configured in the dashboard,
// not in your application code.

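// Example prompt; in practice this is the drafting task for the current thread.
const task = 'Draft a reply to the latest message in this thread.'

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: task }],
})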

Ship reliable AI email assistants.

Free tier includes 1M spans / month · No credit card

Start free · Read the integration guide