Use case
AI Data Analysis Agents
Your AI analytics agent defined "monthly active users" using the wrong timestamp column. The number is 40% higher than your real MAU. The chart looks right. The board deck has it. You find out when someone asks a follow-up question the data cannot answer.
Where things go wrong
Schema misread (deprecated column join)
The agent generates SQL that joins on a column that exists in the raw schema but has been renamed in the semantic layer. The query runs and returns rows — the database does not reject it. The result is wrong, and it looks plausible enough to pass a spot-check by a non-technical reviewer.
A report built on incorrect joins propagates wrong numbers into dashboards, decks, or automated alerts before anyone runs a manual query to verify.
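One way to catch this class of error before the query runs is to scan the generated SQL for columns the semantic layer has renamed. The sketch below is a minimal illustration, not part of any SDK: the `RENAMED_COLUMNS` map, the column names in it, and the `findDeprecatedColumns` helper are all hypothetical, and the check is a plain-text scan rather than a real SQL parser.

```typescript
// Hypothetical mapping from deprecated raw-schema columns to their
// canonical semantic-layer names. All names here are illustrative.
const RENAMED_COLUMNS: Record<string, string> = {
  'events.created_ts': 'events.event_time',
  'users.signup_dt': 'users.signed_up_at',
}

interface DeprecatedColumnHit {
  deprecated: string
  canonical: string
}

// Returns every deprecated column reference found in the SQL text.
// This is a substring scan, so it will miss references made through
// table aliases; a production check would parse the SQL instead.
function findDeprecatedColumns(
  sql: string,
  renamed: Record<string, string> = RENAMED_COLUMNS,
): DeprecatedColumnHit[] {
  const normalized = sql.toLowerCase()
  return Object.entries(renamed)
    .filter(([deprecated]) => normalized.includes(deprecated.toLowerCase()))
    .map(([deprecated, canonical]) => ({ deprecated, canonical }))
}

const generatedSql = `
  SELECT COUNT(DISTINCT u.id)
  FROM users u
  JOIN events ON events.user_id = u.id
  WHERE events.created_ts >= '2024-01-01'
`
const hits = findDeprecatedColumns(generatedSql)
```

A non-empty `hits` array is a reason to block or regenerate the query, since the database itself will not reject it.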
Hallucinated metric definition
The agent defines a business metric — monthly active users, conversion rate, average order value — using a column that is semantically adjacent but not the canonical definition. The number is internally consistent; it just measures the wrong thing. The error is invisible without knowing what the canonical definition should have been.
A metric that is 20–40% off from the real value enters a board deck or investor update; the error requires a manual audit of the query logic to find.
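A metric definition can only be checked if the canonical version is written down somewhere the check can read. As a rough sketch, assuming a hypothetical in-code registry (`CANONICAL_METRICS`) and a simplified `MetricDefinition` shape that records just the table and timestamp column, a comparison might look like this:

```typescript
interface MetricDefinition {
  name: string
  table: string
  timestampColumn: string
}

// Hypothetical canonical registry; in practice this would be loaded
// from the semantic layer rather than hard-coded.
const CANONICAL_METRICS: Record<string, MetricDefinition> = {
  monthly_active_users: {
    name: 'monthly_active_users',
    table: 'events',
    timestampColumn: 'event_time',
  },
}

// Returns a list of human-readable mismatches between the agent's
// proposed definition and the canonical one. Empty means they agree.
function diffAgainstCanonical(
  proposed: MetricDefinition,
  registry: Record<string, MetricDefinition> = CANONICAL_METRICS,
): string[] {
  const canonical = registry[proposed.name]
  if (!canonical) {
    return [`no canonical definition for "${proposed.name}"`]
  }
  const mismatches: string[] = []
  if (proposed.table !== canonical.table) {
    mismatches.push(`table: got ${proposed.table}, expected ${canonical.table}`)
  }
  if (proposed.timestampColumn !== canonical.timestampColumn) {
    mismatches.push(
      `timestampColumn: got ${proposed.timestampColumn}, expected ${canonical.timestampColumn}`,
    )
  }
  return mismatches
}

// The agent picked a semantically adjacent timestamp column.
const mauMismatches = diffAgainstCanonical({
  name: 'monthly_active_users',
  table: 'events',
  timestampColumn: 'ingested_at',
})
```

The point of the design is that the wrong definition becomes a visible diff instead of a number that merely looks plausible.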
Stale context grounding
The agent retrieves schema documentation from a cached embedding index that has not been re-indexed since a table rename. It generates SQL against a table that no longer exists under that name. The error is swallowed by a fallback that returns empty results labeled "no data found", a phrase that to a business user implies a data problem, not a model error.
A query that should return results silently returns empty, is misread as a real signal ("no activity in this cohort"), and influences a business decision.
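The underlying problem is that "the query failed" and "the query matched nothing" collapse into the same output. A minimal sketch of keeping them apart, assuming a hypothetical `QueryOutcome` type and a synchronous `run` callback standing in for the real database call:

```typescript
type Row = Record<string, unknown>

// Three distinct outcomes, so a failure can never be rendered as "no data".
type QueryOutcome =
  | { kind: 'rows'; rows: Row[] }
  | { kind: 'empty' }
  | { kind: 'query_error'; message: string }

// Runs the query and classifies the result instead of swallowing errors.
function classifyOutcome(run: () => Row[]): QueryOutcome {
  try {
    const rows = run()
    return rows.length > 0 ? { kind: 'rows', rows } : { kind: 'empty' }
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    return { kind: 'query_error', message }
  }
}

// User-facing text differs for a genuine empty result and a failure.
function describeOutcome(outcome: QueryOutcome): string {
  switch (outcome.kind) {
    case 'rows':
      return `${outcome.rows.length} row(s) returned`
    case 'empty':
      return 'Query ran successfully and matched zero rows'
    case 'query_error':
      return `Query failed and produced no result: ${outcome.message}`
  }
}

// Simulates SQL generated against a table name that no longer exists.
const staleOutcome = classifyOutcome(() => {
  throw new Error('relation "user_events_v1" does not exist')
})
```

With the outcome typed this way, a renamed table surfaces as a failure the team can act on rather than as an empty cohort.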
Insight fabrication on empty results
When a query returns zero rows, the agent generates a plausible trend narrative rather than reporting null. Its system instruction says "always provide an insight" — so it does, from training data rather than from the actual query result. The insight is not flagged as fabricated; it is formatted exactly like insights that came from real data.
A fabricated narrative enters a slide deck or automated report; the fabrication is undetectable without running the underlying query independently.
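A simple guard against this failure is to skip narrative generation entirely when there are no rows to describe. The sketch below assumes a hypothetical `buildInsight` wrapper and a `summarize` callback standing in for the LLM call; neither comes from a real library.

```typescript
type ResultRow = Record<string, unknown>

interface Insight {
  // null means "nothing to report", which is a valid answer.
  text: string | null
  source: 'query_result' | 'none'
}

// Only invokes the summarizer when the query actually returned rows.
function buildInsight(
  rows: ResultRow[],
  summarize: (rows: ResultRow[]) => string,
): Insight {
  if (rows.length === 0) {
    return { text: null, source: 'none' }
  }
  return { text: summarize(rows), source: 'query_result' }
}

// Stand-in for the model call; counts invocations for illustration.
let summarizeCalls = 0
const fakeSummarize = (rows: ResultRow[]): string => {
  summarizeCalls += 1
  return `Summary of ${rows.length} row(s)`
}

const emptyInsight = buildInsight([], fakeSummarize)
const realInsight = buildInsight([{ cohort: 'A', users: 12 }], fakeSummarize)
```

Tagging each insight with its `source` also gives downstream reports a way to distinguish data-backed text from an explicit "no result".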
Eval + control loop
What happens when a rule fires
The response
How TruLayer closes the loop
- Faithfulness
- Hallucination
- Groundedness
- JSON Schema
For AI data analysis agents, the failure modes split across two evaluators. The faithfulness evaluator scores whether the generated SQL and the resulting narrative are grounded in the actual schema documentation and the retrieved query results. A low faithfulness score on a metric definition means the agent’s definition of "monthly active users" or "conversion rate" cannot be traced to the canonical metric definitions in its grounding context — it invented or inferred the definition rather than reading it. A low faithfulness score on a narrative means the insight does not trace to what the query actually returned. Both fire inline on every span as each trace arrives, covering the full pipeline from query generation to narrative output.
The hallucination evaluator catches the complementary failure mode that faithfulness alone does not cover: output that is not merely unfaithful to the context but fabricated outright, such as a trend claim, a cohort insight, or a data narrative that no query result produced. The insight fabrication failure, where the agent generates a trend narrative after the query returned zero rows, is a hallucination: an assertion about the world with no grounding in any retrieved context. The groundedness evaluator adds a third layer: it scores whether the narrative output stays within what the query results actually contain, rather than drifting into inferences the data does not support. The JSON schema conformance evaluator covers the structural side: for agents that emit chart configurations or structured metric objects, it scores whether the output conforms to the declared output schema so the downstream visualization layer receives valid data.
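To make the structural check concrete, here is a hand-rolled conformance test for a chart configuration. It stands in for a real JSON Schema validator and says nothing about how the evaluator itself is implemented; the `ChartConfig` shape and its fields are assumptions made for illustration.

```typescript
// Hypothetical output schema for a chart the agent is allowed to emit.
interface ChartConfig {
  type: 'bar' | 'line'
  x: string
  y: string
  title?: string
}

// Type guard: true only if the value matches the ChartConfig shape.
function isChartConfig(value: unknown): value is ChartConfig {
  if (typeof value !== 'object' || value === null) return false
  const v = value as Record<string, unknown>
  const title = v['title']
  return (
    (v['type'] === 'bar' || v['type'] === 'line') &&
    typeof v['x'] === 'string' &&
    typeof v['y'] === 'string' &&
    (title === undefined || typeof title === 'string')
  )
}

// Agent output arrives as text, so parse before checking the shape.
const validOutput: unknown = JSON.parse('{"type":"line","x":"month","y":"mau"}')
const invalidOutput: unknown = JSON.parse('{"type":"pie","x":"month"}')

const validConforms = isChartConfig(validOutput)
const invalidConforms = isChartConfig(invalidOutput)
```

Rejecting the second object before it reaches the visualization layer is cheaper than debugging a blank or malformed chart afterwards.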
When a rule fires, the control loop acts before the next query on the same failure path repeats the same schema misread or metric fabrication. The action choices are: retry with a prompt that explicitly names the correct column mapping or the canonical metric definition from the semantic layer; route the output to a human data reviewer before it enters the dashboard or report; or fall back to a more conservative model with lower generative latitude on metric interpretation. The per-trace before/after delta shows exactly which span produced the failing score — was it the SQL generation span, the narrative generation span, or both — and whether the retry resolved it. For a data platform team, this is the tool that turns "the AI gave us a wrong number" into "the faithfulness evaluator on the MAU query span scored 0.31 at 9:42am, the retry corrected the column join, and the revised score was 0.91."
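The decision logic described above can be sketched as a small pure function. This is an illustration only: rules and actions are configured in the dashboard rather than in application code, and the `chooseAction` function, the 0.7 threshold, the single-retry limit, and the mapping of span type to escalation path are all assumptions, not TruLayer's API or defaults.

```typescript
type Action =
  | 'pass'
  | 'retry_with_canonical_context'
  | 'route_to_human'
  | 'fallback_model'

interface SpanScore {
  span: 'sql_generation' | 'narrative_generation'
  faithfulness: number // assumed to range from 0 to 1
}

// Picks an action for a scored span. `attempt` is the number of
// retries already made for this span (0 on the first pass).
function chooseAction(
  score: SpanScore,
  attempt: number,
  threshold = 0.7, // assumed value
  maxRetries = 1, // assumed value
): Action {
  if (score.faithfulness >= threshold) return 'pass'
  if (attempt < maxRetries) return 'retry_with_canonical_context'
  // Retries exhausted: escalate. The mapping below is an assumption.
  return score.span === 'sql_generation' ? 'fallback_model' : 'route_to_human'
}

// Mirrors the example in the text: 0.31 on the first pass, 0.91 after retry.
const firstPass = chooseAction({ span: 'sql_generation', faithfulness: 0.31 }, 0)
const afterRetry = chooseAction({ span: 'sql_generation', faithfulness: 0.91 }, 1)
```

Keeping the function pure makes the before/after comparison easy to reproduce: the same score and attempt count always yield the same action.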
See it in practice
Instrument your AI data analysis agent in two lines.
Wrap your LLM client. Every span from the wrapped client is captured and scored by every built-in evaluator. Eval rules and control-loop actions are configured in the dashboard.
import { TruLayer } from '@trulayer/sdk'
import OpenAI from 'openai'
const tl = new TruLayer({ apiKey: process.env.TRULAYER_API_KEY })
const openai = tl.instrument(new OpenAI())
// Every span from this client is captured, scored by all 25
// built-in evaluators, and surfaced in the analytics project.
// Eval rules + control-loop actions are configured in the dashboard,
// not in your application code.
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: task }],
})
Ship reliable AI data analysis agents.
Free tier includes 1M spans / month · No credit card