Building TruLayer with an AI agent team

April 29, 2026 · Wei Hai


Before I shipped a single line of TruLayer to production, the system that builds TruLayer almost shipped something that would have tanked our email deliverability for months.

Here is what happened. I was setting up the marketing agent (one of twelve Claude Code agents in the team that built this product) and asked whether it should have permission to send cold outreach emails autonomously. The agent pushed back immediately. Not with a vague "I'm not sure about this." With a specific, reasoned refusal: if you send cold outbound from the same Resend domain you use for transactional email (password resets, onboarding sequences, billing notifications), you risk getting that domain's sender reputation downgraded. Once a transactional domain is flagged as cold-outbound-adjacent, inbox placement for your most important emails degrades and does not recover on its own. The fix requires a separate sending domain, a multi-week warmup period, and potentially a new IP. It's a months-long hole to dig out of.

The marketing agent file — the document that defines what that agent is allowed to do — now has this as a hard rule: it does not send email. Not via Resend, not via any API. Every draft goes to me for review and send from my own client. That's not a soft preference. It's load-bearing architecture.

That moment is the whole point of this post.


What TruLayer is

TruLayer is the reliability platform for production AI agents. It does three things together that you can't get from any single tool today: trace every step of a multi-agent workflow, evaluate every output inline with LLM-judge scorers, and act on failures automatically — retry with a fallback model, modify the prompt, trigger a webhook — before your user sees the broken response. Observe, eval, remediate, in one closed loop.
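To make the observe, eval, remediate loop concrete, here is a minimal sketch of the idea in plain TypeScript. It is not the TruLayer SDK: every type and function name below is hypothetical, the model calls are synchronous stand-ins for what would be async API calls, and the only remediation shown is "retry with a fallback model".

```typescript
// Hypothetical types: none of these names come from the TruLayer SDK.
type LoopModelCall = (prompt: string) => string;
type LoopScorer = (prompt: string, output: string) => number; // 0..1, higher is better

interface LoopTraceStep {
  model: string;
  output: string;
  score: number;
  passed: boolean;
}

interface LoopResult {
  output: string | null; // null means every candidate failed the eval
  steps: LoopTraceStep[];
}

// Try each model in order and return the first output whose score clears the
// threshold. Every attempt is recorded so the failure path stays visible.
function runClosedLoop(
  prompt: string,
  models: Array<{ label: string; call: LoopModelCall }>,
  scorer: LoopScorer,
  threshold: number,
): LoopResult {
  const steps: LoopTraceStep[] = [];
  for (const model of models) {
    const output = model.call(prompt); // observe
    const score = scorer(prompt, output); // eval
    const passed = score >= threshold;
    steps.push({ model: model.label, output, score, passed });
    if (passed) return { output, steps }; // good enough: stop here
    // remediate: fall through to the next (fallback) model
  }
  return { output: null, steps };
}
```

The caller decides what a `null` output means; surfacing it for human review is one option.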

I built it because every observability tool I tried showed me exactly what broke and offered me nothing automatic to do about it.


Why an agent team

I'm a solo founder. The alternative to this approach was roughly 18 months of solo engineering across a Go API server, a Next.js dashboard, a TypeScript SDK, a Python SDK, a marketing site, public docs, infra-as-code, CI/CD, and a launch campaign — or funding a 6-engineer team, which I didn't have. Neither was viable on the timeline I needed.

What was viable: a structured team of AI coding agents, each with a narrow, well-defined scope, operating under a shared context system, with me as the supervisor and decision-maker — not the typist.

I want to be specific about what "supervised" means here. I didn't run agents and accept their output. I read every PR, pushed back on specs, rejected implementations that drifted from the acceptance criteria, and caught the places where agents made assumptions I hadn't authorized. The agents handled the volume of implementation work. I handled the judgment calls. That division of labor is the reason this worked.


How the system is structured

The team has twelve agents. Each is defined by a markdown file under .claude/agents/ in the monorepo. That file specifies the agent's name, the model it runs on, its scope, its hard rules, its working directories, its Definition of Done, and how it hands off to other agents. These files are loaded into context on every session start via CLAUDE.md files at every level of the directory tree.
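One way to picture the contents of an agent file is as a typed record with one field per section listed above. This is only an illustrative sketch: the real files are markdown, the field names are invented here, and the model, directory, and Definition of Done values are placeholders because the post does not state them.

```typescript
// Hypothetical in-memory shape for one agent file. Field names are
// illustrative; they mirror the sections the post says each file contains.
interface AgentDefinition {
  agentName: string;
  model: string;
  scope: string;
  hardRules: string[];
  workingDirectories: string[];
  definitionOfDone: string[];
  handoff: string;
}

// Return the names of sections that are empty, so an incomplete agent file
// can be rejected before a session starts.
function missingSections(def: AgentDefinition): string[] {
  const missing: string[] = [];
  if (def.agentName.trim() === "") missing.push("agentName");
  if (def.model.trim() === "") missing.push("model");
  if (def.scope.trim() === "") missing.push("scope");
  if (def.hardRules.length === 0) missing.push("hardRules");
  if (def.workingDirectories.length === 0) missing.push("workingDirectories");
  if (def.definitionOfDone.length === 0) missing.push("definitionOfDone");
  if (def.handoff.trim() === "") missing.push("handoff");
  return missing;
}

const marketingAgent: AgentDefinition = {
  agentName: "marketing",
  model: "<model id>", // placeholder: not stated in the post
  scope: "Outbound research, prospect list, cold email drafts, social post drafts",
  hardRules: ["You do not send email."],
  workingDirectories: ["<path>"], // placeholder
  definitionOfDone: ["<criteria>"], // placeholder
  handoff: "Return artifact paths and the updated Linear issue to tpm",
};
```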

The agents, roughly in order of who touches what:

  • tpm — sprint planning, cross-repo orchestration, Linear issue sync. Dispatches all other agents. Never writes application code.
  • pm — PRDs, user stories, acceptance criteria, marketing site copy, blog posts. That's me, in agent form.
  • qa — test strategy, Definition of Done gate. Derives test cases from acceptance criteria. Does not write features.
  • security — vulnerability triage and hardening. Dispatched by tpm when a backend or infra PR touches auth, secrets, or the threat model.
  • staff-backend-a / staff-backend-b — co-own the Go monorepo (API server, Kafka consumer), the TypeScript SDK, the Python SDK, and the runnable demo repos. Non-overlapping assignments to prevent git conflicts.
  • staff-frontend-a / staff-frontend-b — co-own the Next.js dashboard and the marketing site. Same non-overlap rule.
  • devops-a / devops-b — deployment pipelines, Railway services, Vercel config, Cloudflare Workers, secrets management, CI/CD. Everything in infra/.
  • dx — public developer docs at docs.trulayer.ai. Triggered whenever a backend or SDK change affects the public developer surface.
  • marketing — outbound research, prospect list, cold email drafts, social post drafts. Research-and-drafts only.

The routing rule is: work comes in to tpm, tpm dispatches to the right agent(s), agents return artifact paths and the Linear issue they updated. Agents do not hand off directly to each other except through tpm. This keeps the dependency graph legible and prevents an agent from silently taking on scope that belongs to another.
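The routing rule is small enough to state as a predicate. The sketch below is an assumption about how one might encode it, not something taken from the agent files: a handoff between two agents is legal only when tpm is on one side of it.

```typescript
// Agent identifiers as listed in the post.
type TeamAgentId =
  | "tpm" | "pm" | "qa" | "security"
  | "staff-backend-a" | "staff-backend-b"
  | "staff-frontend-a" | "staff-frontend-b"
  | "devops-a" | "devops-b"
  | "dx" | "marketing";

// Hub-and-spoke routing: tpm dispatches work and receives results.
// Any other agent-to-agent handoff is rejected.
function isAllowedHandoff(from: TeamAgentId, to: TeamAgentId): boolean {
  return from !== to && (from === "tpm" || to === "tpm");
}
```

With twelve agents, this leaves eleven spokes to reason about instead of every possible pair.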

Linear is the source of truth for what's being built. Every spec lives in a Linear issue body. Every implementation task maps to a Linear issue. The agents read and write Linear via the MCP tools. When tpm and an engineer agent disagree about scope, Linear wins.


The hard rules

The most important thing in each agent file is not the description of what the agent does — it's the explicit list of what it does not do.

The marketing agent's hard-rules section opens with three non-negotiables:

- You do not send email. Not via Resend, not via SMTP, not via any API
  or CLI. You have no email-sending credentials and you do not request
  them. Every email goes through the founder's own client.
- You do not post to LinkedIn, X, Product Hunt, Hacker News, YC
  Bookface, Discord, Slack, Reddit, or any external channel. You
  produce the draft; the founder posts.
- You do not initiate outreach via any direct channel — no DMs, no
  LinkedIn connection requests, no GitHub messages. Research is
  read-only.

...and four more rules of the same shape.

The email rule exists because of the specific deliverability failure mode described at the top of this post. The "no posting" rule exists because a draft that ships without founder review is a brand decision made without a human in the loop. These aren't theoretical risks — they're the specific ways an unconstrained marketing agent causes real production damage.

Here is the DX agent's boundary:

# What you do NOT do

- Do not edit application code in backend/, frontend/, or any SDK repo
- Do not invent API behaviour — if a doc requires a feature that
  doesn't exist yet, flag it to the tpm
- Do not merge your own PR without CI passing
- Do not document internal tooling, Railway config, or infrastructure

The "do not invent API behaviour" rule is the one that matters most. An agent writing docs for a feature that doesn't exist yet would create a class of developer trust failure that's very hard to walk back — you can correct a typo, but you can't easily rebuild trust with a developer who integrated against a capability that turned out to be fiction. The rule forces the agent to surface the gap as a spec issue rather than paper over it.

The devops agents (devops-a and devops-b) co-own deployment and secrets management, and carry a matching hard rule: no secrets are committed to any file — not to .env.example, not to any committed config. All secrets live in Railway, Vercel, or GitHub Actions secrets. The Definition of Done for every devops task includes a "no secrets committed" check as an explicit gate.
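A "no secrets committed" gate can start as a simple heuristic over file contents. The sketch below is an assumption about what such a check might look like, not the team's actual gate: the key-name and placeholder patterns are invented for illustration, and a real setup would more likely rely on a dedicated secret scanner.

```typescript
interface SecretFinding {
  file: string;
  line: number; // 1-based
  key: string;
}

// Heuristic (hypothetical patterns): a KEY=value line whose key looks
// sensitive and whose value is neither empty nor an obvious placeholder.
const SENSITIVE_KEY = /(SECRET|TOKEN|PASSWORD|API_KEY)/i;
const PLACEHOLDER_VALUE = /^(|<.*>|changeme|your[-_].*|xxx+)$/i;

function findCommittedSecrets(file: string, content: string): SecretFinding[] {
  const findings: SecretFinding[] = [];
  content.split("\n").forEach((raw, index) => {
    // Comment lines (starting with #) do not match and are skipped.
    const match = /^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$/.exec(raw);
    if (!match) return;
    const key = match[1] ?? "";
    // Strip surrounding whitespace and one layer of quotes from the value.
    const value = (match[2] ?? "").trim().replace(/^["']|["']$/g, "");
    if (SENSITIVE_KEY.test(key) && !PLACEHOLDER_VALUE.test(value)) {
      findings.push({ file, line: index + 1, key });
    }
  });
  return findings;
}
```

An empty `TRULAYER_API_KEY=` line in `.env.example` passes this check; the same line with a real value fails it.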

One more example: when I asked the pm agent to draft my LinkedIn About section, it wrote credibility framing around my time at Reddit, ServiceNow, and Zynga, but explicitly flagged in the artifact that it had not invented specific anecdotes about LLM work at those employers, because I hadn't confirmed any such anecdotes existed. A note in the file reads: "Do not add specific anecdotes about any employer's internal systems without confirming with Wei first." That's an agent maintaining honesty when fabricating would have made the copy read better on its face. That's the right failure mode.


The orchestrator: why tpm makes parallelism possible

The routing rule I described above — all work flows through tpm, never agent-to-agent directly — sounds bureaucratic until you try running without it.

TPM's job is not to manage; it's to set the dependency graph. When I gave the team a new initiative — for example, drafting the launch plan for our April 2026 go-to-market push — tpm broke the work into eight Linear issues (TRU-341 through TRU-348) and routed them across pm, dx, and devops in explicit dependency order: positioning doc before launch blog, launch blog before announcement email, infrastructure gates before any go-live date was published. Agents didn't negotiate that order with each other. TPM set it and the agents worked their queue.
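Setting a dependency order like this is a topological sort. Here is a minimal sketch, assuming each task lists the tasks it waits on. The task labels are shorthand for the deliverables named above; they are not the actual Linear issue titles, and the mapping to TRU-341 through TRU-348 is not given in this post.

```typescript
interface PlanTask {
  id: string;
  dependsOn: string[];
}

// Level-by-level topological sort: each round emits every task whose
// dependencies are already done. Tasks emitted in the same round could be
// dispatched in parallel.
function dispatchOrder(tasks: PlanTask[]): string[] {
  const done = new Set<string>();
  const order: string[] = [];
  let pending = tasks.slice();
  while (pending.length > 0) {
    const ready = pending.filter((t) => t.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) {
      const stuck = pending.map((t) => t.id).join(", ");
      throw new Error("cycle or unknown dependency among: " + stuck);
    }
    for (const t of ready) {
      done.add(t.id);
      order.push(t.id);
    }
    pending = pending.filter((t) => !done.has(t.id));
  }
  return order;
}

const launchPlan: PlanTask[] = [
  { id: "announcement-email", dependsOn: ["launch-blog"] },
  { id: "publish-go-live-date", dependsOn: ["infra-gates"] },
  { id: "launch-blog", dependsOn: ["positioning-doc"] },
  { id: "infra-gates", dependsOn: [] },
  { id: "positioning-doc", dependsOn: [] },
];
```

Calling `dispatchOrder(launchPlan)` places the positioning doc before the launch blog and the launch blog before the announcement email, whatever order the tasks were listed in.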

The harder part is what tpm holds back. When a request would pull the team into expensive, low-leverage work — building Helm charts for self-hosting before any enterprise customer asked for it, standing up regional data planes before a paying EU customer made it a contract condition — the tpm's job is to push back, move the issue to a later milestone, and leave a comment in Linear explaining the call. That's a deprioritization authority built into the agent file explicitly. Not "maybe defer this" — "you have permission and the obligation to deprioritize this and explain why."

The result is that the team can run multiple streams in parallel without generating chaos. Staff-backend-a and staff-frontend-a can be in-flight simultaneously because tpm ensured their work didn't share a dependency boundary. When it does — when a backend contract change would break the frontend — tpm serializes them, not me.


The gatekeeper: why the agent that says no matters

Every engineering team has someone who slows things down in the right ways. On this team it's qa.

QA's role is the Definition of Done gate: for any task in backend/ or frontend/, the task is only done when both unit tests and integration/E2E tests land in the same PR as the feature code. Not in a follow-up. Not as a separate issue. In the same change.
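The "same PR" rule can be checked mechanically from a list of changed files. The sketch below is hypothetical: the file-naming conventions (`_test.go`, `.test.ts`, `integration/`, `e2e/`) are assumptions made for illustration, not the repo's documented layout.

```typescript
interface DodVerdict {
  ready: boolean;
  reasons: string[];
}

// Hypothetical conventions: integration/E2E tests live under a directory
// named integration/ or e2e/; unit tests end in _test.go or .test.ts(x).
function checkDefinitionOfDone(changedFiles: string[]): DodVerdict {
  const inScope = changedFiles.filter(
    (f) => f.startsWith("backend/") || f.startsWith("frontend/"),
  );
  const isIntegration = (f: string): boolean => /(^|\/)(integration|e2e)\//.test(f);
  const isUnit = (f: string): boolean =>
    !isIntegration(f) && /(_test\.go|\.test\.tsx?)$/.test(f);
  const featureFiles = inScope.filter((f) => !isIntegration(f) && !isUnit(f));

  const reasons: string[] = [];
  // Tests are only required when the PR changes feature code.
  if (featureFiles.length > 0) {
    if (!inScope.some(isUnit)) reasons.push("no unit tests in this PR");
    if (!inScope.some(isIntegration)) reasons.push("no integration/E2E tests in this PR");
  }
  return { ready: reasons.length === 0, reasons };
}
```

A check like this only proves that test files are present; whether they cover the acceptance criteria is still a review question.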

This rule is in the qa agent file. It's in the root CLAUDE.md. It's in every sub-repo CLAUDE.md. It's in the user memory that persists across sessions. The redundancy is intentional — the rule needs to be load-bearing regardless of which agent is reading context at the time.

The qa agent's file is explicit about what "coverage" means: it's not that tests exist, it's that they would fail if you mutated the feature code. A test that passes for the wrong reason is worse than no test. QA checks acceptance criteria against test cases one-to-one, names specific gaps ("error path on line 47 has no test"), and returns a verdict of "ready to close" or "blocked: reason." TPM does not close a Linear issue without that verdict.
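The "would fail if you mutated the feature code" standard is the idea behind mutation testing. A tiny worked example, using a hypothetical limit check rather than any real TruLayer code:

```typescript
// Feature under test (hypothetical): is usage within a limit, inclusive?
type LimitCheck = (used: number, limit: number) => boolean;

const withinLimit: LimitCheck = (used, limit) => used <= limit;
// A mutant: the same function with the boundary operator changed.
const withinLimitMutant: LimitCheck = (used, limit) => used < limit;

// Weak test: passes for the original AND the mutant, so it proves little.
const weakTest = (fn: LimitCheck): boolean => fn(10, 100) === true;
// Boundary test: passes for the original, fails for the mutant.
const boundaryTest = (fn: LimitCheck): boolean =>
  fn(100, 100) === true && fn(101, 100) === false;

// A test "kills" a mutant when it passes on the original and fails on the mutant.
function killsMutant(
  test: (fn: LimitCheck) => boolean,
  original: LimitCheck,
  mutant: LimitCheck,
): boolean {
  return test(original) && !test(mutant);
}
```

Here `killsMutant(weakTest, withinLimit, withinLimitMutant)` is `false` and `killsMutant(boundaryTest, withinLimit, withinLimitMutant)` is `true`. Only the second test counts as coverage in the sense described above.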

I have asked, more than once, whether we could defer tests on a small change and add them in a follow-up. The answer from the qa agent is always the same: if the task is too big to ship with tests in one go, propose the split before writing code. That is the right answer. It's also the answer that's easy to override when you're in a hurry, which is exactly why it's enforced by an agent with an explicit mandate rather than by a convention in a document nobody reads.

The production parallel is exact: TruLayer's eval rules are the same class of mechanism — a configured gate that runs on every span, not a post-hoc audit.


The backend pair: coordination at the git layer

Staff-backend-a and staff-backend-b co-own the same Go monorepo. Running them in parallel means two agents targeting the same working directory, and early on that caused the obvious problem: concurrent git stash, checkout, and branch-switch operations would overwrite each other's staged changes. An agent that had accumulated an hour of work could lose it to a single git checkout from its parallel counterpart.

The fix is mandatory worktree isolation. When tpm dispatches two backend agents to work concurrently on the same repo, each gets a separate git worktree: a distinct on-disk checkout of the same repository. They share the git object store but operate on independent working trees, so branch operations in one have no effect on the other. This is now a hard rule in the root CLAUDE.md: parallel agents dispatched to the same repo must use isolation: "worktree", and the dispatch instructions require the worktree to be removed and pruned after the PR merges.
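In plain git terms, the lifecycle is `git worktree add` before the work starts, then `git worktree remove` and `git worktree prune` after the merge. The sketch below only builds those command lines; it is an illustration with hypothetical paths and branch names, and it makes no claim about what the dispatch option does internally.

```typescript
interface WorktreePlan {
  create: string[][]; // argv arrays to run before the agent starts
  cleanup: string[][]; // argv arrays to run after the PR merges
}

// Give each agent its own checkout directory and its own new branch.
function planWorktree(agentId: string, branch: string, baseDir: string): WorktreePlan {
  const checkoutPath = `${baseDir}/${agentId}`;
  return {
    // Creates a new branch and checks it out in a separate working tree.
    create: [["git", "worktree", "add", "-b", branch, checkoutPath]],
    cleanup: [
      ["git", "worktree", "remove", checkoutPath],
      ["git", "worktree", "prune"], // drop stale worktree metadata
    ],
  };
}
```

Two agents planned this way get different directories, so a checkout in one never touches the other's working tree or index.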

The production analog is the same coordination failure at a different layer: two agents writing to shared state without coordination. In a production multi-agent workflow, this produces wrong output that's hard to attribute. TruLayer traces every agent hop, so when it happens you can see exactly which agent wrote what and in what order. The dev-time worktree rule and the prod-time trace are solving the same problem with the same underlying logic: isolate the working state, make conflicts visible, clean up after the work is done.


The through-line: these are the same failure modes

Here is what I noticed building this system: every engineering decision I made about the agent team maps directly to a class of failure that TruLayer catches in production AI applications.

Scope drift. Without explicit scope rules in the agent files, agents would over-reach into each other's surfaces: a backend agent editing frontend components, a docs agent inventing API behavior, a marketing agent sending email. In a production LLM application, the equivalent is an agent that takes actions outside its intended task: tool calls it was never supposed to make, outputs it was never supposed to produce. TruLayer's eval rules are the control-loop equivalent of a scope rule: define what outputs are acceptable, score every output against those rules, act when something fires.
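The simplest form of a scope rule is a per-agent allowlist of tools, checked against the recorded calls. This sketch is a deterministic stand-in written for illustration: it is not TruLayer's rule format or its LLM-judge scoring, and the tool names are made up.

```typescript
interface ScopeRule {
  agent: string;
  allowedTools: string[];
}

interface ToolCallRecord {
  agent: string;
  tool: string;
}

// Return every recorded tool call that falls outside the agent's allowlist.
// An agent with no rule is treated as having an empty allowlist.
function findScopeViolations(rules: ScopeRule[], calls: ToolCallRecord[]): ToolCallRecord[] {
  const allowed = new Map<string, Set<string>>();
  for (const rule of rules) allowed.set(rule.agent, new Set(rule.allowedTools));
  return calls.filter((c) => !(allowed.get(c.agent)?.has(c.tool) ?? false));
}
```

Treating an unknown agent as having no permissions is a deliberate choice here: a missing rule fails closed instead of open.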

Hallucination and faithfulness. The pm agent refused to invent LinkedIn anecdotes because doing so would produce output that wasn't grounded in confirmed facts. In production agents — a RAG assistant, a customer-support bot, a coding agent — hallucination is the same failure mode at higher stakes. TruLayer's hallucination and faithfulness evaluators run inline on every span, scoring whether the output is grounded in the retrieved context and whether it matches the source material. The dev-time version of this is "the agent file says don't invent"; the production version is "the eval rule says flag ungrounded claims."

Destructive actions without constraints. The marketing agent has no email-sending credentials because the failure mode of autonomous email dispatch is worse than the inconvenience of a manual step. The same logic applies to any production agent that can write to a database, send a notification, charge a payment method, or execute arbitrary code. TruLayer's control loop is the runtime layer of this constraint: if an eval fires indicating an agent is about to take a destructive action outside its expected behavior, the loop can block the action, trigger a fallback, or surface it for human review.

Coordination failures. The worktree isolation rule — both backend agents get separate git worktrees when working concurrently — came from experience: before we added it, concurrent agents would corrupt the working directory and lose each other's changes. The production analog is multi-agent workflows where two agents write to the same state without coordination. TruLayer traces every agent hop in a multi-agent workflow, so when a coordination failure produces a bad output, you can see exactly which agent produced it and at which step.

The pattern is not "AI is dangerous, be careful." The pattern is: the engineering discipline that makes a multi-agent dev workflow safe and reliable is exactly the same discipline you need to make production LLM agents reliable. Scope rules, grounding constraints, action blockers, coordination primitives. TruLayer is the runtime layer of that discipline — the part that runs against your production agents after you deploy them.


What didn't work

I want to be direct about the failures, because posts that only describe what went well are not useful.

Overlapping scope caused agent conflict. Early versions of the agent files had fuzzy boundaries between the pm agent and the marketing agent. Both could touch positioning copy. Both could write outreach drafts. The result was agents producing contradictory artifacts on the same surface. The fix was the hard rule: pm owns the marketing site and positioning doc; marketing agent consumes those but does not modify them. A single owner per surface is non-negotiable.

Agents over-edited each other's work. Before the "do not modify" rules were explicit, an agent asked to improve one surface would sometimes reach into an adjacent surface it considered related. A frontend agent tidying up copy might quietly change a CTA label that pm had carefully chosen. These were small edits individually; cumulatively they introduced semantic drift across the product surface. The rule now is explicit per agent: here is the directory you own, here are the directories you may read but not modify, here are the directories you never touch.
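The own / read-only / never-touch split maps naturally onto a path lookup. A minimal sketch, with all names hypothetical and an example policy whose directories are assumptions for illustration (the post names backend/, frontend/, and infra/, but not which agent may read what):

```typescript
type PathAccess = "write" | "read" | "none";

interface DirectoryPolicy {
  owns: string[]; // may read and write
  readOnly: string[]; // may read, never write
  // anything not listed: never touch
}

function accessFor(policy: DirectoryPolicy, filePath: string): PathAccess {
  // Match on a full directory prefix so "docs/" does not match "docs-old/".
  const under = (dir: string): boolean => {
    const prefix = dir.endsWith("/") ? dir : dir + "/";
    return filePath.startsWith(prefix);
  };
  if (policy.owns.some(under)) return "write";
  if (policy.readOnly.some(under)) return "read";
  return "none";
}

// Illustrative policy for a docs-writing agent; "docs/" is an assumed path.
const docsAgentPolicy: DirectoryPolicy = {
  owns: ["docs/"],
  readOnly: ["backend/", "frontend/"],
};
```

Defaulting to `"none"` means a newly added directory is off limits until someone assigns it an owner.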

Shared working directories caused git conflicts. When staff-backend-a and staff-backend-b were dispatched to work concurrently on the same Go repo, they would conflict at the git level — one agent's git stash or checkout call would overwrite the other's staged changes. This cost one agent an entire session's worth of work before we caught the pattern. The fix was mandatory worktree isolation, now codified in the root CLAUDE.md as a hard dispatch rule.

These are solvable problems. They're also the kind of problems you don't know to solve until you hit them.


Try it

TruLayer is live. Free tier: 1M spans/month, 1K eval runs, no credit card required.

Instrumentation is two lines:

const tl = new TruLayer({ apiKey: process.env.TRULAYER_API_KEY })
const openai = tl.instrument(new OpenAI())

Every call is now traced, evaluated, and monitored. The control loop is yours to configure.

If you're building production AI agents and the failure modes in this post sound familiar, I'd like to hear what's broken in your stack.