
Orchestration patterns for agentic dev — what we learned shipping a product with AI agents

April 30, 2026 · Wei Hai


TruLayer v0.1 was built by a team of AI agents — 400 sub-tickets across 20+ production features, closed in 10 days of wall-clock time. The interesting story is not that AI can write code. The interesting story is the orchestration: the specific patterns that turn a collection of capable agents into a system that produces shippable output rather than a pile of partially-merged PRs.

I wrote about what TruLayer is and what shipped in a separate post. This post is about how we built it — specifically the seven patterns that made the sprint work. These are not theoretical. They are things we learned by shipping through failure modes we didn't anticipate.

If you are building agentic dev workflows — or if you are building production AI agents and want to understand the failure modes before they hit you — these patterns apply directly.


1. Specialize the agents

The most common agentic-dev mistake I see is the single-agent-does-everything pattern: one large context, one system prompt, one agent that writes code and tests and docs and manages Linear tickets and reviews its own PRs. The appeal is simplicity. The failure mode is drift.

Generalist agents drift mid-task. They pick up scope that wasn't assigned. They "helpfully" fix things adjacent to the thing they were asked to fix. They de-prioritize their own acceptance criteria when a shiny adjacent problem appears in their context. The longer the context, the worse this gets — a generalist agent in a 50,000-token context is fighting to stay on-task in a way a specialist never has to.

Our agent team has eleven agents. Each is defined by a markdown file in .claude/agents/ that specifies: name, model, scope, tools available, hard rules (what it explicitly does not do), working directories, and a definition of done. A backend engineer agent can read the full codebase but is only authorized to write to backend/, the TypeScript SDK, the Python SDK, and the demo repos. A QA agent can read everything and write tests, but cannot merge PRs or update specifications. A security agent's only job is reading PRs through a specific threat-model lens.

Here is the structure we use for each agent file, abbreviated:

# [agent-name]

## Role
[One-sentence description of what this agent does.]

## Scope
- Owns: [list of directories/surfaces]
- Read-only: [list]
- Never touch: [list]

## Hard rules
- [List of explicit non-negotiables — things this agent does not do
  regardless of what it's asked]

## Definition of Done
[Per-agent DoD that the agent self-checks before reporting done]

The hard rules section matters more than the role description. Every agent file spells out what the agent does not do. The marketing agent's hard rules open with: "You do not send email. Not via Resend, not via SMTP, not via any API or CLI." The docs agent's hard rules include: "Do not invent API behavior — if a doc requires a feature that doesn't exist yet, flag it to the tpm." These constraints are not polite suggestions; each one marks a line the agent refuses to cross, and that refusal is what protects the system.

Specialization pays off at exactly the moment when failure costs the most. The security agent caught a critical auth-gate bug in PR #323 precisely because security review is its only job. The endpoint GET /v1/traces/:id/remediations had been registered without the required authentication and plan-gate middleware, which would have allowed Starter-plan API keys to read encrypted prompt content from any trace in the system. The security agent caught it before anyone else had looked at the PR, opened a fix PR, and left a detailed comment on the original. A general-purpose backend engineer agent would have reviewed its own code and shipped past it. Specialization is the reason that didn't happen.
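The "Owns / Never touch" lists in an agent file can also be checked mechanically against a PR's changed paths. The following is a minimal Python sketch of that idea, not the tooling used in the sprint. The `BACKEND_SCOPE` dictionary, the SDK directory names, and `backend/vendor/` are hypothetical placeholders.

```python
from pathlib import PurePosixPath

# Hypothetical scope for a backend engineer agent. Directory names other
# than backend/ are placeholders, not TruLayer's real layout.
BACKEND_SCOPE = {
    "owns": ["backend/", "sdk-typescript/", "sdk-python/"],
    "never_touch": ["backend/vendor/"],
}


def _under(path: str, prefix: str) -> bool:
    """True if path sits inside the directory named by prefix."""
    parts = PurePosixPath(path).parts
    prefix_parts = PurePosixPath(prefix).parts
    return parts[: len(prefix_parts)] == prefix_parts


def scope_violations(changed_paths, scope):
    """Return the changed paths the agent was not authorized to write."""
    violations = []
    for path in changed_paths:
        owned = any(_under(path, p) for p in scope["owns"])
        forbidden = any(_under(path, p) for p in scope["never_touch"])
        if forbidden or not owned:
            violations.append(path)
    return violations
```

Comparing path components rather than raw string prefixes keeps a directory such as `backendx/` from slipping through as if it were `backend/`.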


2. Worktree isolation for concurrent agents

If you run two agents in the same git working directory at the same time, you will have a bad time. The failure mode is not dramatic — it's subtle. One agent runs git stash, picks up the other agent's unstaged changes, proceeds to work on top of them. The second agent opens a PR with an empty diff. An hour of implementation work has silently evaporated.

The fix is git worktree. Each agent working concurrently on the same repo gets its own on-disk checkout via git worktree add. The worktrees share the git object store, so history is not duplicated, but each has its own working tree and index. Switching branches or editing files in one worktree leaves the files in the other untouched.

# Dispatch pattern: each agent sets up its own worktree first.
# -b creates the task branch; drop it if the branch already exists.
git -C /path/to/repo worktree add -b feat/agent-a-task /path/to/worktree-a
git -C /path/to/repo worktree add -b feat/agent-b-task /path/to/worktree-b

Cleanup is part of done. A worktree that is not removed after the PR merges keeps its branch checked out, so git refuses to check out or delete that branch anywhere else, and stale worktree entries pile up in the repository's metadata. The DoD for every agent task now includes:

git -C /path/to/repo worktree remove --force /path/to/worktree
git -C /path/to/repo worktree prune
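One way to make the cleanup hard to forget is to tie setup and teardown together in code. This is a hedged Python sketch, not the sprint's dispatch tooling. The `agent_worktree` helper and its injectable `run` parameter are hypothetical. It assumes the task branch does not exist yet (hence `-b`), and it removes the worktree when the block exits, which is a simplification of the remove-after-merge rule above.

```python
import subprocess
from contextlib import contextmanager


def _git(repo, *args, run=subprocess.run):
    """Run one git command against repo and raise if it fails."""
    return run(["git", "-C", repo, *args], check=True)


@contextmanager
def agent_worktree(repo, path, branch, run=subprocess.run):
    """Create an isolated worktree on a new branch and always clean it up."""
    _git(repo, "worktree", "add", "-b", branch, path, run=run)
    try:
        yield path
    finally:
        # Runs even if the agent's work raised, so no worktree is left behind.
        _git(repo, "worktree", "remove", "--force", path, run=run)
        _git(repo, "worktree", "prune", run=run)
```

A dispatch would wrap the agent's work in `with agent_worktree(repo, path, branch) as wt:`. Note that `--force` discards uncommitted changes, so the agent has to commit and push before the block exits.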

There is a specific wrinkle we had to solve during this sprint: the parent monorepo at /Users/weihai/git/trulayer/ is a filesystem namespace grouping real git repos — it is not itself a git repo. The isolation: "worktree" flag in our dispatch configuration assumes the monorepo root is a git repo and tries to create the worktree there. It fails silently. The fix was for each backend engineer agent to set up its own worktree manually as the first step of every dispatch. We will automate this in the agent infrastructure, but the manual step is reliable enough to ship with.
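The silent failure described here can be turned into a loud one with a small pre-flight check. This is a sketch under assumptions: `resolve_repo` and `NotAGitRepo` are hypothetical names, and the check only looks for a `.git` entry rather than asking git itself.

```python
from pathlib import Path


class NotAGitRepo(Exception):
    """Raised when a dispatch targets a directory that is not a git repo."""


def resolve_repo(namespace_root, repo_name):
    """Return the repo path, or fail loudly if it has no .git entry."""
    repo = Path(namespace_root) / repo_name
    # .git is a directory in a normal clone and a file in a linked worktree.
    if not (repo / ".git").exists():
        raise NotAGitRepo(f"{repo} has no .git entry; refusing to create a worktree here")
    return repo
```

Running this before `git worktree add` means a dispatch aimed at the namespace directory stops with an explicit error instead of continuing without isolation.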

The mental model here is the same one production multi-agent systems need: shared state with uncoordinated writes produces inconsistent results. Worktree isolation is the dev-time version of the coordination primitive that TruLayer's trace model gives you in production — each agent operates on its own isolated state, and the merge happens at a controlled point.


3. Babysit-merge agents

Engineering agents reliably push PRs and then die before CI completes. This is not a bug in any specific agent — it is a predictable consequence of long agent runtimes hitting token or session limits. The agent has done its job: it implemented the feature, wrote the tests, opened the PR. But CI takes 3-7 minutes, and the agent's context window is exhausted before the checks go green.

The PR sits unmerged. The next dispatch picks up a different task without noticing. The open PR blocks the sprint.

The wrong fix is trying to make agent runtimes longer. The right fix is a separate dispatch whose only job is to watch CI and merge once green. I had three of these running simultaneously by the end of the sprint:

# Babysit-merge agent prompt (simplified):
# Your only job: wait for CI to complete on PR #N, then merge.
# Command: gh pr merge <PR-number> --squash --delete-branch
# If CI fails, report the specific failure. Do not attempt to fix it —
# report the failure back and stop.

The babysit-merge agent is inexpensive — minimal context, simple job, short runtime. It does not need the full codebase in its context. It just needs the PR number and the merge command.

This pattern should be baked into the standard dispatch workflow from the start, not invented mid-sprint. Every engineer agent dispatch should automatically spawn a babysit-merge counterpart as part of the dispatch definition. The overhead is low; the cost of skipping it is unmerged PRs that accumulate until a human notices.

One additional rule the babysit-merge agent carries: it does not fix CI failures. It reports them and stops. This boundary matters. A merge agent that is allowed to fix failures will start pulling in changes it lacks the context for, and you end up with a "merge agent" that has quietly become an engineer agent without the context to do the job safely.
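The babysit-merge job is simple enough to express as a short script. This is a minimal Python sketch rather than the prompt-driven agent used in the sprint. It assumes the GitHub CLI is installed and authenticated, and that `gh pr checks --watch` blocks until checks finish and exits non-zero when any check fails. The `babysit_merge` name and the injectable `run` parameter are hypothetical.

```python
import subprocess


def babysit_merge(pr_number, run=subprocess.run):
    """Wait for CI on one PR, merge if green, otherwise report and stop."""
    pr = str(pr_number)
    checks = run(
        ["gh", "pr", "checks", pr, "--watch"],
        capture_output=True,
        text=True,
    )
    if checks.returncode != 0:
        # Report only. Fixing the failure is explicitly out of scope.
        return {"merged": False, "report": checks.stdout.strip()}
    run(["gh", "pr", "merge", pr, "--squash", "--delete-branch"], check=True)
    return {"merged": True, "report": ""}
```

The function has no code path that edits files, which is the same boundary the prompt enforces: it either merges or hands back the failure report.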


4. Trust but verify any "blocked" or "missing" claim

This is the pattern I would have benefited most from knowing before the sprint.

During the sprint, the QA agent confidently reported that a specific database migration file was absent from the migration index — a blocker, it said, that would cause the schema migration system to fail on deploy. The report was detailed. It named the specific file: 20260430200000_idx_api_keys_revoked_at_null.sql. It explained exactly why the absence was a problem. It was the kind of clear, well-reasoned, citation-backed finding that you want from a QA system.

I checked the actual file before acting. The file was there. The actual filename was 20260430230000_idx_api_keys_revoked_at_null.sql — a three-hour timestamp difference — and it had been correctly added to the migration index. The QA agent had hallucinated the blocker with the same calm certainty it uses to describe real ones. If I had acted on the report — reverted the migration, opened a hotfix PR, delayed the deploy — I would have wasted at minimum an hour and potentially introduced a real problem in the process.

The defense is a hard rule: verify before acting on any "broken / blocked / missing" claim from an agent. The verification cost is small: look at the actual file, run the actual check, read the actual log. The cost of acting on a hallucinated blocker is a wasted dispatch, a delayed launch, and — in the worst case — a real regression introduced to fix a problem that never existed.

Agents do not announce uncertainty. They are no more likely to say "I'm not sure about this" when they are hallucinating than when they are correct. That is what makes the miscalibration dangerous: a confident tone tells you nothing about whether the claim is true. Build verification into your workflow as a hard step, not a judgment call you make based on how confident the agent sounded.

The corollary: never let a "blocked" or "missing" report halt the whole sprint on its own authority. Triage it like a bug report: confirm the symptom is real before escalating the fix.
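For "file is missing" claims specifically, the verification step can be a few lines of code. This is a hedged Python sketch: `verify_missing_claim` is a hypothetical helper, it checks only the directory listing (the format of the migration index is not described in this post), and it assumes migration filenames start with a 14-digit timestamp as in the example above.

```python
import re
from pathlib import Path

# Assumed naming scheme: 14-digit timestamp, underscore, description.
TIMESTAMP_PREFIX = re.compile(r"^\d{14}_")


def verify_missing_claim(claimed_name, directory):
    """Check an agent's 'file X is missing' claim against the real directory."""
    names = sorted(p.name for p in Path(directory).iterdir() if p.is_file())
    stem = TIMESTAMP_PREFIX.sub("", claimed_name)
    # Files with the same description but a different timestamp suggest the
    # agent misremembered the name rather than found a real gap.
    near = [
        n for n in names
        if n != claimed_name and TIMESTAMP_PREFIX.sub("", n) == stem
    ]
    return {"exact_match": claimed_name in names, "near_matches": near}
```

Given the filename the QA agent reported, a result with no exact match but one near match points at a misremembered timestamp, not a missing migration.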


5. Direct-API fallback for tool outages

The Linear MCP — the tool agents use to read and write Linear tickets — failed three times during the sprint. The error was net::ERR_FAILED. No graceful degradation, no retry, just a hard failure that stopped the agent's ability to update issue status, read acceptance criteria, or post comments.

"The tool is down" is not an acceptable reason to block work that has a clear API alternative. Every agent that depends on a critical external tool should have the raw API fallback documented in its prompt context. For Linear, that means:

# Direct GraphQL fallback when Linear MCP is unavailable
curl -s -X POST https://api.linear.app/graphql \
  -H "Content-Type: application/json" \
  -H "Authorization: $LINEAR_API_KEY" \
  -d '{
    "query": "mutation { commentCreate(input: { issueId: \"<issue-id>\", body: \"<comment>\" }) { success } }"
  }'

The agent needs: the GraphQL endpoint, the API key in its context, and the specific query for the operation it needs. The MCP is a convenience wrapper around an API that has been stable for years. When the convenience layer breaks, the stable layer is right there.

The same principle applies to any tool that sits on a network seam. If the tool can be replaced by a direct HTTP call when it goes down, document that call in the agent's context before you need it. You will need it at the worst possible moment — mid-sprint, time pressure, no patience for debugging MCP connectivity — and "fall back to the raw API" needs to be a three-second operation, not a debugging exercise.

Extend this principle: identify the two or three tools each of your agents is most dependent on. For each one, document the fallback. Put it in the agent's system prompt, not in a runbook that the agent can't access.
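The same fallback can be wrapped so that switching layers is automatic. This is a Python sketch under stated assumptions: it mirrors the commentCreate mutation from the curl example, but the GraphQL variable types, the `with_fallback` wrapper, and all function names are my own illustration rather than the sprint's code.

```python
import json
import urllib.request

LINEAR_GRAPHQL_URL = "https://api.linear.app/graphql"

# Variables avoid hand-escaping quotes in the comment body.
# The String! types are an assumption about the schema.
COMMENT_MUTATION = """
mutation($issueId: String!, $body: String!) {
  commentCreate(input: { issueId: $issueId, body: $body }) { success }
}
"""


def build_comment_request(api_key, issue_id, body):
    """Build (but do not send) the direct GraphQL request."""
    payload = {
        "query": COMMENT_MUTATION,
        "variables": {"issueId": issue_id, "body": body},
    }
    return urllib.request.Request(
        LINEAR_GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


def post_comment_direct(api_key, issue_id, body):
    """Send the comment straight to the API and return the parsed response."""
    request = build_comment_request(api_key, issue_id, body)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)


def with_fallback(primary, fallback):
    """Try the convenience tool first; use the raw API if it raises."""
    try:
        return primary()
    except Exception as exc:
        print(f"primary tool failed ({exc!r}); using direct API")
        return fallback()
```

A caller would pass the MCP call as `primary` and `lambda: post_comment_direct(key, issue_id, text)` as `fallback`, so an outage costs one log line instead of a blocked dispatch.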


6. Pre-merge security review against a documented checklist

Vague security review produces vague results. "Check for security issues" is an instruction that will return a paragraph about SQL injection and CSRF that has nothing to do with the actual PR.

The security agent operates on a written threat model — a specific list of items that every PR touching auth, secrets, or data-access is checked against. The list has items A through K. Each item is specific enough to produce a binary answer:

  • A — TenantScope coverage: does every DB query filter by team_id? Are there any raw queries that bypass the tenant-scoped wrapper?
  • B — Auth gate present: does every new endpoint have RequireClerkAuth middleware? Does any endpoint that returns tenant data also check the plan gate?
  • C — No logged credentials: does any log call include tokens, secrets, API keys, or PII?
  • D — IDOR resistance: can a Starter-plan API key reach resources belonging to a different team by guessing a UUID?
  • ...and seven more items of the same specificity.

Every PR gets a security review comment before merge. The comment covers each checklist item with a verdict: pass, fail, or not-applicable. This creates a record — every PR has a documented security review, and the review is anchored to specific criteria rather than the agent's general sense of what "secure" means.

This is what caught the auth-gate bug. Item B — "does every new endpoint have RequireClerkAuth and the plan gate?" — is specific enough that the agent can answer it by reading the route registration code. Without item B on the checklist, the security agent would have read the endpoint logic and found it reasonable, because the logic itself was correct. The bug was at the route registration level, above the handler. The checklist forces the review to look in the right place.

The threat model document is worth the hour it takes to write. The investment pays back on the first PR review.
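The checklist also lends itself to a mechanical merge gate: every item needs a verdict, and any "fail" blocks the merge. This is a minimal Python sketch of that idea, not the security agent's implementation. Only items A through D are named in this post, so the sketch includes only those, and `review_gate` and `render_comment` are hypothetical names.

```python
VERDICTS = {"pass", "fail", "n/a"}

# Only items A-D are named in the post; the remaining items are omitted here.
CHECKLIST = {
    "A": "TenantScope coverage",
    "B": "Auth gate present",
    "C": "No logged credentials",
    "D": "IDOR resistance",
}


def review_gate(verdicts, checklist=CHECKLIST):
    """Return (mergeable, problems) for one PR's security review."""
    problems = []
    for item, title in checklist.items():
        verdict = verdicts.get(item)
        if verdict not in VERDICTS:
            problems.append(f"{item} ({title}): no verdict recorded")
        elif verdict == "fail":
            problems.append(f"{item} ({title}): fail")
    return (not problems, problems)


def render_comment(verdicts, checklist=CHECKLIST):
    """Format the per-item verdicts as the PR review comment."""
    lines = ["Security review"]
    for item, title in checklist.items():
        lines.append(f"- {item} {title}: {verdicts.get(item, 'MISSING')}")
    return "\n".join(lines)
```

Treating a missing verdict as a blocker is the design choice that matters: it prevents a review that simply skipped the route registration from counting as a pass.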


7. Strategic veto stays with the human. Always.

Agents execute strategy faithfully. They do not second-guess it. That is a feature. It is also the thing that will burn you if you delegate strategy rather than execution.

During the sprint, the marketing agent dutifully built a public competitor-comparison page per the original launch brief. The PM agent reviewed it against the spec and shipped it. Both agents did exactly what they were asked. The comparison page was accurate, well-structured, and passed all the content checks.

I caught it in the post-launch review and reverted it. Public comparison pages give smaller competitors free visibility you do not want to provide. They anchor customer thinking to "TruLayer vs. X" frames that you don't control. The decision to compare publicly is a strategic one — it depends on market position, competitor momentum, and what you want the first impression to be. Neither agent was in a position to make that judgment, and the spec I had given them did not flag it as a decision point.

The revert cost two PRs and about an hour. The lesson is not "don't let agents touch marketing." The lesson is that strategy decisions need to be explicitly flagged in the brief before the agent picks them up. The brief said "build a comparison page" — that was strategic direction, not task definition. The right brief would have said: "draft a comparison page, but flag it for founder review before shipping — strategic decision, not tactical." The agent would have flagged it. I would have made the call earlier.

The broader rule: for any agent action that is hard to reverse — a published post, a deployed comparison page, a sent email — build an explicit hold-for-review step into the workflow. Agents can research, draft, build, and prepare. Irreversible decisions need a human in the loop at the point of irreversibility, not after.
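The hold-for-review step can be enforced in code rather than left to the brief. This is a hedged Python sketch: the `Action` type, the `reversible` flag, and the `execute` gate are hypothetical, and deciding which actions count as irreversible remains a human judgment made when the action is defined.

```python
from dataclasses import dataclass


class HoldForReview(Exception):
    """Raised when an irreversible action lacks human sign-off."""


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool


def execute(action, perform, approved_by=None):
    """Run perform() unless the action is irreversible and unapproved."""
    if not action.reversible and approved_by is None:
        raise HoldForReview(f"'{action.name}' is irreversible; needs human sign-off")
    return perform()
```

Drafting runs freely because it is marked reversible, while publishing raises until a named human approves it, which places the veto at the point of irreversibility.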


Closing: build the orchestration around the failure modes

The meta-lesson from this sprint is that agent capability is not the constraint. The agents can write correct Go, navigate a complex monorepo, wire up Kafka consumers, catch auth bugs, and write detailed test suites. The constraint is orchestration: the system that turns that capability into output that can be safely shipped.

That system has to be built around the specific ways agents fail:

  • They hallucinate blockers with the same confidence as real ones.
  • They die before their PR merges.
  • They drift into adjacent scope if their boundaries are not explicit.
  • They execute strategic decisions that a human should have made.
  • They lose access to tools at inconvenient moments.
  • They corrupt shared state when multiple agents operate on the same working directory without coordination.
  • They review vaguely when the review criteria are vague.

Every pattern in this post is a direct response to one of those failure modes. Specialization handles scope drift. Worktree isolation handles shared-state corruption. Babysit-merge handles premature agent death. Trust-but-verify handles hallucinated blockers. Direct-API fallback handles tool outages. The security checklist handles vague review. Strategic veto handles strategic delegation.

The next product surface we are building — Guard Mode, a synchronous inline validation layer — is the production version of the same orchestration philosophy. The control loop catches failures after they happen; Guard Mode blocks them before delivery. Same primitives — eval rules, action gates, human-in-the-loop escalation — applied inline instead of post-hoc. The patterns we learned building with agents are the patterns we are shipping as product.


TruLayer is live. Starter tier: 1M spans/month, 1K eval runs, no credit card required.

If you are shipping agentic dev workflows and any of these failure modes look familiar, I would like to hear what patterns you have added. Email wei@trulayer.ai or open a discussion in the docs.