Use case
Browser-Use AI Agents
Your browser agent misidentified a "Save Draft" button as "Submit" because the CSS classes were identical. A draft contract went to the counterparty. The action was irreversible by the time it completed. Your eval pipeline scored the output after the fact, and that score is exactly what keeps the same DOM misidentification from auto-executing in the next user’s checkout flow.
Where things go wrong
Wrong-form submission (DOM misidentification)
The agent identifies "Save Draft" as the intended submit target because both buttons share a CSS class and the agent’s element-selector reasoning was based on class name, not element position or ARIA label. A draft contract is sent to the counterparty. The action completes before any downstream system can intervene.
An irreversible document action on behalf of the user; undoing it requires manual counterparty outreach, legal review, or both.
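One way to make this failure less likely is to resolve the click target by accessible name rather than by CSS class, and to refuse to act when the match is missing or ambiguous. The sketch below is illustrative only: `ButtonCandidate` and `resolveTarget` are hypothetical names, and the element list stands in for whatever your browser driver actually returns.

```typescript
// Hypothetical, simplified view of a clickable element (not a real DOM type).
interface ButtonCandidate {
  selector: string; // selector the agent would click
  cssClass: string; // shared styling class; deliberately ignored below
  text: string; // visible button text
  ariaLabel?: string; // accessible label, when the page provides one
}

// Returns the single candidate whose accessible name matches the intended
// action, or null when no element or more than one element matches.
function resolveTarget(
  candidates: ButtonCandidate[],
  intendedName: string,
): ButtonCandidate | null {
  const wanted = intendedName.trim().toLowerCase();
  const matches = candidates.filter((c) => {
    const name = (c.ariaLabel ?? c.text).trim().toLowerCase();
    return name === wanted;
  });
  // Refuse to act unless exactly one element matches.
  return matches.length === 1 ? matches[0] : null;
}

// Two buttons with identical classes, as in the scenario above.
const buttons: ButtonCandidate[] = [
  { selector: "#btn-1", cssClass: "btn btn-primary", text: "Save Draft" },
  { selector: "#btn-2", cssClass: "btn btn-primary", text: "Submit", ariaLabel: "Submit" },
];
```

A `null` result is the signal to stop and ask for confirmation instead of guessing between look-alike buttons.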
Credential retry account lockout
The agent retries a failed login attempt with slight password variations to recover from an authentication error. The third failed attempt locks the account. Each retry was a valid, well-formed tool call — the function-call evaluator would score each one as correctly invoked. The failure is at the strategy level: retrying credentials without a lockout threshold is the wrong action sequence for this workflow type.
Account lockout requiring user support intervention; if the locked account belongs to a customer of the product, the support cost and user friction are both attributable to the agent’s retry behavior.
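The strategy-level fix can be sketched as a small decision function that looks at the history of login results and decides whether another attempt is allowed. Everything here is an assumption for illustration: the result labels, the `nextLoginStep` name, and both budgets, which you would set below the lockout threshold of the site in question.

```typescript
type LoginResult = "ok" | "bad_credentials" | "transient_error";
type NextStep = "done" | "attempt" | "escalate";

// Assumed budgets; tune them to stay well below the site's lockout threshold.
const MAX_CREDENTIAL_FAILURES = 1;
const MAX_TOTAL_ATTEMPTS = 3;

// Decides the next step from the results of the attempts made so far.
function nextLoginStep(history: LoginResult[]): NextStep {
  if (history.some((r) => r === "ok")) return "done";

  // A rejected credential is never retried with a guessed variation.
  const credentialFailures = history.filter((r) => r === "bad_credentials").length;
  if (credentialFailures >= MAX_CREDENTIAL_FAILURES) return "escalate";

  // Transient errors may be retried, but only within the overall budget.
  if (history.length >= MAX_TOTAL_ATTEMPTS) return "escalate";

  return "attempt";
}
```

With these budgets, a single rejected credential hands control back to the user, while network-style errors can be retried with the same credentials up to three times in total.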
Transaction misrouting from context bleed
The agent fills a payment or transfer form using a dollar amount from the wrong context window — a prior session’s state that was not cleared before the new task began. The form values look plausible to the agent because it is operating from stale state. The transaction fires with the wrong amount or the wrong recipient.
A financial transaction that is difficult or impossible to reverse; the error is traceable to a context management failure in the agent, not to any input the user provided.
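A pre-submit check against the current task is one way to catch stale state before the transaction fires. The sketch below assumes the agent tags every pending form value with the task that produced it; the `PaymentSpec` and `PendingForm` shapes and the `findMismatches` helper are hypothetical.

```typescript
// Hypothetical: payment details as stated in the current task.
interface PaymentSpec {
  taskId: string;
  recipient: string;
  amountCents: number;
}

// Hypothetical: values the agent is about to enter into the form.
interface PendingForm {
  sourceTaskId: string; // task whose state produced these values
  recipient: string;
  amountCents: number;
}

// Lists every way the pending form disagrees with the current task.
// An empty list means the form is consistent with the task.
function findMismatches(spec: PaymentSpec, form: PendingForm): string[] {
  const problems: string[] = [];
  if (form.sourceTaskId !== spec.taskId) {
    problems.push("values come from a different task");
  }
  if (form.recipient !== spec.recipient) {
    problems.push("recipient differs from task");
  }
  if (form.amountCents !== spec.amountCents) {
    problems.push("amount differs from task");
  }
  return problems;
}
```

Submitting only when the returned list is empty turns a silent context bleed into an explicit, reviewable stop.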
Prompt injection via page content
A webpage the agent visits contains hidden or visually obscured text instructing it to navigate to a different URL or perform an additional action before completing the intended task. The agent reads the injected instruction as part of its context and executes it, adding an unintended step to the workflow.
Out-of-scope actions taken on behalf of the user on third-party websites; the injected step is invisible in the agent’s output unless the span is scored for prompt injection.
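Scoring catches an injection after the fact. A complementary guard, sketched here under stated assumptions, is to check each planned navigation against the hosts the task is allowed to touch. `PlannedNavigation` and `isInScope` are hypothetical names, and the host is assumed to have been parsed from the URL already.

```typescript
// Hypothetical: a navigation the agent plans to perform.
interface PlannedNavigation {
  host: string; // host already parsed from the target URL
  reason: string; // the agent's stated reason, kept for audit
}

// True when the host equals an allowed host or is a subdomain of one.
function isInScope(nav: PlannedNavigation, allowedHosts: string[]): boolean {
  const host = nav.host.toLowerCase();
  return allowedHosts.some((allowed) => {
    const base = allowed.toLowerCase();
    const suffix = "." + base;
    // The leading dot prevents "evil-example.com" from matching "example.com".
    return host === base || host.slice(-suffix.length) === suffix;
  });
}
```

A navigation that page content talked the agent into will usually point outside the task's host list, so an out-of-scope result is a reasonable trigger for human confirmation.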
Eval + control loop
What happens when a rule fires
The response
How TruLayer closes the loop
- Function Call
- Prompt Injection
- Faithfulness
- Hallucination
Browser-use agents operate at the intersection of two distinct failure surfaces: the correctness of each individual action and the integrity of the instruction the agent is operating under. The function-call correctness evaluator scores every browser action — click, fill, submit — against the task specification: was the correct element targeted with the correct value, matching what the task required? A DOM misidentification is a function-call correctness failure: the agent invoked a browser action with the wrong element selector. That score fires on the span that produced the misidentification. On the next run of the same browser-task type — the next user’s checkout workflow, the next draft-submission session — the control loop routes the request to a human review queue before the same element-selection pattern can auto-execute again. The action that already completed cannot be recalled; the recurrence is what the control loop prevents.
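To make the idea of scoring an action against the task specification concrete, here is a deliberately simplified sketch. It is not TruLayer's evaluator: the `BrowserAction` shape, the `scoreAction` name, and the equal weighting of the three checks are all assumptions.

```typescript
// Hypothetical representation of a single browser action.
interface BrowserAction {
  kind: "click" | "fill" | "submit";
  targetName: string; // accessible name of the targeted element
  value?: string; // text entered, for fill actions
}

// Returns the fraction of checks that pass, from 0 (nothing matches)
// to 1 (action kind, target, and value all match the specification).
function scoreAction(expected: BrowserAction, actual: BrowserAction): number {
  const checks = [
    expected.kind === actual.kind,
    expected.targetName === actual.targetName,
    (expected.value ?? "") === (actual.value ?? ""),
  ];
  const passed = checks.filter((ok) => ok).length;
  return passed / checks.length;
}
```

In this sketch, a click on "Submit" when the task called for a click on "Save Draft" scores 2/3: the action kind and the (absent) value match, but the target does not.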
The prompt injection evaluator addresses the threat class that is native to browser automation: adversarial content embedded in the pages the agent visits. A webpage that contains hidden instructions designed to redirect the agent’s behavior is an injection attack against the browser session. For browser-use agents operating across third-party websites — procurement portals, counterparty systems, public web forms — the injection surface is every page the agent loads. The prompt injection evaluator scores whether the agent’s context was overridden by page content, and the control loop routes flagged task types to human review before the same page-content override can execute on the next user’s session. The faithfulness evaluator runs alongside injection scoring: it catches the case where the agent’s chosen action is not grounded in the task specification it was given, regardless of whether an injection is present — the stale-context-bleed failure, where a prior session’s amount or recipient bleeds into the current transaction, is a faithfulness failure.
When a rule fires, the remediation is specific to the task type that produced the failure. For a DOM misidentification on a checkout flow, the control loop routes that browser-task type to a human review queue: a human operator reviews the element-selection decision before the same task type auto-executes on the next user’s workflow. For a prompt injection detection on a specific domain, the rule can route all future agent sessions on that domain to human confirmation before proceeding. The per-trace before/after delta shows which span produced the failing score and what routing action the control loop took — for an engineering team operating browser automation at scale, this surfaces the failure patterns that are endemic to specific task types or specific websites, so the rule configuration can be tightened before the pattern affects more users.
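The routing behavior described above can be pictured with a small in-memory model. This is a sketch, not TruLayer's implementation or configuration format: `createControlLoop`, the evaluator labels, and the convention that scores run from 0 to 1 with higher meaning better are all assumptions.

```typescript
type Route = "auto_execute" | "human_review";

// Hypothetical score emitted for one span of a given task type.
interface EvalScore {
  taskType: string;
  evaluator: string;
  score: number; // assumed 0..1, higher is better
}

// Hypothetical rule: fire when an evaluator's score drops below a threshold.
interface Rule {
  evaluator: string;
  threshold: number;
}

function createControlLoop(rules: Rule[]) {
  // Task types that have had at least one rule fire.
  const flagged: Record<string, boolean> = {};
  return {
    // Record a score; a firing rule flags the task type for future runs.
    record(s: EvalScore): void {
      for (const rule of rules) {
        if (rule.evaluator === s.evaluator && s.score < rule.threshold) {
          flagged[s.taskType] = true;
        }
      }
    },
    // Decide how the next request of this task type is handled.
    routeNext(taskType: string): Route {
      return flagged[taskType] === true ? "human_review" : "auto_execute";
    },
  };
}
```

The completed action is untouched; only the next request of the flagged task type changes route, which mirrors the recurrence-prevention behavior described above.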
See it in practice
Instrument your browser-use AI agent in two lines.
Wrap your LLM client. Every span from the wrapped client is captured and scored by every built-in evaluator. Eval rules and control-loop actions are configured in the dashboard.
import { TruLayer } from '@trulayer/sdk'
import OpenAI from 'openai'

const tl = new TruLayer({ apiKey: process.env.TRULAYER_API_KEY })
const openai = tl.instrument(new OpenAI())

// Every span from this client is captured, scored by all 25
// built-in evaluators, and surfaced in the browser-automation project.
// Eval rules + control-loop actions are configured in the dashboard,
// not in your application code.

// Placeholder task description (illustrative; supply your own).
const task = 'Save the contract as a draft. Do not submit it.'

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: task }],
})

Ship reliable browser-use AI agents.
Free tier includes 1M spans / month · No credit card