Accessibility Trees vs Screenshots in LLM Browser Agents

Every time your browser agent takes a screenshot, it burns roughly 1,500 vision tokens before it has even thought about what to do. Multiply that by 20 steps per task and you have spent 30,000 tokens on pixels. There is a better representation sitting directly in the browser's process memory: the accessibility tree.

Why This Matters

The choice between screenshots and accessibility trees is not a UI preference — it is an architecture decision that cascades into cost, latency, reliability, and the kinds of tasks your agent can actually solve. The Microsoft Playwright MCP server makes this choice explicit: it exposes browser control through the accessibility tree by default, skipping vision entirely. Understanding why this works — and where it breaks — is the key engineering decision for anyone building browser automation pipelines today.

The Accessibility Tree: What It Actually Is

Every modern browser maintains two parallel representations of a web page. The first is the DOM (Document Object Model), the raw HTML tree you see in DevTools. The second is the accessibility tree, a parallel structure that the browser computes and exposes to assistive technology: screen readers, keyboard navigation tools, and automation frameworks.

The accessibility tree prunes and annotates the DOM. Elements with no semantic value (layout-only divs, spacer images with empty alt text, anything marked aria-hidden) are excluded. Every included node has a role drawn from the ARIA specification: button, textbox, link, combobox, list, dialog, and so on. Each node has a computed accessible name, derived from aria-label attributes, alt text, label associations, or inner text. Nodes carry states (checked, disabled, expanded, focused) and properties (required, readonly, multiselectable). Critically, the role also implies the available actions: a button can be pressed, a textbox can be filled, a combobox has selectable options.

Here is a minimal example. Given this HTML:

<button aria-label="Submit form" disabled>
  <span>Submit</span>
</button>
<input type="text" placeholder="Email" name="email" required />

The browser computes an accessibility tree that looks like:

role=button name="Submit form" disabled=true
role=textbox name="Email" required=true focused=false

The entire page — even a moderately complex one — serializes to a few hundred tokens of structured text. A screenshot of the same page at 1024×768, encoded and sent to a vision model, costs 1,000–5,000 tokens depending on the model's internal tile resolution. The core bet is simple: for most web automation tasks, you do not need the pixels. You need the semantics.
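To make the "structured text" claim concrete, here is a minimal sketch of how such a tree could be flattened into the line format shown above. The AXNode class and serialize function are hypothetical names invented for this illustration; they are not a browser or Playwright API, and the characters-divided-by-four token estimate is only a rough rule of thumb.

```python
from dataclasses import dataclass, field


@dataclass
class AXNode:
    """Hypothetical, simplified accessibility node (not a browser API type)."""
    role: str
    name: str = ""
    states: dict[str, bool] = field(default_factory=dict)
    children: list["AXNode"] = field(default_factory=list)


def serialize(node: AXNode, depth: int = 0) -> str:
    """Render a node and its descendants as one indented line per node."""
    parts = [f"role={node.role}", f'name="{node.name}"']
    # Emit states in insertion order as key=true / key=false pairs.
    parts += [f"{key}={str(value).lower()}" for key, value in node.states.items()]
    lines = ["  " * depth + " ".join(parts)]
    for child in node.children:
        lines.append(serialize(child, depth + 1))
    return "\n".join(lines)


page = AXNode("document", "Sign up", children=[
    AXNode("button", "Submit form", {"disabled": True}),
    AXNode("textbox", "Email", {"required": True, "focused": False}),
])
text = serialize(page)
print(text)

# Very rough size estimate; real counts depend on the model's tokenizer.
approx_tokens = len(text) // 4
print("approx tokens:", approx_tokens)
```

The point of the sketch is the shape of the output: one short line per meaningful node, so the size grows with the number of interactive elements rather than with the number of pixels.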

How Playwright MCP Exposes This

The Model Context Protocol (MCP) is an open protocol, introduced by Anthropic, for connecting LLMs to external tools and data sources. It runs JSON-RPC 2.0 over stdio or HTTP (server-sent events in early revisions, streamable HTTP in later ones), and defines a standard way for a server to expose tools (callable functions with typed parameters), resources (named data blobs), and prompts (reusable message templates) to an LLM client.

The Microsoft Playwright MCP server implements this protocol on top of a Playwright-controlled browser, which runs headless when the server is started with the --headless flag. When the LLM calls a tool, Playwright executes the browser action and returns the updated page state.

The key design decision: by default, the server returns the accessibility tree as the page observation, not a screenshot. Screenshots are available as an explicit opt-in tool (browser_take_screenshot), not the default observation.

The exposed tool surface includes:

browser_navigate(url) — navigate to a URL, returns the accessibility tree
browser_click(element, ref) — click a node by its stable reference ID
browser_type(element, ref, text) — fill a text field by reference
browser_select_option(element, ref, values) — select from a dropdown
browser_scroll(direction, coordinate) — scroll the page
browser_wait_for(time | text) — wait for a duration or until text appears in the tree
browser_take_screenshot() — capture a base64 image (explicit fallback)

The ref parameter is the key detail. Instead of asking the LLM to specify pixel coordinates or describe what to click in natural language, each node in the serialized accessibility tree carries a reference ID that identifies it within that snapshot. The LLM reads the tree, identifies the target node, and passes back the ref. Playwright resolves the ref to the actual DOM element and performs the action. This eliminates the coordinate grounding problem (finding the (x, y) pixel location of an element from a screenshot) that makes vision-based agents brittle. Because the page can change between steps, a ref should be taken from the most recent tree rather than reused from an older one.
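The ref round trip can be sketched in a few lines. This is an illustration of the idea only, not the Playwright MCP server's implementation: assign_refs, resolve, the e1/e2 ref format, and the selector field are all assumptions made for the example.

```python
from itertools import count


def assign_refs(nodes: list[dict]) -> tuple[str, dict[str, dict]]:
    """Give each node a snapshot-scoped ref; return (tree text, ref table)."""
    counter = count(1)
    table: dict[str, dict] = {}
    lines = []
    for node in nodes:
        ref = f"e{next(counter)}"  # illustrative ref format
        table[ref] = node
        lines.append(f'{node["role"]} "{node["name"]}" [ref={ref}]')
    return "\n".join(lines), table


def resolve(ref: str, table: dict[str, dict]) -> dict:
    """Map a ref chosen by the model back to its node, rejecting unknown refs."""
    if ref not in table:
        raise KeyError(f"unknown or stale ref {ref!r}; take a new snapshot")
    return table[ref]


nodes = [
    {"role": "textbox", "name": "Email", "selector": "input[name=email]"},
    {"role": "button", "name": "Submit form", "selector": "button"},
]
tree_text, ref_table = assign_refs(nodes)
print(tree_text)

# The model reads tree_text and answers with a tool call such as:
tool_call = {"name": "browser_click",
             "input": {"element": "Submit form button", "ref": "e2"}}
target = resolve(tool_call["input"]["ref"], ref_table)
print(target["selector"])
```

The model only ever has to copy a short token out of text it has already read, which is a much easier task than estimating pixel coordinates from an image.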

Here is the minimal agent loop in Python:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from anthropic import AsyncAnthropic

MAX_STEPS = 30  # hard cap so a confused agent cannot loop forever

async def run_browser_task(task: str) -> str:
    # Launch the Playwright MCP server as a subprocess speaking MCP over stdio.
    server_params = StdioServerParameters(
        command="npx",
        args=["@playwright/mcp@latest", "--headless"]
    )
    client = AsyncAnthropic()  # async client, so API calls do not block the event loop
    messages = [{"role": "user", "content": task}]

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Convert MCP tool definitions into the Anthropic tool schema.
            tools = [
                {"name": t.name, "description": t.description or "",
                 "input_schema": t.inputSchema}
                for t in (await session.list_tools()).tools
            ]

            for _ in range(MAX_STEPS):
                response = await client.messages.create(
                    model="claude-opus-4-7",
                    max_tokens=4096,
                    tools=tools,
                    messages=messages
                )
                if response.stop_reason != "tool_use":
                    # end_turn, max_tokens, etc.: return whatever text was produced.
                    return "".join(b.text for b in response.content
                                   if b.type == "text")

                tool_results = []
                for block in response.content:
                    if block.type == "tool_use":
                        result = await session.call_tool(block.name, block.input)
                        # Keep only text parts; image parts (screenshots) are skipped here.
                        text = "\n".join(c.text for c in result.content
                                         if c.type == "text")
                        tool_results.append({
                            "type": "tool_result",
                            "tool_use_id": block.id,
                            "content": text or "(no text output)",
                            "is_error": bool(result.isError)
                        })
                messages.append({"role": "assistant", "content": response.content})
                messages.append({"role": "user", "content": tool_results})

            return "Stopped: step limit reached before the task finished."

The full agent loop fits in roughly 50 lines. No vision model, no screenshot encoding, no coordinate mapping.

Token Economics

Let us put concrete numbers on the cost difference.

A screenshot from a modern browser at 1024×768, encoded and transmitted to a vision-capable model:

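claude-sonnet-4-6: up to approximately 1,600 input tokens per image. Anthropic's published estimate is width × height / 750, which gives about 1,050 tokens at exactly 1024×768 and about 1,600 for images near the largest size that is not downscaled.
GPT-4o (high detail): approximately 765 tokens for the same image (four 512×512 tiles at 170 tokens each, plus 85 base tokens)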

A serialized accessibility tree for the same page averages 200–800 tokens. A Google Search results page serializes to roughly 400 tokens. A GitHub PR page with many comments might reach 1,200 tokens — comparable to a screenshot only for the most complex pages, but structured and directly actionable without any pixel interpretation.

For a 20-step web task at claude-sonnet-4-6 pricing ($3.00/M input tokens):

Observation method	Tokens per step	Total (20 steps)	Cost
Screenshot (vision model)	~1,600	~32,000	~$0.096
Accessibility tree	~500	~10,000	~$0.030
Hybrid (tree default, screenshot on demand)	~600	~12,000	~$0.036

At 10,000 tasks per day, the screenshot approach costs roughly $960/day and the accessibility tree approach roughly $300/day. That is a difference of $660/day, or about $240,000/year in token savings from a single architectural decision.
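The arithmetic behind these figures is easy to reproduce and to adapt to your own volumes. The sketch below uses only the numbers already stated in this section; the constant and function names are made up for the example.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, the price used in the table
STEPS_PER_TASK = 20
TASKS_PER_DAY = 10_000

# Approximate observation tokens per step, taken from the table.
TOKENS_PER_STEP = {
    "screenshot": 1_600,
    "accessibility_tree": 500,
    "hybrid": 600,
}


def cost_per_task(tokens_per_step: int) -> float:
    """Input-token cost in USD of the observations for one task."""
    total_tokens = tokens_per_step * STEPS_PER_TASK
    return total_tokens * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000


daily = {method: cost_per_task(tokens) * TASKS_PER_DAY
         for method, tokens in TOKENS_PER_STEP.items()}
annual_savings = (daily["screenshot"] - daily["accessibility_tree"]) * 365

for method, usd in daily.items():
    print(f"{method}: ${usd:,.0f}/day")
print(f"annual savings, tree vs screenshot: ${annual_savings:,.0f}")
```

One caveat on the model itself: it bills each observation once. An agent loop that resends the full message history on every call pays for earlier observations again at each later step, so absolute costs in such a loop are higher for every method, and the gap between methods widens accordingly.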

Latency is the other dimension. Screenshot-based agents must capture the rendered page, base64-encode the image, transmit the larger payload, and wait for the model to process the image input. Accessibility tree agents skip the capture, encoding, and image-processing stages. Empirically, this removes 0.5–2 seconds per step depending on page complexity and network conditions.

Agent Architecture

flowchart LR
    subgraph Agent Loop
        A[Task + System Prompt] --> B[Claude / LLM]
        B -->|tool_use block| C{Tool Router}
        C -->|navigate / click / type| D[Playwright MCP Server]
        C -->|screenshot — fallback only| D
        D -->|accessibility tree text| B
        B -->|end_turn| E[Final Answer]
    end

    subgraph Browser Process
        D --> F[Chromium Headless]
        F -->|Platform Accessibility API| G[Tree Snapshot]
        G --> D
    end

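The MCP server sits between the LLM and the browser process. In the default path the LLM never sees raw HTML or pixels; it sees only a structured accessibility representation of the page. This is the same semantic model that Chromium maps onto the platform accessibility APIs for assistive technology (UI Automation on Windows, ATK/AT-SPI on Linux, NSAccessibility on macOS), although automation tooling typically reads the snapshot through the browser's automation interface rather than through those OS-level APIs.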

Where Accessibility Trees Break

The efficiency gains are real. So are the failure modes.

Visual-only content. Canvas elements, WebGL rendering, SVG graphics with no ARIA annotations, and custom-drawn UI components often have zero accessibility semantics. A charting library that renders its labels inside a <canvas> tag will appear as role=img name="" in the accessibility tree — semantically useless. Screenshot fallback is necessary for these cases. The hybrid strategy handles most of them in practice.

Dynamic content and race conditions. Single-page applications update the DOM asynchronously after user actions. After the agent clicks a button that triggers a network fetch, the accessibility tree snapshot returned by the server may not yet reflect the updated state. The Playwright MCP server provides browser_wait_for(text) to wait until a specified string appears in the tree, but the agent has to know to call it. This typically requires an explicit post-navigation wait policy in the system prompt, or a wrapper that monitors tree stability before returning the observation.
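The "wrapper that monitors tree stability" can be as simple as polling until two consecutive snapshots match. The sketch below assumes a get_snapshot coroutine that returns the serialized tree as a string; that callable, the function name, and the default interval and timeout are all hypothetical, and the demo uses a fake page rather than a real browser.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_stable_tree(
    get_snapshot: Callable[[], Awaitable[str]],
    interval: float = 0.25,
    timeout: float = 5.0,
) -> str:
    """Poll until two consecutive snapshots are identical, or time runs out."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    previous = await get_snapshot()
    while loop.time() < deadline:
        await asyncio.sleep(interval)
        current = await get_snapshot()
        if current == previous:
            return current  # unchanged across one interval: treat as settled
        previous = current
    return previous  # timed out: return the latest snapshot rather than hang


# Demo with a fake page that finishes loading on the third snapshot.
_frames = iter(["button Loading", "button Loading results", "list Results"])


async def fake_snapshot() -> str:
    return next(_frames, "list Results")


stable = asyncio.run(wait_for_stable_tree(fake_snapshot, interval=0.01))
print(stable)
```

Equality of consecutive snapshots is a blunt heuristic: pages with clocks, carousels, or live counters never settle, which is why the timeout returns the latest tree instead of raising.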

Broken ARIA hygiene. A significant fraction of production websites have incorrect or missing ARIA annotations: unlabeled buttons (role=button name=""), mislabeled roles, focus traps that prevent programmatic navigation. These are endemic on older enterprise web applications. The WebArena benchmark found that roughly 30% of tasks on real websites required fallback to visual grounding precisely because of poor ARIA annotation quality. Against such targets, the accessibility tree degrades toward noise.
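One way to detect this degradation at runtime is to measure how many interactive nodes have no accessible name. The following is a heuristic sketch, not an established metric: the role set, the function names, and the 0.3 threshold are assumptions chosen for illustration and would need tuning per site.

```python
# Roles the agent is likely to act on (an illustrative subset).
INTERACTIVE_ROLES = {"button", "link", "textbox", "combobox", "checkbox"}


def unnamed_ratio(nodes: list[dict]) -> float:
    """Fraction of interactive nodes whose accessible name is empty."""
    interactive = [n for n in nodes if n["role"] in INTERACTIVE_ROLES]
    if not interactive:
        return 0.0
    unnamed = [n for n in interactive if not n.get("name", "").strip()]
    return len(unnamed) / len(interactive)


def tree_is_usable(nodes: list[dict], max_unnamed: float = 0.3) -> bool:
    """Arbitrary cutoff: above it, prefer a screenshot for this step."""
    return unnamed_ratio(nodes) <= max_unnamed


nodes = [
    {"role": "heading", "name": "Invoices"},   # not interactive, ignored
    {"role": "button", "name": ""},            # unlabeled icon button
    {"role": "button", "name": "Export"},
    {"role": "link", "name": ""},              # unlabeled link
    {"role": "textbox", "name": "Search"},
]
print(unnamed_ratio(nodes))   # 2 of 4 interactive nodes are unnamed
print(tree_is_usable(nodes))
```

A check like this gives the agent a cheap signal for when to stop trusting the tree, instead of letting it guess among several anonymous buttons.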

iframes and shadow DOM. Cross-origin iframes are isolated by browser security policy; their accessibility trees are not exposed to the parent context. Shadow DOM is partially exposed but requires explicit traversal. Many authentication flows — OAuth popups, Stripe payment widgets, reCAPTCHA challenges — live in cross-origin iframes. The Playwright MCP server silently omits these nodes, which is easy to miss during testing but obvious in production when authentication steps silently fail.

The Hybrid Strategy

The production-grade approach is a cost-ordered fallback, not a binary choice.

First, default to the accessibility tree. This handles form filling, navigation, link following, and most standard web interactions cheaply and reliably. Second, if the agent cannot locate a target element by name or role, call browser_take_screenshot and route that single step to a vision-capable model. The next step returns to the tree. Third, for CAPTCHA challenges, strong visual verification, or multi-step authentication through cross-origin iframes, escalate to a human-in-the-loop step or a specialized solver service.
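The three tiers above can be expressed as a small per-step decision function. This is a sketch of the policy only: the Tier enum, choose_tier, and the two boolean inputs are hypothetical, and deciding whether a target was found or a human is needed is left to the surrounding agent.

```python
from enum import Enum


class Tier(Enum):
    TREE = "accessibility_tree"
    SCREENSHOT = "screenshot"
    HUMAN = "human_in_the_loop"


def choose_tier(target_found_in_tree: bool, needs_human: bool) -> Tier:
    """Pick the cheapest observation tier that can handle this step."""
    if needs_human:                # CAPTCHA, cross-origin authentication, etc.
        return Tier.HUMAN
    if not target_found_in_tree:   # canvas content, unlabeled controls
        return Tier.SCREENSHOT
    return Tier.TREE


# Each step is decided independently, so a screenshot step
# does not change the default for the step that follows it.
steps = [
    {"label": "fill email field", "found": True, "human": False},
    {"label": "click canvas chart", "found": False, "human": False},
    {"label": "solve CAPTCHA", "found": False, "human": True},
    {"label": "click Submit", "found": True, "human": False},
]
plan = [choose_tier(s["found"], s["human"]) for s in steps]
for step, tier in zip(steps, plan):
    print(f'{step["label"]}: {tier.value}')
```

Keeping the decision stateless is what makes the fallback cost-ordered: the expensive tiers are paid for only on the steps that need them.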

This tiered approach matches what Browser Use reports — 89.1% task success on the WebVoyager benchmark using a primarily tree-based strategy with selective screenshot use. Pure screenshot-based agents achieve 59–65% on the same benchmark at 3–5× higher token cost.

Practitioner's Lens

If you are shipping a browser automation pipeline today, the Playwright MCP server is the fastest path to a working prototype. Running npx @playwright/mcp@latest starts a local MCP server that any Claude-backed agent can connect to out of the box. The accessibility tree default handles roughly 70–80% of typical web automation tasks without any vision budget.

Plan your system prompt around the three main failure modes explicitly: tell the agent to call browser_wait_for after any click that triggers navigation, to call browser_take_screenshot when it cannot find an element in the tree, and to surface iframe-blocked steps rather than retrying silently. For enterprise targets with poor ARIA hygiene, budget for site-specific tuning or a thin HTML-scraping fallback that extracts semantic context from the raw page source when the tree is sparse.

One underappreciated advantage: the accessibility tree is deterministic and human-readable, which makes it far easier to debug and evaluate than pixel-based observations. When an agent fails on a task, you can replay the exact tree it saw at each step and reason precisely about which observation caused the wrong action. With screenshots, debugging means staring at compressed images in logs. This debuggability advantage compounds over time in production systems where failure analysis drives iteration velocity.
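Replay is straightforward to set up because each observation is plain text. A minimal sketch, assuming one JSON record per step in a JSON Lines file; the log_step and replay helpers, the record fields, and the sample observations are invented for the example.

```python
import json
import tempfile
from pathlib import Path


def log_step(path: Path, step: int, observation: str, action: dict) -> None:
    """Append one step (the tree the model saw, the action it chose) as a JSON line."""
    record = {"step": step, "observation": observation, "action": action}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def replay(path: Path) -> list[dict]:
    """Load every logged step in order for offline inspection."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "trace.jsonl"
    log_step(log_path, 1, 'textbox "Email" [ref=e1]',
             {"name": "browser_type", "input": {"ref": "e1", "text": "a@example.com"}})
    log_step(log_path, 2, 'button "Submit form" [ref=e2]',
             {"name": "browser_click", "input": {"ref": "e2"}})
    records = replay(log_path)

for r in records:
    print(r["step"], r["action"]["name"], "saw:", r["observation"])
```

Because the records are text, they can be diffed between a passing and a failing run to find the first step where the observations diverge.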

The security picture is also cleaner. Because the observation is structured text rather than rendered pixels, prompt injection via visually hidden or off-screen content is harder — the injected text must appear in an accessible element with a real role and name, which is more easily filtered than arbitrary pixel regions. Not impossible, but the attack surface is narrower.