AI browser agents have been a long-standing desire: “Book the cheapest flight from Berlin to Amsterdam for tomorrow afternoon.” For a human-centric web, such web agents are expected to act on UIs, rather than APIs. The heterogeneity of web application UIs, however, is out of scope for traditional agent methodologies, which are based on constrained domain models. Think of chess: board, pieces, and action space are clearly defined, and thus allow for comprehensive formalisation. In contrast, booking a hotel through booking.com is different from booking with kayak.com. And, fairly different from customising shoes on adidas.com. Or is it?
Frontier LLMs only recently enabled promising web agent technologies: capabilities to respond according to a consistent schema render LLMs a direct backend for virtually any agentic application. The state of a chess game could just as well be modelled with natural language (“1. e4 e5 2. Nc3 f5 *. What is the best move for White?”). The same holds for the current runtime state of a web application, alongside a web-based task (“... Button 'Submit' centred below form. How to act to place an order?”). An LLM backend response schema for web agents maps out serial interaction suggestions, e.g., “1. Enter quantity '4' into the first input, ..., n. Click the submit button”.
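Sketched in TypeScript, such a response schema might look as follows – the field names mirror the atomic action format shown later in the Actuation section, but the concrete schema is, of course, a design choice:

```typescript
// One suggested interaction step; field names mirror the atomic
// action format discussed in the Actuation section below.
type Step = {
  action: "click" | "type" | "scroll" | "focus" | "hover";
  target: string;         // element reference, e.g. a CSS selector
  data?: string | number; // payload, e.g. text to type or scroll offset
};

// The LLM backend responds with an ordered list of steps.
type AgentResponse = {
  steps: Step[];
};

// Example response for "place an order of quantity 4":
const example: AgentResponse = {
  steps: [
    { action: "type", target: "input#quantity", data: "4" },
    { action: "click", target: "button#submit" },
  ],
};
```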
High-level Web Agent Architecture
From a high-level perspective, an LLM-based web agent consists of three components: an LLM (GPT, Claude, etc.), a web browser (Playwright, Selenium, etc.), and a mediating component that ties the two together with regard to web agency: iteration and communication logic, plus the agent UI. The LLM thereby fully abstracts the domain model (web-based UI/UX). As LLMs and browsers exist ready for reuse, web agents are low-hanging fruit – pick one to join the agentic AI hype.
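As a rough sketch – assuming Playwright on the browser side and leaving the LLM and snapshot helpers abstract (captureSnapshot, suggestSteps, and applyStep are placeholders, not a concrete library API) – the mediating component boils down to a loop:

```typescript
import type { Page } from "playwright";

const MAX_ITERATIONS = 10;

type Step = { action: string; target: string; data?: string | number };

// Placeholder signatures – concrete implementations depend on the
// chosen LLM API and snapshot strategy (see the sections below).
declare function captureSnapshot(page: Page): Promise<string>;
declare function suggestSteps(
  task: string,
  snapshot: string
): Promise<{ steps: Step[]; done: boolean }>;
declare function applyStep(page: Page, step: Step): Promise<void>;

// Hypothetical mediation loop between browser and LLM backend.
async function runAgent(task: string, page: Page): Promise<void> {
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const snapshot = await captureSnapshot(page);               // browser → context
    const { steps, done } = await suggestSteps(task, snapshot); // LLM round trip
    for (const step of steps) {
      await applyStep(page, step);                              // actuate in the UI
    }
    if (done) break;                                            // task considered complete
  }
}
```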
Human to Autonomous Agents: Iteration
The web has been designed for humans. To enhance usability, it is common practice to hide complexity behind a sequence of UI states. Transitions between these states are triggered by actions performed in the UI. Consider looking for rental apartments on redfin.com: first, you specify a location and basic requirements; subsequently, you see an accordingly filtered grid of available apartments. Human and computer agents alike act as users. Norman's Seven Stages of Action, which comprehensively model the human cognition cycle, closely transfer to agentic logic.

The problem of web agent design boils down to providing the LLM backend with useful context. According to the Seven Stages of Action, there is a Gulf of Execution between the agent and the environment (how to act in the environment to reach the goal?). For LLM-based web agents, this gulf is closed by the LLM backend that suggests interactions. However, the gulf arguably shifts to the Action Planning stage (how to contextualise the environment for the LLM?).
Key Challenge: Input
LLM context is commonly sub-classified as instruction (system prompt) or input. Instructions correspond to natural language problem constraints: the LLM's role, characteristics of the problem domain, and possibly a response schema. The key challenge sits in input provision: some serialised form of the web application's runtime state, which we'll herein refer to as a snapshot. How to actually encode a snapshot most effectively, and, not least, efficiently?
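For illustration, the instruction/input split maps naturally onto a chat-style API: the system message carries the problem constraints, the user message carries the task and the snapshot. Model name and prompt wording are arbitrary here.

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function suggestSteps(task: string, snapshot: string) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o", // arbitrary choice
    messages: [
      {
        // Instruction: role, problem domain, response schema
        role: "system",
        content:
          "You are a web agent. Given a serialised snapshot of a web page, " +
          "respond with a JSON array of steps { action, target, data }.",
      },
      {
        // Input: the task and the current snapshot
        role: "user",
        content: `Task: ${task}\n\nSnapshot:\n${snapshot}`,
      },
    ],
  });

  return completion.choices[0].message.content;
}
```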
GUI Snapshots
Screenshots resemble how humans visually perceive the web at a given point in time. State-of-the-art web agents have likewise been premised on screenshots, i.e., GUI snapshots – primarily because images represent a rather cheap means of model input. Not least to subsidise image input, LLM vision APIs implement pre-processing techniques that decouple token size from image data size. A full-page GUI snapshot, which could weigh in at several megabytes, costs roughly a few thousand tokens on either the OpenAI or Anthropic API. However, pre-processing irreversibly alters image dimensions, precluding pixel precision. To be precise, web agents have in fact been premised on grounded GUI snapshots. Grounding is a technique to enhance a GUI with visual cues, such as numbered bounding boxes, that map to identifiers. The LLM backend can accordingly address interaction target elements by identifier. Identifiers must then be explicitly traced back to elements in the running application by the agent logic. For headful use cases, grounding introduces undesirable visual side effects.

GUI grounding as implemented by Browser Use1.
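A grounding step could, for instance, be implemented as a script injected into the page (e.g., via Playwright's page.evaluate()). The following is a rough sketch, assuming 'interactable' simply means common control elements:

```typescript
// Hypothetical grounding script: overlays numbered boxes on
// interactable elements and returns an index → element map that the
// agent keeps to resolve identifiers from LLM responses.
function groundPage(): Map<number, Element> {
  const registry = new Map<number, Element>();
  const targets = document.querySelectorAll("a, button, input, select, textarea");

  targets.forEach((el, index) => {
    const rect = el.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) return; // skip invisible elements

    const box = document.createElement("div");
    box.textContent = String(index);
    Object.assign(box.style, {
      position: "fixed",
      left: `${rect.left}px`,
      top: `${rect.top}px`,
      width: `${rect.width}px`,
      height: `${rect.height}px`,
      border: "2px solid red",
      color: "red",
      font: "bold 12px sans-serif",
      pointerEvents: "none", // do not intercept actual interaction
      zIndex: "999999",
    });
    document.body.appendChild(box);

    registry.set(index, el);
  });

  return registry;
}
```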
DOM Snapshots
Frontier LLM vision capabilities lag behind code interpretation capabilities. In fact, research has demonstrated strong abilities to describe and classify HTML, and even to navigate the UI it encodes2. The DOM (Document Object Model) – the browser's application state model – favourably serialises to HTML. That makes DOM snapshots a compelling alternative to GUI snapshots. DOM snapshots contain information beyond the rendered user interface. Many Human-Computer Interaction (HCI) concepts get lost, but hierarchy and technical semantics represent potent UI feature alternatives. As DOM serialisation yields text, however, DOM snapshots may amount to exhaustive LLM context: serialised DOMs of large web applications range in the hundreds of thousands of tokens. For this reason, DOM snapshots have only been used indirectly to date. The primary approach has been element extraction: filtering out a small subset of (likely) most relevant elements.
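A minimal element extraction sketch, assuming relevance simply means interactability, could look as follows:

```typescript
// Collect (likely) relevant elements and serialise each to a compact
// HTML-like line for the LLM context. "Relevant" is naively equated
// with "interactable" here.
function extractElements(doc: Document, limit = 50): string {
  const candidates = Array.from(
    doc.querySelectorAll<HTMLElement>(
      "a, button, input, select, textarea, [role=button]"
    )
  );

  return candidates
    .slice(0, limit)
    .map((el, i) => {
      const tag = el.tagName.toLowerCase();
      const text = (el.textContent ?? "").trim().slice(0, 40);
      const id = el.id ? ` id="${el.id}"` : "";
      const name = el.getAttribute("name") ? ` name="${el.getAttribute("name")}"` : "";
      return `[${i}] <${tag}${id}${name}>${text}</${tag}>`;
    })
    .join("\n");
}
```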
| | GUI Snapshots | DOM Snapshots |
|---|---|---|
| Data size | high | medium |
| Token size | low | high |
| Actuation | absolute | relative |
| In-memory | no | yes |
| Ready event | load | DOMContentLoaded |
Standalone to Augmented: Interface
Instructing AI agents with natural language quickly became the norm, and so have chat-based UIs, as known from ChatGPT. Beyond the prompt, different use cases open up a space for creative UI/UX exploration. The current state of agentic AI still implies a significant error rate. For this reason, the human-in-the-loop approach has emerged as a (temporary) best practice in HCI. As soon as a web agent iterates, the progressively browsed third-party application UIs move to the centre of attention. The agent UI, moreover, lives on top – transitioning from standalone to augmentation.
Key Challenge: Feedback
LLM round trips reintroduce time as a critical resource. Compiling a snapshot, plus the time between LLM request and response, can add up to more than a handful of seconds (!). Resting on a single UI state for that long easily exceeds a human user's attention span. Clever augmentations bridge the UI state transition gap: the most evident kind of augmentation has been in line with the copilot metaphor – sharing the model's ‘thoughts’ as feedback. Suppose a web agent is on the Wikipedia homepage. Five seconds later, it navigates to the Wikipedia article on Albert Einstein. With augmentations, the agent could bridge five seconds of idleness with a series of thoughts: “Task: When was Albert Einstein born? ... Analysing the page ... Sorting actions ... Performing actions ...”.

Agentic UI augmentation in Director.
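A bare-bones version of such an augmentation – an overlay injected into the browsed page that relays intermediate status messages – might look like this (styling and wording are placeholders):

```typescript
// Hypothetical status overlay that bridges LLM round-trip latency
// with intermediate feedback.
function createStatusOverlay(): (message: string) => void {
  const overlay = document.createElement("div");
  Object.assign(overlay.style, {
    position: "fixed",
    bottom: "16px",
    right: "16px",
    padding: "8px 12px",
    background: "rgba(0, 0, 0, 0.8)",
    color: "#fff",
    font: "13px sans-serif",
    borderRadius: "6px",
    zIndex: "999999",
  });
  document.body.appendChild(overlay);

  return (message: string) => {
    overlay.textContent = message;
  };
}

// Usage during an iteration:
// const setStatus = createStatusOverlay();
// setStatus("Task: When was Albert Einstein born?");
// setStatus("Analysing the page ...");
```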
Atomics to Compound: Actuation
A quick note on actuation: the response schema defined for the LLM backend could map to direct, atomic actions. Consider the example below:
{ "action": "scroll", "target": "body", "data": 250 }
{ "action": "focus", "target": "input#quantity" }
{ "action": "type", "target": "input#quantity", "data": 4 }
{ "action": "hover", "target": "button#submit" }
{ "action": "click", "target": "button#submit" }
Although bespoke web agents act on human-centric UIs, in a way that remains traceable for human supervision, actuation need not exactly match the granularity of human actions. Compound actions, moreover, simplify an actuation schema: secondary actions, such as scroll, focus, or hover, can be implied by any primary action. A click, for example, would inherently scroll the target element into view, focus it, and, perhaps, position a virtual cursor at its centre.
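A compound click along these lines, sketched with Playwright's locator API (the exact set of implied secondary actions is a design choice):

```typescript
import type { Page } from "playwright";

// Compound click: scroll the target into view, focus it, move a
// (virtual) cursor to its centre, then click.
async function compoundClick(page: Page, selector: string): Promise<void> {
  const target = page.locator(selector);

  await target.scrollIntoViewIfNeeded(); // implied scroll
  await target.focus();                  // implied focus

  const box = await target.boundingBox();
  if (box) {
    // implied hover: move the cursor to the element's centre
    await page.mouse.move(box.x + box.width / 2, box.y + box.height / 2);
  }

  await target.click();
}
```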
Tests to Evaluation: QA
Testability is a weak spot of LLM-driven applications. Outcomes cannot be compared against explicit expectations. Instead, evaluations have been replacing tests in the QA cycle. Borrowed from qualitative empirical methods, end-to-end evaluations score and average responses to assert average-case quality above a certain threshold. Web agents, in particular, can draw from a set of existing evaluation datasets and frameworks. To name one: Online-Mind2Web arguably represents the current gold standard. It comprises a diverse set of web-based tasks and target websites. In particular, it has been shown to cover a wider spread of difficulty than the previously popular WebVoyager dataset. Evaluation, put in a single sentence, works as follows: for every task in a dataset, apply the agent under evaluation and record success, plus, possibly, additional quantitative metrics.
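In code, that single sentence translates to a plain loop; Task, runAgentOnTask, and judgeSuccess are placeholders for a concrete dataset format, agent, and judging method:

```typescript
type Task = { id: string; website: string; description: string };

// Placeholders – to be provided by the agent under evaluation and the
// chosen judging method (e.g. an LLM-as-judge or manual review).
declare function runAgentOnTask(task: Task): Promise<{ answer: string; steps: number }>;
declare function judgeSuccess(task: Task, answer: string): Promise<boolean>;

async function evaluate(dataset: Task[]): Promise<number> {
  let successes = 0;

  for (const task of dataset) {
    const { answer, steps } = await runAgentOnTask(task); // apply the agent
    const success = await judgeSuccess(task, answer);     // record success
    if (success) successes++;
    console.log(`${task.id}: ${success ? "pass" : "fail"} (${steps} steps)`);
  }

  return successes / dataset.length; // overall success rate
}
```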
Generalist vs Specialist Agents
Operator (OpenAI), Computer Use (Anthropic), and Browser Use are among the most popular early web agent applications. These tools represent generalist web agents: agents supposed to solve any task on the entire web. While generalist web agents have been critically acclaimed, they mainly act as a proof-of-concept for web automation. Web agents constrained to certain applications or application domains bear most of the commercial value – for instance, an in-page agent that can be chatted with and asked questions about a product. Digital service providers can benefit from specialist web agents in many ways: displaying usage guidance, automating recurrent processes, or enhancing information retrieval.
Standalone vs Virtual Agents
The other rough distinction of web agents concerns the runtime environment. Commonly, generalist web agents run in fully-fledged browsers, such as Chromium, usually wrapped by browser automation frameworks. The big advantage of full browser automation is feature accessibility: a browser automation API grants access to, among other things, cross-origin application window scope, browser window dimensions, or WebRTC. The big disadvantage is portability – in particular, embeddability within web applications. Since a browser represents a standalone application, any web agent instance needs to run a browser application in the operating system's userland. Embedding web agents – generalist ones, or specialist ones supposed to work cross-origin – in arbitrary web applications as part of a service is limited to streaming a remote browser instance's UI. A similar approach is known from browser testing tools like BrowserStack.
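For reference, a minimal standalone setup with Playwright and Chromium, illustrating that both snapshot flavours discussed above are directly accessible through the automation API:

```typescript
import { chromium } from "playwright";

async function main(): Promise<void> {
  // A standalone web agent runs a full browser in the OS userland.
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage({ viewport: { width: 1280, height: 800 } });

  await page.goto("https://en.wikipedia.org");

  const gui = await page.screenshot({ fullPage: true }); // GUI snapshot (image buffer)
  const dom = await page.content();                      // DOM snapshot (serialised HTML)

  console.log(`Screenshot: ${gui.length} bytes, DOM: ${dom.length} characters`);

  await browser.close();
}

main();
```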
Build Web Agents Right in the Browser
We’ve been building Webfuse – essentially an embeddable web browser. Webfuse allows any web application to render third-party applications and run custom code in their native scope. That is, it enables virtual web agents that run within any application. Check out our early-stage proof-of-concept. Custom code augmentation in Webfuse tightly integrates with the browser extension methodology. Adjacent Webfuse use cases are Embed Anything and Zero Trust. We recently isolated the AI Agents use case, fuelled by our Automation API, to further facilitate web agent development.

Footnotes
1. As of June 2025.
2. https://arxiv.org/abs/2210.03945