AI browser agents have been a long-standing desire: “Book the cheapest flight from Berlin to Amsterdam for tomorrow afternoon.” For a human-centric web, such web agents are expected to act on UIs, rather than APIs. The heterogeneity of web application UIs, however, is out of scope for traditional agent methodologies, which are based on constrained domain models. Think of chess: board, pieces, and action space are clearly defined, and thus allow for comprehensive formalisation. In contrast, booking a hotel through booking.com is different from booking with kayak.com. And, fairly different from customising shoes on adidas.com. Or is it?
Frontier LLMs only recently enabled promising web agent technologies: capabilities to respond according to a consistent schema render LLMs a direct backend for virtually any agentic application. The state of a chess game could just as well be modelled with natural language (“1. e4 e5 2. Nc3 f5 *. What is the best move for White?”). The same holds for the current runtime state of a web application, alongside a web-based task (“... Button 'Submit' centred below form. How to act to place an order?”). An LLM backend response schema for web agents maps out serial interaction suggestions, e.g., “1. Enter quantity '4' into the first input, ..., n. Click the submit button”.
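Sketched in TypeScript, such a response schema might look as follows – the field names mirror the atomic action format shown later in the Actuation section, but the concrete schema is, of course, a design choice:

```typescript
// One suggested interaction step; field names mirror the atomic
// action format discussed in the Actuation section below.
type Step = {
  action: "click" | "type" | "scroll" | "focus" | "hover";
  target: string;         // element reference, e.g. a CSS selector
  data?: string | number; // payload, e.g. text to type or scroll offset
};

// The LLM backend responds with an ordered list of steps.
type AgentResponse = {
  steps: Step[];
};

// Example response for "place an order of quantity 4":
const example: AgentResponse = {
  steps: [
    { action: "type", target: "input#quantity", data: "4" },
    { action: "click", target: "button#submit" },
  ],
};
```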
High-level Web Agent Architecture
From a high-level perspective, an LLM-based web agent consists of three components: an LLM (GPT, Claude, etc.), a web browser (Playwright, Selenium, etc.), and a mediating component that ties the two together with regard to web agency: iteration and communication logic, plus the agent UI. The LLM thereby fully abstracts the domain model (web-based UI/UX). As LLMs and browsers exist ready for reuse, web agents are low-hanging fruit – pick one to join the agentic AI hype.
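As a rough sketch – assuming Playwright on the browser side and leaving the LLM and snapshot helpers abstract (captureSnapshot, suggestSteps, and applyStep are placeholders, not a concrete library API) – the mediating component boils down to a loop:

```typescript
import type { Page } from "playwright";

const MAX_ITERATIONS = 10;

type Step = { action: string; target: string; data?: string | number };

// Placeholder signatures – concrete implementations depend on the
// chosen LLM API and snapshot strategy (see the sections below).
declare function captureSnapshot(page: Page): Promise<string>;
declare function suggestSteps(
  task: string,
  snapshot: string
): Promise<{ steps: Step[]; done: boolean }>;
declare function applyStep(page: Page, step: Step): Promise<void>;

// Hypothetical mediation loop between browser and LLM backend.
async function runAgent(task: string, page: Page): Promise<void> {
  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const snapshot = await captureSnapshot(page);               // browser → context
    const { steps, done } = await suggestSteps(task, snapshot); // LLM round trip
    for (const step of steps) {
      await applyStep(page, step);                              // actuate in the UI
    }
    if (done) break;                                            // task considered complete
  }
}
```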
Human to Autonomous Agents: Iteration
The web has been designed for humans. To enhance usability, it is common practice to hide complexity behind a sequence of UI states. Transitions between these states are triggered by actions performed in the UI. Consider looking for rental apartments on redfin.com: first, you specify a location and basic requirements; subsequently, you see an accordingly filtered grid of available apartments. Human and computer agents alike act as users. Norman's Seven Stages of Action, which comprehensively model the human cognition cycle, closely transfer to agentic logic.

The problem of web agent design boils down to providing the LLM backend with useful context. According to the Seven Stages of Action, there is a Gulf of Execution between the agent and the environment (how to act in the environment to reach the goal?). For LLM-based web agents, this gulf is closed by the LLM backend that suggests interactions. However, the gulf arguably shifts to the Action Planning stage (how to contextualise the environment for the LLM?).
Key Challenge: Input
LLM context is commonly sub-classified as instruction (system prompt) or input. Instructions correspond to natural language problem constraints: the LLM's role, characteristics of the problem domain, and possibly a response schema. The key challenge sits in input provision: some serialised form of the web application's runtime state, which we'll herein refer to as a snapshot. How to actually encode a snapshot most effectively, and, not least, efficiently?
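For illustration, the instruction/input split maps naturally onto a chat-style API: the system message carries the problem constraints, the user message carries the task and the snapshot. Model name and prompt wording are arbitrary here.

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function suggestSteps(task: string, snapshot: string) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o", // arbitrary choice
    messages: [
      {
        // Instruction: role, problem domain, response schema
        role: "system",
        content:
          "You are a web agent. Given a serialised snapshot of a web page, " +
          "respond with a JSON array of steps { action, target, data }.",
      },
      {
        // Input: the task and the current snapshot
        role: "user",
        content: `Task: ${task}\n\nSnapshot:\n${snapshot}`,
      },
    ],
  });

  return completion.choices[0].message.content;
}
```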
GUI Snapshots
Screenshots resemble how humans visually perceive the web at a given point in time. State-of-the-art web agents have likewise been premised on screenshots, i.e., GUI snapshots – primarily because images represent a rather cheap means of model input. Not least to subsidise image input, LLM vision APIs implement pre-processing techniques that decouple token size from image data size. A full-page GUI snapshot, which could weigh in at several megabytes, costs roughly a few thousand tokens on either the OpenAI or Anthropic API. However, pre-processing irreversibly alters image dimensions, precluding pixel precision. To be precise, web agents have in fact been premised on grounded GUI snapshots. Grounding is a technique to enhance a GUI with visual cues, such as numbered bounding boxes, that map to identifiers. The LLM backend can accordingly address interaction target elements by identifier. Identifiers must then be explicitly traced back to elements in the running application by the agent logic. For headful use cases, grounding introduces undesirable visual side effects.

GUI grounding as implemented by Browser Use1.
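A grounding step could, for instance, be implemented as a script injected into the page (e.g., via Playwright's page.evaluate()). The following is a rough sketch, assuming 'interactable' simply means common control elements:

```typescript
// Hypothetical grounding script: overlays numbered boxes on
// interactable elements and returns an index → element map that the
// agent keeps to resolve identifiers from LLM responses.
function groundPage(): Map<number, Element> {
  const registry = new Map<number, Element>();
  const targets = document.querySelectorAll("a, button, input, select, textarea");

  targets.forEach((el, index) => {
    const rect = el.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) return; // skip invisible elements

    const box = document.createElement("div");
    box.textContent = String(index);
    Object.assign(box.style, {
      position: "fixed",
      left: `${rect.left}px`,
      top: `${rect.top}px`,
      width: `${rect.width}px`,
      height: `${rect.height}px`,
      border: "2px solid red",
      color: "red",
      font: "bold 12px sans-serif",
      pointerEvents: "none", // do not intercept actual interaction
      zIndex: "999999",
    });
    document.body.appendChild(box);

    registry.set(index, el);
  });

  return registry;
}
```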
DOM Snapshots
Frontier LLM vision capabilities lag behind code interpretation capabilities. In fact, research has demonstrated strong abilities to describe and classify HTML, and even to navigate the UI it encodes2. The DOM (Document Object Model) – the browser's application state model – favourably serialises to HTML. That makes DOM snapshots a compelling alternative to GUI snapshots. DOM snapshots contain information beyond the rendered user interface. Many Human-Computer Interaction (HCI) concepts get lost, but hierarchy and technical semantics represent potent UI feature alternatives. As DOM serialisation yields text, however, DOM snapshots may amount to exhaustive LLM context: serialised DOMs of large web applications range in the hundreds of thousands of tokens. For this reason, DOM snapshots have only been used indirectly to date. The primary approach has been element extraction: filtering out a small subset of (likely) most relevant elements.
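A minimal element extraction sketch, assuming relevance simply means interactability, could look as follows:

```typescript
// Collect (likely) relevant elements and serialise each to a compact
// HTML-like line for the LLM context. "Relevant" is naively equated
// with "interactable" here.
function extractElements(doc: Document, limit = 50): string {
  const candidates = Array.from(
    doc.querySelectorAll<HTMLElement>(
      "a, button, input, select, textarea, [role=button]"
    )
  );

  return candidates
    .slice(0, limit)
    .map((el, i) => {
      const tag = el.tagName.toLowerCase();
      const text = (el.textContent ?? "").trim().slice(0, 40);
      const id = el.id ? ` id="${el.id}"` : "";
      const name = el.getAttribute("name") ? ` name="${el.getAttribute("name")}"` : "";
      return `[${i}] <${tag}${id}${name}>${text}</${tag}>`;
    })
    .join("\n");
}
```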
| | GUI Snapshots | DOM Snapshots |
|---|---|---|
| Data size | high | medium |
| Token size | low | high |
| Actuation | absolute | relative |
| In-memory | no | yes |
| Ready event | load | DOMContentLoaded |
Standalone to Augmented: Interface
Instructing AI agents with natural language quickly became the norm, and so have chat-based UIs, as known from ChatGPT. Beyond the prompt, different use cases open up a space for creative UI/UX exploration. The current state of agentic AI still implies a significant error rate. For this reason, the human-in-the-loop approach has emerged as a (temporary) best practice in HCI. As soon as a web agent iterates, the progressively browsed third-party application UIs move to the centre of attention. The agent UI, moreover, lives on top – transitioning from standalone to augmentation.
Key Challenge: Feedback
LLM round trips reintroduce time as a critical resource. Compiling a snapshot, plus the time between LLM request and response, can add up to more than a handful of seconds (!). Resting on a single UI state for that long easily exceeds a human user's attention span. Clever augmentations bridge the UI state transition gap: the most evident kind of augmentation has been in line with the copilot metaphor – sharing the model's ‘thoughts’ as feedback. Suppose a web agent is on the Wikipedia homepage. Five seconds later, it navigates to the Wikipedia article on Albert Einstein. With augmentations, the agent could bridge five seconds of idleness with a series of thoughts: “Task: When was Albert Einstein born? ... Analysing the page ... Sorting actions ... Performing actions ...”.

Agentic UI augmentation in Director.
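A bare-bones version of such an augmentation – an overlay injected into the browsed page that relays intermediate status messages – might look like this (styling and wording are placeholders):

```typescript
// Hypothetical status overlay that bridges LLM round-trip latency
// with intermediate feedback.
function createStatusOverlay(): (message: string) => void {
  const overlay = document.createElement("div");
  Object.assign(overlay.style, {
    position: "fixed",
    bottom: "16px",
    right: "16px",
    padding: "8px 12px",
    background: "rgba(0, 0, 0, 0.8)",
    color: "#fff",
    font: "13px sans-serif",
    borderRadius: "6px",
    zIndex: "999999",
  });
  document.body.appendChild(overlay);

  return (message: string) => {
    overlay.textContent = message;
  };
}

// Usage during an iteration:
// const setStatus = createStatusOverlay();
// setStatus("Task: When was Albert Einstein born?");
// setStatus("Analysing the page ...");
```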
Atomics to Compound: Actuation
A quick note on actuation: the response schema defined for the LLM backend could map to direct, atomic actions. Consider the example below:
{ "action": "scroll", "target": "body", "data": 250 }
{ "action": "focus", "target": "input#quantity" }
{ "action": "type", "target": "input#quantity", "data": 4 }
{ "action": "hover", "target": "button#submit" }
{ "action": "click", "target": "button#submit" }
Although bespoke web agents act on human-centric UIs, in a way that remains traceable for human supervision, actuation need not exactly match the granularity of human actions. Compound actions, moreover, simplify an actuation schema: secondary actions, such as scroll, focus, or hover, can be implied by any primary action. A click, for example, would inherently scroll the target element into view, focus it, and, perhaps, position a virtual cursor at its centre.
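A compound click along these lines, sketched with Playwright's locator API (the exact set of implied secondary actions is a design choice):

```typescript
import type { Page } from "playwright";

// Compound click: scroll the target into view, focus it, move a
// (virtual) cursor to its centre, then click.
async function compoundClick(page: Page, selector: string): Promise<void> {
  const target = page.locator(selector);

  await target.scrollIntoViewIfNeeded(); // implied scroll
  await target.focus();                  // implied focus

  const box = await target.boundingBox();
  if (box) {
    // implied hover: move the cursor to the element's centre
    await page.mouse.move(box.x + box.width / 2, box.y + box.height / 2);
  }

  await target.click();
}
```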
Tests to Evaluation: QA
Testability is a weak spot of LLM-driven applications. Outcomes cannot be compared against explicit expectations. Instead, evaluations have been replacing tests in the QA cycle. Borrowed from qualitative empirical methods, end-to-end evaluations score and average responses to assert average-case quality above a certain threshold. Web agents, in particular, can draw from a set of existing evaluation datasets and frameworks. To name one: Online-Mind2Web arguably represents the current gold standard. It comprises a diverse set of web-based tasks and target websites. In particular, it has been shown to cover a wider spread of difficulty than the previously popular WebVoyager dataset. Evaluation, put in a single sentence, works as follows: for every task in a dataset, apply the agent under evaluation and record success, plus, possibly, additional quantitative metrics.
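In code, that single sentence translates to a plain loop; Task, runAgentOnTask, and judgeSuccess are placeholders for a concrete dataset format, agent, and judging method:

```typescript
type Task = { id: string; website: string; description: string };

// Placeholders – to be provided by the agent under evaluation and the
// chosen judging method (e.g. an LLM-as-judge or manual review).
declare function runAgentOnTask(task: Task): Promise<{ answer: string; steps: number }>;
declare function judgeSuccess(task: Task, answer: string): Promise<boolean>;

async function evaluate(dataset: Task[]): Promise<number> {
  let successes = 0;

  for (const task of dataset) {
    const { answer, steps } = await runAgentOnTask(task); // apply the agent
    const success = await judgeSuccess(task, answer);     // record success
    if (success) successes++;
    console.log(`${task.id}: ${success ? "pass" : "fail"} (${steps} steps)`);
  }

  return successes / dataset.length; // overall success rate
}
```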
Generalist vs Specialist Agents
Operator (OpenAI), Computer Use (Anthropic), and Browser Use are among the most popular early web agent applications. These tools represent generalist web agents: agents supposed to solve any task on the entire web. While generalist web agents have been critically acclaimed, they mainly act as a proof-of-concept for web automation. Web agents constrained to certain applications or application domains bear most of the commercial value – for instance, an in-page agent that can be chatted with and asked questions about a product. Digital service providers can benefit from specialist web agents in many ways: displaying usage guidance, automating recurrent processes, or enhancing information retrieval.
Standalone vs Virtual Agents
The other rough distinction of web agents concerns the runtime environment. Commonly, generalist web agents run in fully-fledged browsers, such as Chromium, usually wrapped by browser automation frameworks. The big advantage of full browser automation is feature accessibility: a browser automation API grants access to, among other things, cross-origin application window scope, browser window dimensions, or WebRTC. The big disadvantage is portability – in particular, embeddability within web applications. Since a browser represents a standalone application, any web agent instance needs to run a browser application in the operating system's userland. Embedding web agents – generalist ones, or specialist ones supposed to work cross-origin – in arbitrary web applications as part of a service is limited to streaming a remote browser instance's UI. A similar approach is known from browser testing tools like BrowserStack.
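For reference, a minimal standalone setup with Playwright and Chromium, illustrating that both snapshot flavours discussed above are directly accessible through the automation API:

```typescript
import { chromium } from "playwright";

async function main(): Promise<void> {
  // A standalone web agent runs a full browser in the OS userland.
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage({ viewport: { width: 1280, height: 800 } });

  await page.goto("https://en.wikipedia.org");

  const gui = await page.screenshot({ fullPage: true }); // GUI snapshot (image buffer)
  const dom = await page.content();                      // DOM snapshot (serialised HTML)

  console.log(`Screenshot: ${gui.length} bytes, DOM: ${dom.length} characters`);

  await browser.close();
}

main();
```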
Build Web Agents Right in the Browser
We’ve been building Webfuse – essentially an embeddable web browser. Webfuse allows any web application to render third-party applications and run custom code in their native scope. That is, it enables virtual web agents that run within any application. Check out our early-stage proof-of-concept. Custom code augmentation in Webfuse tightly integrates with the browser extension methodology. Adjacent Webfuse use cases are Embed Anything and Zero Trust. We recently isolated the AI Agents use case, fuelled by our Automation API, to further facilitate web agent development.

Footnotes
1. As of June 2025.
2. https://arxiv.org/abs/2210.03945