Frontier LLM capabilities have sparked an evolution of AI browser agents. In such an agent, an LLM functions as an instantaneous domain model of the web UI: expected to suggest goal-oriented interactions, it is provided with browsing context, that is, a web-based task and the current runtime state of a web application. First things first: "current runtime state of a web application" is quite an unwieldy term. We prefer the metaphor of a snapshot. But how do we compile a snapshot for use as LLM context?
Want to get started with web agents? Read our Gentle Introduction to LLM-Based Web Agents.
A Matter of Serialisation
Serialisation describes the process of converting dynamic state into a static, symbolic representation. Such a process is required whenever runtime state needs to be stored to disk or transmitted across a network. Ideally, serialised state allows for recreating the dynamic state. The receiving end may, however, interpret the state differently from the origin. For example, JSON is a serialisation format for simple JavaScript objects:
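{
  "first": 1,
  "second": 2
}

// Load the serialised state from operands.json (shown above) as a live object.
// JSON modules need a relative path and a type attribute in browsers and Node.js.
import operands from "./operands.json" with { type: "json" };

// Work with the recreated state like with any other JavaScript object.
const sum = operands.first + operands.second;
console.log(`${operands.first} + ${operands.second} = ${sum}`);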
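The round trip described above, including the caveat that the receiving end may see the state differently, can be sketched with the built-in `JSON` API. The `state` object and its fields are made up for illustration.

```javascript
// Dynamic state: a live object holding a number, a Date, and a method.
const state = {
  first: 1,
  createdAt: new Date(0),
  describe() {
    return `first is ${this.first}`;
  },
};

// Serialise: dynamic state becomes a static, symbolic representation.
const serialised = JSON.stringify(state);

// Deserialise on the receiving end.
const restored = JSON.parse(serialised);

console.log(serialised);
console.log(typeof restored.createdAt); // the Date arrives as a string
console.log("describe" in restored);    // the method is not serialised at all
```

The number survives unchanged, the date is reinterpreted as text, and the method is lost. A snapshot handed to an LLM is in the same position: what matters is what the receiver can do with it.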
Taking a web agent snapshot is a serialisation problem. However, we do not know exactly how a backend LLM interprets a snapshot, that is, the serialised state. All we care about is that it drives successful interaction. The actual problem thus boils down to optimisation: efficiently compiling the most effective snapshot.
GUI Snapshots – Screenshots
Evidently, screenshots resemble how humans perceive the GUI (graphical user interface) of a web application at a given point in time. For consistency, we refer to screenshots as GUI snapshots. Not only are GUI snapshots an obvious choice, but LLM vision APIs also subsidise image input. Roughly speaking, four characters (bytes) of text cost one estimated input token. On the LLM backend side, image input is pre-processed to reduce data size, so images inherently cost relatively few input tokens. A common full-page screenshot, five times the size of a full HD viewport, costs a four-figure number of input tokens. To generalise, a GUI snapshot can be considered to cost on the order of 1e3 (thousands) of input tokens.
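The four-bytes-per-token rule of thumb translates into a small estimator. This is a rough sketch only: `estimateTextTokens` is a hypothetical helper, and real tokenisers vary by model and content.

```javascript
// Rule of thumb from the text: about four bytes of text per input token.
const BYTES_PER_TOKEN = 4;

/** Estimate input tokens for a text payload from its UTF-8 byte size. */
function estimateTextTokens(text) {
  const byteSize = new TextEncoder().encode(text).length;
  return Math.ceil(byteSize / BYTES_PER_TOKEN);
}

const snippet = '<button type="button">Add</button>';
console.log(`${snippet.length} characters, about ${estimateTextTokens(snippet)} tokens`);
```

By the same rule, one megabyte of text comes to roughly 250,000 estimated tokens, compared with the order of 1e3 tokens for a screenshot.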
Grounded GUI Snapshots
The pre-processing inherent to LLM vision APIs irreversibly affects image dimensions. This rules out pixel-precise targeting of elements for interaction. Grounding is a measure that enhances a GUI with visual cues, so that elements can be targeted by cue instead. Browser Use draws coloured bounding boxes around interactive elements, each labelled with a unique numerical identifier. Grounding can also leverage multimodal input: Browser Use additionally provides element details for each numerical identifier as text.

Grounded GUI snapshot as implemented by Browser Use.
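The bookkeeping behind grounding can be sketched as follows. This is not Browser Use's implementation: the palette, the helper names `buildGroundingCues` and `resolveCue`, and the input shape are assumptions. In a browser, the input list could be collected with `querySelectorAll()` and `getBoundingClientRect()`.

```javascript
// Hypothetical palette; any set of distinguishable colours would do.
const PALETTE = ["#e6194b", "#3cb44b", "#4363d8", "#f58231"];

/**
 * Turn a list of interactive elements into grounding cues.
 * Each input item is a plain object { tag, text, box: { x, y, width, height } }.
 */
function buildGroundingCues(elements) {
  return elements.map((element, index) => ({
    id: index + 1,                           // unique numerical identifier
    colour: PALETTE[index % PALETTE.length], // colour of the bounding box
    box: element.box,                        // where to draw the box
    details: `[${index + 1}] <${element.tag}> ${element.text}`.trim(),
  }));
}

/** Resolve a cue id chosen by the LLM back to a click point (box centre). */
function resolveCue(cues, id) {
  const cue = cues.find((candidate) => candidate.id === id);
  if (!cue) throw new Error(`Unknown cue id: ${id}`);
  return {
    x: cue.box.x + cue.box.width / 2,
    y: cue.box.y + cue.box.height / 2,
  };
}

const cues = buildGroundingCues([
  { tag: "button", text: "Add", box: { x: 100, y: 200, width: 80, height: 40 } },
  { tag: "a", text: "Checkout", box: { x: 300, y: 20, width: 120, height: 30 } },
]);
console.log(cues.map((cue) => cue.details).join("\n")); // text sent alongside the image
console.log(resolveCue(cues, 2));                       // click point for cue 2
```

The LLM only ever answers with an identifier; the agent maps it back to page coordinates, so rescaling of the screenshot on the backend no longer matters.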
DOM Snapshots – Runtime HTML
Unlike humans, LLMs can work in a modality- and language-agnostic way. Text has, in fact, been their primary means of input: natural and formal language artefacts make up the vast majority of training data. Research supports LLMs' great success with describing and classifying HTML, and also with navigating the UI it encodes. The DOM (document object model) is web browsers' uniform runtime state model of a web application. Conveniently, a serialised DOM is isomorphic to HTML, so DOM snapshots pose a promising alternative to GUI snapshots.
Serialisation of a DOM is as easy as reading from the Document API:
document.documentElement.outerHTML
DOM snapshots are text input, so input token count and byte size strongly correlate. A real-world application's DOM can serialise to millions of bytes, and thus hundreds of thousands of LLM input tokens. Using raw DOM snapshots is economically nonviable. State-of-the-art web agents therefore implement techniques that abstract information about interactivity from DOMs. The most popular approach has been element extraction. However, extraction discards HTML syntax and semantics, hierarchy in particular.
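What element extraction gives up can be seen in a minimal sketch. It works on a DOM-like tree of plain objects rather than a real DOM; the node shape, the tag list, and `extractInteractiveElements` are assumptions for illustration, not any particular agent's implementation.

```javascript
// Tags treated as interactive in this sketch (an assumption, not a standard list).
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

/**
 * Walk a tree of plain objects { tag, text?, children? } and
 * collect interactive elements into a flat list.
 */
function extractInteractiveElements(node, found = []) {
  if (INTERACTIVE_TAGS.has(node.tag)) {
    found.push({ tag: node.tag, text: node.text ?? "" });
  }
  for (const child of node.children ?? []) {
    extractInteractiveElements(child, found);
  }
  return found;
}

const page = {
  tag: "section",
  children: [
    { tag: "h2", text: "Margherita" },
    { tag: "button", text: "Add" },
    { tag: "h2", text: "Capricciosa" },
    { tag: "button", text: "Add" },
  ],
};

console.log(extractInteractiveElements(page));
```

The result is two identical "Add" buttons. Which pizza each one belongs to was encoded in the surrounding hierarchy, and that is exactly what the flat list no longer carries.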
Downsampled DOM Snapshots
We recently proposed a first-of-its-kind approach to pre-process DOMs so as to subsidise DOM snapshots on the client side. DOM downsampling consolidates related concepts under the assumption that a majority of the UI features inherent to a DOM (HTML) are retained. You can think of it as image compression, just for HTML. Here's an example of a DOM, followed by its downsampled counterpart:
<section class="container" tabindex="3" required="true" type="example">
  <div class="mx-auto" data-topic="products" required="false">
    <h1>Our Pizza</h1>
    <div>
      <div class="shadow-lg">
        <h2>Margherita</h2>
        <p>
          A simple classic: mozzarella, tomatoes, and basil.
          An everyday choice!
        </p>
        <button type="button">Add</button>
      </div>
      <div class="shadow-lg">
        <h2>Capricciosa</h2>
        <p>
          A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
          A true favourite!
        </p>
        <button type="button">Add</button>
      </div>
    </div>
  </div>
</section>

<section type="example" class="container">
  # Our Pizza
  <div class="shadow-lg">
    ## Margherita
    A simple classic: mozzarella, tomatoes, and basil.
    <button type="button">Add</button>
    ## Capricciosa
    A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
    <button type="button">Add</button>
  </div>
</section>
DOM Downsampling for LLM-Based Web Agents covers the outlined approach in more detail.
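To give a feel for lossy DOM serialisation, here is a toy sketch. It is not the algorithm from the paper and does not reproduce the example above. It only applies three simple rules to a tree of plain objects: drop attributes outside an allow-list, turn headings into markdown, and unwrap `div` elements that are left without attributes or text. The node shape, the allow-list, and the `downsample` function are all assumptions.

```javascript
// Attributes kept in this sketch; the allow-list is an assumption.
const KEPT_ATTRIBUTES = new Set(["class", "type"]);
const HEADING = /^h([1-6])$/;

/** Lossily serialise a plain-object node { tag, attrs?, text?, children? }. */
function downsample(node, depth = 0) {
  const indent = "  ".repeat(depth);
  const heading = HEADING.exec(node.tag);
  if (heading) {
    // Headings become markdown, e.g. <h2>Title</h2> turns into "## Title".
    return `${indent}${"#".repeat(Number(heading[1]))} ${node.text ?? ""}`;
  }
  const attrs = Object.entries(node.attrs ?? {})
    .filter(([name]) => KEPT_ATTRIBUTES.has(name))
    .map(([name, value]) => ` ${name}="${value}"`)
    .join("");
  // A div left without attributes or text carries little signal: unwrap it.
  const unwrap = node.tag === "div" && attrs === "" && !node.text;
  const lines = (node.children ?? [])
    .map((child) => downsample(child, unwrap ? depth : depth + 1))
    .filter((line) => line !== "");
  if (unwrap) return lines.join("\n");
  if (lines.length === 0) {
    return `${indent}<${node.tag}${attrs}>${node.text ?? ""}</${node.tag}>`;
  }
  if (node.text) lines.unshift(`${indent}  ${node.text}`);
  return `${indent}<${node.tag}${attrs}>\n${lines.join("\n")}\n${indent}</${node.tag}>`;
}

const dom = {
  tag: "section",
  attrs: { class: "container", tabindex: "3" },
  children: [
    {
      tag: "div",
      attrs: { "data-topic": "products" },
      children: [
        { tag: "h1", text: "Our Pizza" },
        { tag: "button", attrs: { type: "button" }, text: "Add" },
      ],
    },
  ],
};

console.log(downsample(dom));
```

Unlike element extraction, the output still nests the button inside its section, so the hierarchy survives while the byte count shrinks.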
DOM downsampling is available on the Webfuse Automation API.
Overview: GUI Snapshots and DOM Snapshots
| | GUI Snapshots | DOM Snapshots |
|---|---|---|
| Data size | high | medium |
| Token size | low | high |
| Actuation | absolute | relative |
| In-memory | no | yes |
| Ready event | load | DOMContentLoaded |
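The "Ready event" row can be turned into a small sketch that waits for the right moment before reading the DOM. The helper names `whenEvent` and `takeDomSnapshot` are hypothetical; `readyState`, `addEventListener`, and `documentElement.outerHTML` are standard DOM APIs.

```javascript
/** Resolve once `target` dispatches `eventName` (listens only once). */
function whenEvent(target, eventName) {
  return new Promise((resolve) => {
    target.addEventListener(eventName, resolve, { once: true });
  });
}

// Ready events as listed in the table above.
const READY_EVENT = { gui: "load", dom: "DOMContentLoaded" };

/** Serialise the DOM as soon as the document has been parsed. */
async function takeDomSnapshot(doc) {
  // While parsing is still in progress, wait for DOMContentLoaded first.
  if (doc.readyState === "loading") {
    await whenEvent(doc, READY_EVENT.dom);
  }
  return doc.documentElement.outerHTML;
}

// In a browser: takeDomSnapshot(document).then((html) => console.log(html.length));
```

A GUI snapshot would instead wait for `READY_EVENT.gui` on `window`, since images and stylesheets need to be loaded before the page looks as intended.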