DOM Snapshots vs Screenshots as Web Agent Context

August 28, 2025 · 7 min read

Frontier LLM capabilities have sparked an evolution of AI browser agents. An LLM thereby functions as an instantaneous domain model backend for the web UI: expected to suggest goal-oriented interactions, it is provided with browsing context – a web-based task, and the current runtime state of a web application. First things first: current runtime state of a web application is quite an unwieldy term. We prefer the metaphor of a snapshot. But how do we compile a snapshot for use as LLM context?

Want to get started with web agents? Read our Gentle Introduction to LLM-Based Web Agents.

A Matter of Serialisation

Serialisation describes the process of converting dynamic state into a static, symbolic representation. Such a process is required whenever runtime state needs to be stored to disk or transmitted across a network. Ideally, serialised state allows the dynamic state to be recreated. The receiving end may, however, interpret that state differently than the origin did. For example, JSON is a serialisation format for simple JavaScript objects – below, an object serialised to operands.json, followed by a script that deserialises and consumes it:

{
  "first": 1,
  "second": 2
}
// main.js – deserialise operands.json (some runtimes require `with { type: "json" }`)
import operands from "./operands.json";

const sum = operands.first + operands.second;

console.log(`${operands.first} + ${operands.second} = ${sum}`);

Taking a web agent snapshot is a serialisation problem. However, we do not know exactly how a backend LLM interprets a snapshot – the serialised state. All we care about is that it drives successful interaction. The actual problem thus boils down to an optimisation problem: efficiently compiling the most effective snapshot.

GUI Snapshots – Screenshots

Evidently, screenshots resemble how humans perceive the GUI (graphical user interface) of a web application at a given point in time. For consistency, we refer to screenshots as GUI snapshots. Not only are GUI snapshots an obvious choice, but LLM vision APIs effectively subsidise image input. Roughly speaking, four characters/bytes of text cost one estimated input token¹. Image input, on the other hand, is pre-processed on the LLM backend side so as to reduce data size. Inherently, images cost relatively few input tokens. A common full-page screenshot – about five times a full HD viewport – costs a token count in the four figures. To generalise, a GUI snapshot can be considered to cost on the order of 10³ (thousands of) input tokens.
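For illustration, a full-page GUI snapshot might be captured with a browser automation library such as Playwright – an assumed tool here, not one prescribed by the article. A minimal sketch:

import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage({ viewport: { width: 1920, height: 1080 } });
await page.goto("https://example.com");

// Capture the full page, not just the visible viewport.
const screenshot = await page.screenshot({ fullPage: true, type: "png" });

// Byte size is known right away; the eventual input token count is determined
// by the vision API only after its own pre-processing (downscaling, tiling).
console.log(`GUI snapshot: ${screenshot.length} bytes`);

await browser.close();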

Grounded GUI Snapshots

Pre-processing inherent to LLM vision APIs irreversibly affects image dimensions. This rules out pixel-precise targeting of elements for interaction. Grounding is the measure of enhancing a GUI snapshot with visual cues that allow targeting elements by cue. Browser Use, for instance, adds coloured bounding boxes around interactive elements, supplemented with a unique numerical identifier each. Grounding can leverage multimodal input: Browser Use furthermore provides element details for each numerical identifier as text alongside the image.

Grounded GUI snapshot as implemented by Browser Use.
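
In the spirit of such grounding, interactive elements can be overlaid with numbered bounding boxes directly in the page before the screenshot is taken. The following in-page sketch is a simplification – the selector list and styling are illustrative assumptions, not Browser Use's actual implementation:

// Overlay numbered bounding boxes on interactive elements (runs inside the page).
const INTERACTIVE = "a, button, input, select, textarea, [role=button]";

document.querySelectorAll(INTERACTIVE).forEach((element, index) => {
  const { top, left, width, height } = element.getBoundingClientRect();

  const box = document.createElement("div");
  box.textContent = String(index);
  Object.assign(box.style, {
    position: "fixed",
    top: `${top}px`,
    left: `${left}px`,
    width: `${width}px`,
    height: `${height}px`,
    outline: "2px solid red",
    color: "red",
    font: "bold 12px sans-serif",
    pointerEvents: "none",
    zIndex: "2147483647",
  });
  document.body.appendChild(box);
});

A subsequent screenshot then carries the cues, and a text list mapping each identifier to element details can accompany the image as multimodal input.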

TL;DR

➕   Low input token size
➖   High byte size
➖   No direct element targeting

DOM Snapshots – Runtime HTML

Unlike humans, LLMs are able to work in a modality- and language-agnostic way. Text has, in fact, been their primary means of input: natural and formal language artefacts make up the vast majority of training data. Research supports LLMs' success at describing and classifying HTML, and also at navigating the UI it encodes. The DOM (Document Object Model) is web browsers' uniform runtime state model of a web application. Favourably, a serialised DOM is isomorphic to HTML; DOM snapshots thus pose a promising alternative to GUI snapshots.

Serialisation of a DOM is as easy as reading from the Document API: document.documentElement.outerHTML.
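From an automation context – say, a Playwright script, again an assumed setup – the same serialisation can be obtained via page.evaluate, and its size put into perspective with the ~4 characters per token heuristic from above:

import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://example.com");

// Serialise the runtime DOM from within the page.
const html = await page.evaluate(() => document.documentElement.outerHTML);

// Rough estimate based on the ~4 characters per input token heuristic¹.
console.log(`DOM snapshot: ${html.length} characters, ~${Math.ceil(html.length / 4)} input tokens`);

await browser.close();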

DOM snapshots are plain text input, so input token count and byte size strongly correlate. A real-world application's DOM can serialise to millions of bytes, and thus hundreds of thousands of LLM input tokens. Using raw DOM snapshots is economically nonviable. State-of-the-art web agents therefore implement techniques that heavily abstract interactivity information from the DOM. The most popular approach has been element extraction, as sketched below. However, extraction disposes of HTML syntax and semantics – hierarchy in particular.
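
A minimal extraction sketch – a deliberately naive, flat listing, not any particular agent's implementation – illustrates what gets lost:

// Extract interactive elements from the DOM into a flat, indexed list.
// The surrounding hierarchy – sections, headings, grouping – is discarded.
const INTERACTIVE = "a, button, input, select, textarea";

const extracted = [...document.querySelectorAll(INTERACTIVE)].map((element, index) => ({
  index,
  tag: element.tagName.toLowerCase(),
  text: element.textContent.trim().slice(0, 40),
  attributes: Object.fromEntries(
    Array.from(element.attributes, (attr) => [attr.name, attr.value])
  ),
}));

console.log(JSON.stringify(extracted, null, 2));

The LLM can now target elements by index, yet the context that tells it which of several identical buttons belongs to which piece of content is gone.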

Downsampled DOM Snapshots

We recently proposed a first-of-its-kind approach to pre-process DOMs so as to subsidise DOM snapshots on the client side. DOM downsampling consolidates related concepts, under the premise that the majority of UI features inherent to the DOM (HTML) are retained. You can think of it as image compression, just for HTML. Here's an example – first the original runtime HTML, then its downsampled counterpart:

<section class="container" tabindex="3" required="true" type="example">
  <div class="mx-auto" data-topic="products" required="false">
    <h1>Our Pizza</h1>
    <div>
      <div class="shadow-lg">
        <h2>Margherita</h2>
        <p>
          A simple classic: mozzarella, tomatoes and basil.
          An everyday choice!
        </p>
        <button type="button">Add</button>
      </div>
      <div class="shadow-lg">
        <h2>Capricciosa</h2>
        <p>
          A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
          A true favourite!
        </p>
        <button type="button">Add</button>
      </div>
    </div>
  </div>
</section>
<section type="example" class="container">
  # Our Pizza
  <div class="shadow-lg">
    ## Margherita
    A simple classic: mozzarella, tomatoes and basil.
    <button type="button">Add</button>
    ## Capricciosa
    A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
    <button type="button">Add</button>
  </div>
</section>

DOM Downsampling for LLM-Based Web Agents covers the outlined approach in more detail.

DOM downsampling is available on the Webfuse Automation API.

TL;DR

➕   Low to medium byte size
➕   Direct element targeting
➖   High input token size

Overview: GUI Snapshots and DOM Snapshots

                  GUI Snapshots     DOM Snapshots
Data size         high              medium
Token size        low               high
Actuation         absolute          relative
In-memory         no                yes
Ready Event       load              DOMContentLoaded

Footnotes

  1. LLM input token costs are estimates based on the API documentation of OpenAI and Anthropic.
