DOM Downsampling for LLM-Based Web Agents

Downsampling visualised for digital images and HTML

Operator (OpenAI), Director (Browserbase), Browser Use – we are currently witnessing the rise of web AI agents. The first generation of serviceable web agents was enabled by frontier LLMs, which act as instantaneous domain model backends. Here, the domain corresponds to the landscape of web application UIs.

What is a Snapshot?

Web agents provide an LLM with a task and the serialised runtime state of the currently browsed web application (e.g., a screenshot). The LLM is expected to suggest relevant actions to perform in the web application. Such a serialisation of runtime state is referred to as a snapshot, and the snapshot technique largely determines the quality of the LLM's interaction suggestions.

GUI Snapshots

Screenshots – referred to as GUI snapshots for consistency – resemble how humans visually perceive web application UIs. LLM APIs subsidise the use of image input through upstream compression. Compression, however, irreversibly alters image dimensions, which takes away pixel precision: the LLM can no longer reliably suggest interactions like “click at 100, 735”. As a workaround, early web agents used grounded GUI snapshots. Grounding means adding visual cues to the GUI, such as bounding boxes with numerical identifiers. It lets the LLM refer to specific parts of the page by identifier, so the agent can trace back interaction targets.

Grounded GUI snapshot as implemented by Browser Use.
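
The grounding idea can be pictured as a lookup table from numeric identifiers to bounding boxes. The following is a minimal sketch, not Browser Use's implementation; the `Box` type and the `ground` and `resolveClick` helpers are hypothetical names chosen for illustration.

```typescript
// Hypothetical types and helpers, for illustration only.
type Box = { x: number; y: number; width: number; height: number };
type GroundedTarget = { id: number; box: Box };

// Assign a numeric identifier to each interactive element's bounding box.
// These identifiers are what would be drawn onto the screenshot as visual cues.
function ground(boxes: Box[]): GroundedTarget[] {
  return boxes.map((box, index) => ({ id: index + 1, box }));
}

// Resolve an identifier suggested by the LLM back to a click position
// (the centre of the box) in original, uncompressed page coordinates.
function resolveClick(targets: GroundedTarget[], id: number): { x: number; y: number } {
  const target = targets.find((t) => t.id === id);
  if (!target) throw new Error(`Unknown target id: ${id}`);
  return {
    x: target.box.x + target.box.width / 2,
    y: target.box.y + target.box.height / 2,
  };
}

const targets = ground([
  { x: 50, y: 700, width: 100, height: 70 },
  { x: 300, y: 700, width: 100, height: 70 },
]);
const click = resolveClick(targets, 1); // { x: 100, y: 735 }
```

Because the LLM only has to name an identifier, the lossy image compression no longer affects where the agent actually clicks.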

DOM Snapshots

LLMs are arguably much better at understanding code than images. Research suggests that they excel at describing and classifying HTML, and also at navigating the inherent UI¹. The DOM (document object model) – a web browser's runtime state model of a web application – translates back to HTML. For this reason, DOM snapshots are a compelling alternative to GUI snapshots. They offer a handful of key advantages:

DOM snapshots connect with LLM code (HTML) interpretation abilities.
DOM snapshots can be compiled from deep clones, hidden from supervision (unlike GUI grounding).
DOM snapshots yield text input that on average consumes less bandwidth than screenshots.
DOM snapshots allow for exact programmatic targeting of elements (e.g., via CSS selectors).
DOM snapshots are available with the DOMContentLoaded event (whereas the GUI completes initial rendering with load).

Yet, DOM snapshots have a major problem: they can exhaust the model context. Whereas a GUI snapshot commonly costs a four-figure number of tokens, a raw DOM snapshot can cost hundreds of thousands of tokens. To still benefit from LLM code interpretation abilities, developers have used element extraction techniques – picking only (likely) important elements from the DOM. Element extraction flattens the DOM tree, which disregards hierarchy as a potential UI feature (how do elements relate to each other?).
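
To make the loss of hierarchy concrete, here is a small sketch of element extraction over a simplified tree. The `UINode` type, the `el` helper, and the list of interactive tags are assumptions made for this illustration; they do not represent any particular agent's extraction logic.

```typescript
// Hypothetical minimal node type; not a real DOM API.
type UINode = { tag: string; text: string; children: UINode[] };
const el = (tag: string, text = "", children: UINode[] = []): UINode => ({ tag, text, children });

// Assumed set of tags treated as interactive.
const INTERACTIVE = new Set(["a", "button", "input", "select", "textarea"]);

// Element extraction: collect interactive elements in document order.
// The result is a flat list; ancestors and siblings are discarded.
function extractInteractive(node: UINode, found: UINode[] = []): UINode[] {
  if (INTERACTIVE.has(node.tag)) found.push(el(node.tag, node.text));
  for (const child of node.children) extractInteractive(child, found);
  return found;
}

const page = el("section", "", [
  el("div", "", [el("h2", "Margherita"), el("button", "Add")]),
  el("div", "", [el("h2", "Capricciosa"), el("button", "Add")]),
]);

const extracted = extractInteractive(page);
// Two indistinguishable "Add" buttons remain: without the hierarchy,
// nothing tells the LLM which pizza each button belongs to.
```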

DOM Downsampling: A Novel Approach

To make DOM snapshots usable for web agents, client-side pre-processing is required – similar to how LLM vision APIs process image input. Downsampling is a fundamental signal processing technique that reduces data which exceeds time or space constraints, under the assumption that the majority of relevant features is retained. Picture JPEG compression as an example: put simply, a JPEG image stores only an average colour for patches of pixels. The bigger the patches, the smaller the file. Although some detail is lost, key image features – colours, edges, objects – remain recognisable up to a large patch size.
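
The patch-averaging picture can be sketched in a few lines. This mirrors the simplified description above (one mean value per patch), not actual JPEG encoding; the `averagePool` function is a hypothetical helper and assumes a rectangular greyscale image.

```typescript
// Downsample a greyscale image (rows of pixel intensities) by replacing each
// patch of `patch` x `patch` pixels with its mean value.
function averagePool(image: number[][], patch: number): number[][] {
  if (patch < 1) throw new Error("patch size must be at least 1");
  const width = Math.max(0, ...image.map((row) => row.length));
  const pooled: number[][] = [];
  for (let top = 0; top < image.length; top += patch) {
    const pooledRow: number[] = [];
    for (let left = 0; left < width; left += patch) {
      // Collect all pixels inside the current patch (edge patches may be smaller).
      const pixels = image
        .slice(top, top + patch)
        .map((row) => row.slice(left, left + patch))
        .reduce((all, part) => all.concat(part), [] as number[]);
      const sum = pixels.reduce((acc, value) => acc + value, 0);
      pooledRow.push(sum / pixels.length);
    }
    pooled.push(pooledRow);
  }
  return pooled;
}

const image = [
  [0, 0, 255, 255],
  [0, 0, 255, 255],
  [100, 200, 50, 50],
  [100, 200, 50, 50],
];
const small = averagePool(image, 2); // [[0, 255], [150, 50]]
```

Sixteen values shrink to four, yet the coarse structure (dark top-left, bright top-right) is still recognisable. A larger patch size trades more detail for a smaller result.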

We transfer the concept of downsampling to DOMs, particularly because such an approach retains HTML characteristics that might be valuable for an LLM backend. We define UI features as concepts that, to a substantial degree, facilitate LLM suggestions on how to act in the UI in order to solve related web-based tasks.

D2Snap

We recently proposed D2Snap²³ – a first-of-its-kind downsampling algorithm for DOMs. Herein, we'll briefly explain how the D2Snap algorithm works, and how it can be utilised to build efficient and performant web agents.

How it works

There are basically three types of DOM nodes, or HTML concepts, that carry redundancy: elements, text, and attributes. We defined and empirically adjusted one procedure for each of them. D2Snap downsamples at a variable ratio, configured through the procedure-specific parameters k, l, and m (∈ [0, 1]).

We used GPT-4o to create a downsampling ground truth dataset by having it classify HTML elements and score their semantics with regard to relevance for understanding the inherent UI – a UI feature degree.

Procedure: Elements

D2Snap downsamples (simplifies) elements by merging container elements like section and div together. A parameter k controls the merge ratio depending on the total DOM tree height. For competing concepts, such as element name, the ground truth determines which element's characteristics to keep by comparing UI feature scores.

Elements in content elements (p, blockquote, ...) are translated to a more comprehensive Markdown representation.

Interactive elements, definite interaction target candidates, are kept as is.
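
The container merge can be illustrated with a deliberately simplified sketch. It dissolves every container nested directly inside another container and does not model the parameter k, the Markdown translation, or the ground-truth comparison; the `TreeNode` type and the container tag list are assumptions, not the D2Snap data model.

```typescript
// Hypothetical minimal tree type; not the D2Snap data model.
type TreeNode = { tag: string; children: TreeNode[] };

// Assumed set of tags treated as containers.
const CONTAINERS = new Set(["div", "section", "article", "main"]);

// Dissolve every container that sits directly inside another container:
// its children are hoisted into the parent. Other elements stay as they are.
function mergeContainers(node: TreeNode): TreeNode {
  const merged: TreeNode[] = [];
  for (const child of node.children) {
    const simplified = mergeContainers(child);
    if (CONTAINERS.has(node.tag) && CONTAINERS.has(simplified.tag)) {
      merged.push(...simplified.children); // hoist the grandchildren
    } else {
      merged.push(simplified);
    }
  }
  return { tag: node.tag, children: merged };
}

const leaf = (tag: string): TreeNode => ({ tag, children: [] });
const tree: TreeNode = {
  tag: "section",
  children: [{
    tag: "div",
    children: [
      leaf("h1"),
      { tag: "div", children: [
        { tag: "div", children: [leaf("h2"), leaf("p"), leaf("button")] },
        { tag: "div", children: [leaf("h2"), leaf("p"), leaf("button")] },
      ] },
    ],
  }],
};

const flatter = mergeContainers(tree);
// flatter is a single section with children: h1, h2, p, button, h2, p, button
```

In this sketch the tree height drops from five levels to two, while interactive elements such as the buttons are left untouched.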

Procedure: Text

D2Snap downsamples text by dropping a fraction. Natural units of text are space-separated words, or punctuation-separated sentences. We reuse the TextRank⁴ algorithm to rank sentences in text nodes. The lowest-ranking fraction of sentences, denoted by parameter l, is dropped.
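
As a rough illustration of the text procedure, the sketch below ranks sentences with a simplified TextRank (word-overlap similarity plus a PageRank-style iteration) and drops the lowest-ranking fraction l. It is an approximation under stated assumptions, not the D2Snap implementation: the tokenisation, damping factor, iteration count, and all function names are choices made here.

```typescript
// Lower-cased set of ASCII words in a sentence (a simplifying assumption).
function words(sentence: string): Set<string> {
  return new Set(sentence.toLowerCase().match(/[a-z]+/g) ?? []);
}

// TextRank sentence similarity: shared words, normalised by sentence lengths.
function similarity(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  a.forEach((word) => { if (b.has(word)) shared += 1; });
  const norm = Math.log(a.size) + Math.log(b.size);
  return norm > 0 ? shared / norm : 0;
}

// PageRank-style scoring over the sentence similarity graph.
function rankSentences(sentences: string[], damping = 0.85, iterations = 30): number[] {
  const bags = sentences.map((sentence) => words(sentence));
  const weights = bags.map((a, i) => bags.map((b, j) => (i === j ? 0 : similarity(a, b))));
  const totals = weights.map((row) => row.reduce((sum, w) => sum + w, 0));
  let scores: number[] = sentences.map(() => 1);
  for (let step = 0; step < iterations; step++) {
    const previous = scores;
    scores = previous.map((_, i) => {
      let incoming = 0;
      weights.forEach((row, j) => {
        const total = totals[j] ?? 0;
        if (total > 0) incoming += ((row[i] ?? 0) / total) * (previous[j] ?? 0);
      });
      return 1 - damping + damping * incoming;
    });
  }
  return scores;
}

// Drop the lowest-ranking fraction `l` of sentences, keeping the original order.
function dropLowestRanked(text: string, l: number): string {
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const scores = rankSentences(sentences);
  const dropCount = Math.floor(l * sentences.length);
  const dropped = new Set(
    sentences
      .map((_, i) => i)
      .sort((a, b) => (scores[a] ?? 0) - (scores[b] ?? 0))
      .slice(0, dropCount),
  );
  return sentences.filter((_, i) => !dropped.has(i)).join(" ");
}

const shortened = dropLowestRanked(
  "Mozzarella, tomatoes and basil make a Margherita. " +
    "Mozzarella and ham make a Capricciosa. Call us today!",
  0.4,
);
// The third sentence shares no words with the others, ranks lowest, and is dropped.
```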

Procedure: Attributes

D2Snap downsamples attributes by dropping those with a name that, according to ground truth, holds a UI feature degree below a threshold. Parameter m denotes this threshold.
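
The threshold logic might look like the following sketch. The scores in `ATTRIBUTE_SCORES` are made up for illustration and are not the values from the D2Snap ground truth; `filterAttributes` is a hypothetical helper.

```typescript
// Hypothetical UI feature scores per attribute name, in [0, 1].
// These values are invented for this example, not taken from the ground truth.
const ATTRIBUTE_SCORES: Record<string, number> = {
  type: 0.7,
  tabindex: 0.6,
  class: 0.4,
  required: 0.35,
  "data-topic": 0.1,
};

// Keep only attributes whose score reaches the threshold m.
function filterAttributes(attributes: Record<string, string>, m: number): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const name of Object.keys(attributes)) {
    const score = ATTRIBUTE_SCORES[name] ?? 0; // unknown attributes count as irrelevant
    const value = attributes[name];
    if (value !== undefined && score >= m) kept[name] = value;
  }
  return kept;
}

const attrs = {
  class: "container",
  tabindex: "3",
  required: "true",
  type: "example",
  "data-topic": "products",
};
const light = filterAttributes(attrs, 0.3); // keeps class, tabindex, required, type
const heavy = filterAttributes(attrs, 0.8); // keeps nothing
```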

Check out the D2Snap paper to learn about the algorithm in-depth.

Example of a Downsampled DOM

Consider a partial DOM state, serialised as HTML:

<section class="container" tabindex="3" required="true" type="example">
  <div class="mx-auto" data-topic="products" required="false">
    <h1>Our Pizza</h1>
    <div>
      <div class="shadow-lg">
        <h2>Margherita</h2>
        <p>
          A simple classic: mozzarella, tomatoes and basil.
          An everyday choice!
        </p>
        <button type="button">Add</button>
      </div>
      <div class="shadow-lg">
        <h2>Capricciosa</h2>
        <p>
          A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
          A true favourite!
        </p>
        <button type="button">Add</button>
      </div>
    </div>
  </div>
</section>

Here are some D2Snap downsampling results, which are based on different parametric configurations. A percentage denotes the reduced size.

`k=.3, l=.3, m=.3` (55%)

<section tabindex="3" type="example" class="container" required="true">
  # Our Pizza
  <div class="shadow-lg">
    ## Margherita
    A simple classic: mozzarella, tomatoes, and basil.
    <button type="button">Add</button>
    ## Capricciosa
    A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
    <button type="button">Add</button>
  </div>
</section>

`k=.4, l=.6, m=.8` (27%)

<section>
  # Our Pizza
  <div>
    ## Margherita
    A simple classic:
    <button>Add</button>
    ## Capricciosa
    A rich taste:
    <button>Add</button>
  </div>
</section>

`k→∞, l=0, ∀m` (35%)

# Our Pizza
## Margherita
A simple classic: mozzarella, tomatoes, and basil.
An everyday choice!
<button>Add</button>
## Capricciosa
A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
A true favourite!
<button>Add</button>

An asymptotic k (a kind of 'infinite' k) completely flattens the DOM; that is, it leads to a full content linearisation similar to the reader views present in most browsers. Notably, it preserves all interactive elements like buttons – which are essential for a web agent.

AdaptiveD2Snap

Fixed parameters might not be ideal for arbitrary DOMs – sourced from a landscape of web applications. We created AdaptiveD2Snap – a wrapper for D2Snap that infers suitable parameters from a given DOM in order to hit a certain token budget.
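
One way to picture such a wrapper is a loop that downsamples more aggressively until an estimated token count fits the budget. The sketch below is an assumption-laden illustration, not the AdaptiveD2Snap implementation: `Downsampler` stands in for a D2Snap-like function driven by a single ratio, tokens are estimated as characters divided by four, and the linear stepping strategy is chosen here for simplicity.

```typescript
// Assumed stand-in for a D2Snap-like function driven by a single ratio in [0, 1].
type Downsampler = (ratio: number) => string;

// Rough token estimate (assumption: about four characters per token).
const estimateTokens = (snapshot: string): number => Math.ceil(snapshot.length / 4);

// Increase the ratio step by step until the snapshot fits the token budget.
function adaptToBudget(
  downsample: Downsampler,
  maxTokens: number,
  maxIterations = 5,
): { ratio: number; snapshot: string } {
  let ratio = 0;
  let snapshot = downsample(ratio);
  for (let i = 1; i <= maxIterations && estimateTokens(snapshot) > maxTokens; i++) {
    ratio = i / maxIterations; // step linearly from light to aggressive downsampling
    snapshot = downsample(ratio);
  }
  if (estimateTokens(snapshot) > maxTokens) {
    throw new Error("Token budget not reachable within the given iterations");
  }
  return { ratio, snapshot };
}

// Toy downsampler: keeps the leading (1 - ratio) share of a 1000-character string.
const raw = "x".repeat(1000);
const toy: Downsampler = (ratio) => raw.slice(0, Math.round(raw.length * (1 - ratio)));

const fitted = adaptToBudget(toy, 100); // budget of 100 tokens, i.e. 400 characters
// fitted.ratio is 0.6 and fitted.snapshot has 400 characters
```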

Implementation & Integration

Picture an LLM-based web agent that is premised on DOM snapshots. Implementing D2Snap is simple: deep clone the DOM and feed it to the algorithm. Then take the snapshot, that is, serialise the resulting DOM. Done.

Read our gentle introduction to AI agents for the web to get started with high-level web agent concepts.

The open source D2Snap API, provided as a package on GitHub, has the following signature:

type DOM = Document | Element | string;
type Options = {
  assignUniqueIDs?: boolean; // false
  debug?: boolean;           // true
};

D2Snap.d2Snap(
  dom: DOM,
  k: number, l: number, m: number,
  options?: Options
): Promise<string>

D2Snap.adaptiveD2Snap(
  dom: DOM,
  maxTokens: number = 4096,
  maxIterations: number = 5,
  options?: Options
): Promise<string>

Moreover, D2Snap is available on the Webfuse Automation API. Webfuse is essentially a proxy that seamlessly serves any existing web application with custom augmentations, such as a web agent widget.

const domSnapshot = await browser.webfuseSession
    .automation
    .take_dom_snapshot({ modifier: 'downsample' })

Need precise control over the underlying D2Snap invocation? Configure it exactly how you want:

const domSnapshot = await browser.webfuseSession
    .automation
    .take_dom_snapshot({
        modifier: {
            name: 'D2Snap',
            params: { hierarchyRatio: 0.6, textRatio: 0.2, attributeRatio: 0.8 }
        }
    })

Performance Evaluation

Now for the moment of truth: How does D2Snap stack up against the industry standard? We evaluated D2Snap in comparison to a grounded GUI snapshot baseline close to those used by Browser Use – coloured bounding boxes around visible interactive elements.

To evaluate snapshots isolated from specific agent logic, we crafted a dataset that spans all UI states that occur while solving a related task. We sampled our dataset from the existing Online-Mind2Web dataset.

Exemplary solution UI state trajectory for the task: “View the pricing plan for 'Business'. Specifically, we have 100 users. We need a 1PB storage quota and a 50 TB transfer quota.”

These are our key findings...

Substantial Success Rates

The results exceeded our expectations. Not only did D2Snap meet the baseline's performance – our best configuration outperformed it by a significant margin. Full linearisation matches the baseline in both success rate and order of magnitude of estimated model input tokens.

Success rate per web agent snapshot subject, evaluated across the dataset. Labels: GUI_gr.: baseline; DOM: raw DOM (cut off at ~8K tokens); k (l m): parameter values (e.g., .9 .3 .6, or just .4 if all are equal); ∞: linearisation; 8192 / 32768: token-limited AdaptiveD2Snap with the respective budget.

Containable Token and Byte Size

Even light downsampling delivers dramatic size reductions. Most D2Snap configurations average just one order of magnitude more tokens than the baseline – a massive improvement over raw DOM snapshots. Better yet, most DOMs from the dataset could actually be downsampled to the baseline's order of magnitude. And while image data balloons in file size, our text-based approach stays lean and efficient.

Left: Comparison of mean input size (tokens vs bytes) across and per subject.
Right: Estimated input token size across the dataset created by a single D2Snap evaluation subject.

Hierarchy Actually Matters

Which UI feature matters most for LLM web agent backend performance? We varied the parameter configurations to find out. Interestingly, hierarchy reveals itself as the strongest of the three assessed features. Element extraction throws away hierarchy, which suggests that downsampling is a superior technique.
