DOM Downsampling for LLM-Based Web Agents

August 18, 2025 · 11 min read
Downsampling visualised for digital images and HTML

Browser Use, Director, Operator – we are currently witnessing the rise of AI browser agents. The first iteration of serviceable web agents was enabled by frontier LLMs. In this setup, an LLM acts as an instantaneous web UI (user interface) domain model backend; in simple terms, a website UI interpreter. Traditionally, an agent would require a detailed, rigid model of UIs. Not only is this hardly feasible, but, for perspective, everyday devices already fail at fully analysing a model of chess. Instead of relying on a rigid model, web agents provide an LLM with a task and the serialised state of the currently browsed web application (e.g., a screenshot). The LLM then suggests interactions – what to do next.

What is a Web Application Snapshot?

Screenshots – for consistency reasons referred to as GUI snapshots – resemble how humans visually perceive web application UIs. LLM APIs subsidise the use of image input through upstream compression. Compression, however, irreversibly affects image dimensions, which takes away pixel precision: there is no reliable way to suggest interactions like “click at 100, 735”. As a workaround, early web agents used grounded GUI snapshots. Grounding describes adding visual cues to the GUI, such as bounding boxes with numerical identifiers. Not least, grounding lets the LLM refer to specific parts of the page by identifier, so the agent can trace back interaction targets.

Grounded GUI snapshot as implemented by Browser Use.

LLMs are arguably much better at understanding code than images. Research suggests that they excel at describing and classifying HTML, and also at navigating an inherent UI [1]. The DOM (document object model) – a web browser's runtime state model of a web application – translates back to HTML. For this reason, DOM snapshots offer a compelling alternative to GUI snapshots. DOM snapshots offer a handful of key advantages:

  1. DOM snapshots connect with LLM code (HTML) interpretation abilities.
  2. DOM snapshots can be compiled from deep clones, hidden from supervision (unlike GUI grounding).
  3. DOM snapshots yield text input that, on average, consumes less bandwidth than screenshots.
  4. DOM snapshots allow for exact programmatic targeting of elements (e.g., via CSS selectors).
  5. DOM snapshots are available with the DOMContentLoaded event (whereas the GUI completes initial rendering with load).

Yet, DOM snapshots have a major problem: they can exhaust the model context. Whereas GUI snapshots commonly cost a four-figure number of tokens, a raw DOM snapshot can cost hundreds of thousands of tokens. To connect with LLM code interpretation abilities nonetheless, developers have used element extraction techniques – picking only (likely) important elements from the DOM. Element extraction flattens the DOM tree, which disregards hierarchy as a potential UI feature (how do elements relate to each other?).
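To illustrate what flattening loses, here is a minimal sketch of element extraction over a toy tree. The `UINode` type, the `INTERACTIVE` tag set, and `extractElements` are hypothetical names chosen for illustration; they do not mirror any particular agent framework.

```typescript
// Toy node type; a hypothetical stand-in for a real DOM element.
interface UINode {
  tag: string;
  text?: string;
  children?: UINode[];
}

// Assumption: these tags count as "likely important" interaction targets.
const INTERACTIVE = new Set(["a", "button", "input", "select", "textarea"]);

// Element extraction: walk the tree in document order and
// collect matching nodes into a flat list, dropping all ancestors.
function extractElements(root: UINode): UINode[] {
  const picked: UINode[] = [];
  const visit = (node: UINode): void => {
    if (INTERACTIVE.has(node.tag)) {
      picked.push({ tag: node.tag, text: node.text });
    }
    for (const child of node.children ?? []) visit(child);
  };
  visit(root);
  return picked;
}

const page: UINode = {
  tag: "section",
  children: [
    { tag: "div", children: [{ tag: "h2", text: "Margherita" }, { tag: "button", text: "Add" }] },
    { tag: "div", children: [{ tag: "h2", text: "Capricciosa" }, { tag: "button", text: "Add" }] },
  ],
};

const flat = extractElements(page);
console.log(flat);
```

Both extracted buttons come out identical. Which pizza each one adds can only be recovered from the surrounding hierarchy, which the flat list no longer carries.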

DOM Downsampling: A Novel Approach

To enable DOM snapshots for use with web agents, client-side pre-processing is required – similar to how LLM vision APIs process image input. Downsampling is a fundamental signal processing technique that reduces data that exceeds time or space constraints, under the assumption that the majority of relevant features are retained. Picture JPEG compression as an example: put simply, a JPEG image stores only an average colour for patches of pixels. The bigger the patches, the smaller the file. Although some detail is lost, key image features – colours, edges, objects – remain recognisable up to a large patch size.
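The patch-averaging intuition can be sketched in a few lines. Note that this follows the simplified picture given above; actual JPEG encoding relies on a discrete cosine transform and quantisation rather than plain averaging. The function name `downsample` and the greyscale `number[][]` representation are assumptions made for this sketch.

```typescript
// Downsample a greyscale image by replacing each patch×patch block with its mean value.
// Blocks at the right and bottom edges may be smaller than patch×patch.
function downsample(image: number[][], patch: number): number[][] {
  const height = image.length;
  const width = height > 0 ? image[0].length : 0;
  const out: number[][] = [];
  for (let y = 0; y < height; y += patch) {
    const row: number[] = [];
    for (let x = 0; x < width; x += patch) {
      let sum = 0;
      let count = 0;
      for (let dy = 0; dy < patch && y + dy < height; dy++) {
        for (let dx = 0; dx < patch && x + dx < width; dx++) {
          sum += image[y + dy][x + dx];
          count++;
        }
      }
      row.push(sum / count);
    }
    out.push(row);
  }
  return out;
}

const image = [
  [0, 0, 255, 255],
  [0, 0, 255, 255],
  [10, 30, 100, 100],
  [20, 40, 100, 100],
];

// 16 values shrink to 4, yet the dark/bright layout stays recognisable.
console.log(downsample(image, 2)); // → [[0, 255], [25, 100]]
```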

We transfer the concept of downsampling to DOMs, particularly since such an approach retains HTML characteristics that might be valuable for an LLM backend. We define UI features as concepts that, to a substantial degree, facilitate LLM suggestions on how to act in the UI in order to solve related web-based tasks.

D2Snap

We recently proposed D2Snap [2][3] – a first-of-its-kind downsampling algorithm for DOMs. Herein, we'll briefly explain how the D2Snap algorithm works, and how it can be utilised to build efficient and performant web agents.

How it works

There are basically three types of DOM nodes and HTML concepts that carry redundancy: elements, text, and attributes. We defined and empirically adjusted three node-specific procedures. D2Snap downsamples at a variable ratio, configured through the procedure-specific parameters k, l, and m (each ∈ [0, 1]).

We used GPT-4o to create a downsampling ground truth dataset by having it classify HTML elements and score their semantics with regard to relevance for understanding the inherent UI – a UI feature degree.

Procedure: Elements

D2Snap downsamples (simplifies) elements by merging container elements like section and div together. A parameter k controls the merge ratio depending on the total DOM tree height. For competing concepts, such as element name, the ground truth determines which element's characteristics to keep by comparing UI feature scores.

Elements in content elements (p, blockquote, ...) are translated to a more comprehensive Markdown representation.

Interactive elements, definite interaction target candidates, are kept as is.
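A rough sketch of the container merging idea follows. It is not the D2Snap implementation: the `El` type, the `CONTAINERS` set, and the way k is mapped to an allowed container depth are assumptions made for illustration, and the ground-truth comparison that decides which element's characteristics survive is omitted entirely.

```typescript
interface El {
  tag: string;
  children: El[];
}

// Assumption: which tags count as pure containers.
const CONTAINERS = new Set(["div", "section", "main", "article"]);

// Number of nested container levels on the deepest path.
function containerHeight(node: El): number {
  let below = 0;
  for (const child of node.children) below = Math.max(below, containerHeight(child));
  return below + (CONTAINERS.has(node.tag) ? 1 : 0);
}

// Returns the node itself, or its hoisted children if the node is merged away.
function merge(node: El, allowed: number, depth: number): El[] {
  const level = depth + (CONTAINERS.has(node.tag) ? 1 : 0);
  const children: El[] = [];
  for (const child of node.children) children.push(...merge(child, allowed, level));
  const tooDeep = CONTAINERS.has(node.tag) && level > allowed;
  return tooDeep ? children : [{ tag: node.tag, children }];
}

// Assumed mapping: k = 0 keeps every container level, k = 1 keeps only one.
function downsampleElements(root: El, k: number): El {
  const allowed = Math.max(1, Math.round(containerHeight(root) * (1 - k)));
  return merge(root, allowed, 0)[0];
}

const el = (tag: string, ...children: El[]): El => ({ tag, children });
const pizza = el(
  "section",
  el(
    "div",
    el("h1"),
    el("div", el("div", el("h2"), el("button")), el("div", el("h2"), el("button"))),
  ),
);

// With k = 0.5, four container levels shrink to two; headings and buttons are untouched.
console.log(JSON.stringify(downsampleElements(pizza, 0.5)));
```

Non-container elements such as headings and buttons are never removed here, which mirrors the rule that interactive elements are kept as is.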

Procedure: Text

D2Snap downsamples text by dropping a fraction of it. Natural units of text are space-separated words or punctuation-separated sentences. We reuse the TextRank [4] algorithm to rank sentences in text nodes. The lowest-ranking fraction of sentences, denoted by parameter l, is dropped.
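The drop-the-lowest-ranked-fraction step can be sketched as follows. To keep the example short, the ranking is a simple word-overlap centrality score used as a stand-in for TextRank, not TextRank itself. The function names, the sentence splitting rule, and the rounding of the drop count are all assumptions.

```typescript
// Split text into sentences: runs of characters ending in ., ! or ?.
function splitSentences(text: string): string[] {
  const parts: string[] = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Lower-cased, de-duplicated words of a sentence.
function uniqueWords(sentence: string): string[] {
  const tokens: string[] = sentence.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return tokens.filter((w, i) => tokens.indexOf(w) === i);
}

// Stand-in ranking (NOT TextRank): a sentence scores higher the more
// words it shares with the other sentences, normalised by length.
function rankSentences(sentences: string[]): number[] {
  const bags = sentences.map(uniqueWords);
  return bags.map((bag, i) =>
    bags.reduce((score, other, j) => {
      if (i === j || bag.length === 0 || other.length === 0) return score;
      const shared = bag.filter((w) => other.indexOf(w) !== -1).length;
      return score + shared / (bag.length + other.length);
    }, 0),
  );
}

// Drop the lowest-ranking fraction l of sentences, keeping document order.
function downsampleText(text: string, l: number): string {
  const sentences = splitSentences(text);
  const scores = rankSentences(sentences);
  const dropCount = Math.round(sentences.length * l); // rounding is an assumption
  const byScore = sentences.map((_, i) => i).sort((a, b) => scores[a] - scores[b]);
  const dropped = new Set(byScore.slice(0, dropCount));
  return sentences.filter((_, i) => !dropped.has(i)).join(" ");
}

console.log(
  downsampleText("Pizza is baked in a wood oven. Our pizza oven is hot. Call us today!", 0.34),
);
// → "Pizza is baked in a wood oven. Our pizza oven is hot."
```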

Procedure: Attributes

D2Snap downsamples attributes by dropping those with a name that, according to ground truth, holds a UI feature degree below a threshold. Parameter m denotes this threshold.
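As a minimal sketch, attribute downsampling reduces to a threshold filter over per-name scores. The scores in `ATTRIBUTE_SCORES` are invented placeholders, not values from the D2Snap ground truth, and treating unknown attribute names as score 0 is likewise an assumption.

```typescript
// Hypothetical UI feature degrees per attribute name, each in [0, 1].
// The real values come from the ground truth described in the D2Snap paper.
const ATTRIBUTE_SCORES: Record<string, number> = {
  type: 0.7,
  required: 0.6,
  tabindex: 0.5,
  class: 0.4,
  "data-topic": 0.1,
};

// Keep only attributes whose score reaches the threshold m.
function downsampleAttributes(
  attrs: Record<string, string>,
  m: number,
): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const name of Object.keys(attrs)) {
    const score = ATTRIBUTE_SCORES[name] ?? 0; // assumption: unknown names score 0
    if (score >= m) kept[name] = attrs[name];
  }
  return kept;
}

const attrs = { class: "mx-auto", "data-topic": "products", required: "false" };
console.log(downsampleAttributes(attrs, 0.3)); // keeps class and required
console.log(downsampleAttributes(attrs, 0.8)); // keeps nothing
```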

Check out the D2Snap paper to learn about the algorithm in-depth.

Example

Consider a partial DOM state, serialised as HTML:

<section class="container" tabindex="3" required="true" type="example">
  <div class="mx-auto" data-topic="products" required="false">
    <h1>Our Pizza</h1>
    <div>
      <div class="shadow-lg">
        <h2>Margherita</h2>
        <p>
          A simple classic: mozzarella, tomatoes, and basil.
          An everyday choice!
        </p>
        <button type="button">Add</button>
      </div>
      <div class="shadow-lg">
        <h2>Capricciosa</h2>
        <p>
          A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
          A true favourite!
        </p>
        <button type="button">Add</button>
      </div>
    </div>
  </div>
</section>

Here are some D2Snap downsampling results, which are based on different parametric configurations. A percentage denotes the reduced size.

k=.3, l=.3, m=.3 (55%)

<section tabindex="3" type="example" class="container" required="true">
  # Our Pizza
  <div class="shadow-lg">
    ## Margherita
    A simple classic: mozzarella, tomatoes, and basil.
    <button type="button">Add</button>
    ## Capricciosa
    A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
    <button type="button">Add</button>
  </div>
</section>

k=.4, l=.6, m=.8 (27%)

<section>
  # Our Pizza
  <div>
    ## Margherita
    A simple classic:
    <button>Add</button>
    ## Capricciosa
    A rich taste:
    <button>Add</button>
  </div>
</section>

k→∞, l=0, ∀m (35%)

# Our Pizza
## Margherita
A simple classic: mozzarella, tomatoes, and basil.
An everyday choice!
<button>Add</button>
## Capricciosa
A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
A true favourite!
<button>Add</button>

An asymptotic k (in effect, an 'infinite' k) completely flattens the DOM; that is, it leads to a full content linearisation similar to the reader views present in most browsers. Notably, it preserves all interactive elements like buttons – which are essential for a web agent.

AdaptiveD2Snap

Fixed parameters might not be ideal for arbitrary DOMs – sourced from a landscape of web applications. We created AdaptiveD2Snap – a wrapper for D2Snap that infers suitable parameters from a given DOM in order to hit a certain token budget.
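One plausible way to hit a token budget is to downsample progressively harder until the result fits. The sketch below assumes exactly that; it is not the actual AdaptiveD2Snap logic. The `Downsampler` type, the linear parameter schedule, and the four-characters-per-token estimate are all assumptions.

```typescript
// Any function with a D2Snap-like shape; injected so the sketch stays self-contained.
type Downsampler = (html: string, k: number, l: number, m: number) => Promise<string>;

// Rough heuristic: about four characters per token. An assumption, not a tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Raise all three parameters step by step until the snapshot fits the budget.
async function adaptiveDownsample(
  html: string,
  downsample: Downsampler,
  maxTokens = 4096,
  maxIterations = 5,
): Promise<string> {
  for (let i = 1; i <= maxIterations; i++) {
    const ratio = i / maxIterations; // 0.2, 0.4, ..., 1.0 with the defaults
    const snapshot = await downsample(html, ratio, ratio, ratio);
    if (estimateTokens(snapshot) <= maxTokens) return snapshot;
  }
  throw new Error(`Could not fit the snapshot into ${maxTokens} tokens`);
}
```

Throwing when the budget cannot be met is a design choice of this sketch; returning the smallest snapshot found would be an equally reasonable alternative.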

Implementation & Integration

Picture an LLM-based web agent that is premised on DOM snapshots. Implementing D2Snap is simple: deep clone the DOM and feed it to the algorithm. Then take the snapshot; that is, serialise the resulting DOM. Done.

Read our Gentle Introduction to LLM-Based Web Agents to grasp high-level web agent methodology.

The open source D2Snap API provides the following signature:

type DOM = Document | Element | string;
type Options = {
  assignUniqueIDs?: boolean; // default: false
  debug?: boolean;           // default: true
};

D2Snap.d2Snap(
  dom: DOM,
  k: number, l: number, m: number,
  options?: Options
): Promise<string>

D2Snap.adaptiveD2Snap(
  dom: DOM,
  maxTokens: number = 4096,
  maxIterations: number = 5,
  options?: Options
): Promise<string>
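A call site might look like the sketch below. To stay self-contained it does not import the library; instead, `D2SnapLike` restates the signature above, narrowed to string input. `D2SnapLike`, `takeSnapshot`, and the chosen parameter values are assumptions for illustration, and the exact effect of each option should be checked against the repository [3].

```typescript
// Restates the signature above, narrowed to serialised HTML input.
// This interface is a hypothetical stand-in for the imported library.
interface SnapOptions {
  assignUniqueIDs?: boolean;
  debug?: boolean;
}

interface D2SnapLike {
  d2Snap(dom: string, k: number, l: number, m: number, options?: SnapOptions): Promise<string>;
  adaptiveD2Snap(
    dom: string,
    maxTokens?: number,
    maxIterations?: number,
    options?: SnapOptions,
  ): Promise<string>;
}

// Use the adaptive variant when a token budget is given, fixed parameters otherwise.
async function takeSnapshot(
  api: D2SnapLike,
  html: string,
  tokenBudget?: number,
): Promise<string> {
  const options: SnapOptions = { assignUniqueIDs: true, debug: false };
  if (tokenBudget !== undefined) {
    return api.adaptiveD2Snap(html, tokenBudget, 5, options);
  }
  // Illustrative values only; suitable ratios depend on the page.
  return api.d2Snap(html, 0.3, 0.3, 0.3, options);
}
```

Passing the API object in as a parameter also makes it easy to swap in a mock when testing agent logic.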

Performance Evaluation

Now for the moment of truth: How does D2Snap stack up against the industry standard? We evaluated D2Snap in comparison to a grounded GUI snapshot baseline close to those used by Browser Use – coloured bounding boxes around visible interactive elements.

To evaluate snapshots isolated from specific agent logic, we crafted a dataset that spans all UI states that occur while solving a related task. We sampled our dataset from the existing Online-Mind2Web dataset.

Exemplary solution UI state trajectory for the task: “View the pricing plan for 'Business'. Specifically, we have 100 users. We need a 1PB storage quota and a 50 TB transfer quota.”

These are our key findings...

Substantial Success Rates

The results exceeded our expectations. Not only did D2Snap meet the baseline's performance – our best configuration outperformed it by a significant margin. Full linearisation matches the baseline in both performance and the order of magnitude of the estimated model input token size.

Success rate per web agent snapshot subject evaluated across the dataset. Labels: GUI gr.: baseline; DOM: raw DOM (cut off at ~8K tokens); k (l m): parameter values (e.g., .9 .3 .6, or .4 if all equal); ∞: linearisation; 8192 / 32768: token-limited via AdaptiveD2Snap, respectively.

Containable Token and Byte Size

Even light downsampling delivers dramatic size reductions. Most D2Snap configurations average just one order of magnitude in tokens above the baseline – a massive improvement over raw DOM snapshots. Better yet, most DOMs from the dataset could actually be downsampled to the baseline order of magnitude. And while image data balloons in file size, our text-based approach stays lean and efficient.

Left: Comparison of mean input size (tokens vs bytes) across and per subject.
Right: Estimated input token size across the dataset created by a single D2Snap evaluation subject.

Hierarchy Actually Matters

Which UI feature matters most for LLM web agent backend performance? We varied parameter configurations to find out. Interestingly, hierarchy reveals itself as the strongest of the three assessed features. Element extraction throws away hierarchy, which suggests that downsampling is a superior technique.

Ready to Build the Future of Web Agents?

D2Snap is production-ready technology that's already transforming how developers build web agents. Webfuse is essentially an in-app web browser that can be programmed to build web agents for any website. Effortlessly downsample with Webfuse's Automation API:

const domSnapshot = await browser.webfuseSession
    .automation
    .take_dom_snapshot({ modifier: 'downsample' })

Need precise control over the underlying D2Snap invocation? Configure it exactly how you want:

const domSnapshot = await browser.webfuseSession
    .automation
    .take_dom_snapshot({
        modifier: {
            name: 'D2Snap',
            params: { hierarchyRatio: 0.6, textRatio: 0.2, attributeRatio: 0.8 }
        }
    })

The web agent revolution is here. While others struggle with expensive snapshot techniques, build faster, cheaper, and more intelligent agents yourself with the help of Webfuse.

Footnotes

  1. https://arxiv.org/abs/2210.03945
  2. https://arxiv.org/abs/2508.04412
  3. https://github.com/surfly/D2Snap
  4. https://aclanthology.org/W04-3252
