In the previous articles, we explored the architectural principles required to build a voice agent capable of providing real-time, interactive assistance on any website. In Part 1, we examined three approaches to agent interaction, and in Part 2, we established that the most effective approach is to embed the agent directly within the user's live browser session using a platform like Webfuse. This client-side model, powered by Virtual Web Sessions and Session Extensions, creates the ideal environment for an AI to act as a true co-pilot.
Now, we move from the architectural "how" to the practical "what." An agent, even when correctly placed within a session, is only as capable as the tools it has at its disposal. To be genuinely helpful, it needs digital senses to perceive its environment and digital limbs to act within it. This article is a practical guide to these essential capabilities, focusing on the two primary categories of tools provided by the Webfuse Automation API:
- Perception Tools: These are the agent's "eyes." They allow the AI to analyze and understand the content and structure of the webpage the user is currently viewing. Without perception, the agent is blind, unable to provide contextually relevant assistance or identify the correct targets for its actions.
- Action Tools: These are the agent's "hands." They enable the AI to perform direct interactions on the webpage, such as clicking buttons, filling out forms, and navigating between pages. These tools translate the agent's intent into tangible, visible results for the user.
This guide will break down the specific commands within each category, explaining what they do and why they are important for building a capable and intelligent voice assistant. We will begin with the foundational capability that must come before any action can be taken: perception.
Perception: Giving the Agent "Eyes" to See the Webpage
Before a voice agent can assist a user, it must first answer a fundamental question: "What is on the screen right now?" Without this understanding, any action it takes would be a blind guess. While a human can instantly grasp the layout of a webpage, an AI needs a machine-readable representation of that same environment. Simply feeding the raw HTML source code to a large language model is inefficient and often ineffective. Modern webpages are incredibly complex, filled with thousands of lines of code, tracking scripts, and styling information that are irrelevant to the user's immediate task. This "noise" can easily overwhelm an LLM's context window, making it difficult to extract the meaningful information needed to act.
To solve this, the Webfuse Automation API provides a specialized perception tool designed to translate a complex webpage into a clean, structured format optimized for AI analysis.
The Primary Perception Tool: take_dom_snapshot()
The core of the agent's perceptual ability is the take_dom_snapshot() command. This tool does far more than just copy the page's HTML. It intelligently processes the live Document Object Model (DOM) and serializes it into a more succinct and logical format. Think of it as creating a "map" of the webpage that highlights the important landmarks—like input fields, buttons, and blocks of text—while omitting the unnecessary clutter.
This tool is highly configurable to adapt to different situations:
- DOM Downsampling: To manage the size and complexity of the output, the command can be run with a 'downsample' modifier. This feature uses an algorithm to intelligently reduce the DOM's density, stripping away nested decorative elements while preserving the core structure and interactive components. This is essential for ensuring the agent gets a clear picture of the page that fits within its processing limits.
- Targeted Snapshots: An agent doesn't always need to see the entire page. If a user's request relates to a specific area, like a login form or a shopping cart, the rootSelector option can be used. This parameter instructs the tool to capture only a specific portion of the DOM, providing a highly focused view that makes the agent's decision-making process faster and more accurate.
- Handling Sensitive Information: In many scenarios, a webpage may contain sensitive data that has been masked by another Webfuse application for security or compliance. By default, take_dom_snapshot() respects this masking and excludes the sensitive data from its output. However, for trusted automation workflows, a revealMaskedElements flag can be set, allowing the agent to "see" the masked content if its task requires it.
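To make these options concrete, here is a minimal sketch of how an agent-side script might invoke the command from a Session Extension. The automation wrapper object, the TypeScript option types, and the return shape are assumptions for illustration; only the command name and the downsample, rootSelector, and revealMaskedElements options come from the description above.

```typescript
// Hypothetical client wrapper around the Automation API.
// Only the command name and option names come from the text above;
// the `automation` object, option types, and return shape are assumptions.
interface DomSnapshotOptions {
  downsample?: boolean;           // reduce DOM density to fit the LLM context window
  rootSelector?: string;          // capture only a portion of the page
  revealMaskedElements?: boolean; // opt in to seeing masked content (trusted flows only)
}

declare const automation: {
  take_dom_snapshot(options?: DomSnapshotOptions): Promise<string>;
};

// Focused, downsampled snapshot of a checkout form.
const snapshot = await automation.take_dom_snapshot({
  downsample: true,
  rootSelector: "#checkout-form",
});

// The serialized snapshot can now be placed into the LLM prompt as page context.
console.log(snapshot.length, "characters of page context");
```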
The Visual Perception Tool: take_gui_snapshot()
While a DOM snapshot provides structural understanding, it doesn't capture the visual presentation of the page—how it actually looks to the user. For this, the Automation API provides a second perception tool: take_gui_snapshot(). This command is straightforward: it takes a screenshot of the current view and returns it as an image.
This tool becomes particularly powerful when working with multimodal LLMs that can process both text and images. A visual snapshot can help the agent understand:
- Styling and Layout: It can see which buttons are highlighted, which text is prominent, and how elements are arranged visually on the screen.
- Non-DOM Content: It can perceive content rendered in a <canvas> element, such as charts, graphs, or custom product viewers, which are invisible in a DOM snapshot.
- Stateful UI Elements: It can confirm the visual state of an element, such as whether a checkbox is checked or a toggle is active.
By combining the structural map from take_dom_snapshot() with the visual context from take_gui_snapshot(), an agent can build a comprehensive and nuanced understanding of the user's environment. With this clear perception established, it is now ready to take meaningful action.
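As a rough sketch, the snippet below shows how both snapshots might be gathered and packaged as context for a multimodal model. The automation wrapper, the screenshot encoding, and the message format are illustrative assumptions; only take_dom_snapshot and take_gui_snapshot are taken from the API described here.

```typescript
// Combining structural and visual perception before asking the LLM to decide.
// The `automation` wrapper and message shape are illustrative assumptions.
declare const automation: {
  take_dom_snapshot(options?: { downsample?: boolean }): Promise<string>;
  take_gui_snapshot(): Promise<string>; // assume a base64-encoded image is returned
};

async function buildPerceptionContext() {
  const [dom, screenshot] = await Promise.all([
    automation.take_dom_snapshot({ downsample: true }),
    automation.take_gui_snapshot(),
  ]);

  // A multimodal model gets both: text for structure, image for visual layout.
  return [
    { type: "text", text: `Current page structure:\n${dom}` },
    { type: "image", imageBase64: screenshot },
  ];
}
```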
Action: Giving the Agent "Hands" to Interact with the Webpage
Once an agent has a clear understanding of the webpage, the next logical step is to take action. Perception without the ability to act is passive; it is the combination of the two that creates a truly helpful assistant. The Webfuse Automation API provides a suite of action tools designed to simulate the fundamental ways a human user interacts with a website: moving the mouse, clicking on elements, and typing on the keyboard.
These tools are intentionally high-level. They abstract away the complexities of low-level browser scripting, allowing the agent to operate based on intent. For example, instead of needing to script the creation and dispatch of a synthetic mouse event, the agent can simply issue a left_click command. This approach is not only simpler but also more robust, as it mimics user behavior more closely.
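The difference is easiest to see side by side. The snippet below contrasts hand-rolled synthetic-event scripting with the single high-level command; the exact left_click call shape is an assumption, while the low-level half uses standard browser APIs.

```typescript
// Low-level approach: manually constructing and dispatching a synthetic event.
const button = document.querySelector<HTMLButtonElement>("#submit-order");
button?.dispatchEvent(new MouseEvent("click", { bubbles: true, cancelable: true }));

// High-level approach: the agent states its intent and the API handles the rest.
// (Call shape assumed; the left_click command itself is described in this article.)
declare const automation: { left_click(target?: string): Promise<void> };
await automation.left_click("#submit-order");
```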
The Targeting Mechanism: Specifying What to Interact With
Before any action can be performed, the agent must specify what to interact with. The Automation API provides a flexible targeting system that can be used with most action commands:
- Virtual Mouse Pointer: The API maintains the position of a virtual mouse cursor within the session. The agent can first move this cursor to a specific location and then perform an action (like a click) at that spot. This is useful for interacting with elements that are difficult to identify with a selector, such as items on a graphical canvas.
- CSS Selectors: The most common method of targeting is to provide a standard CSS selector. After analyzing the DOM snapshot, the agent can generate a precise selector (e.g., #login-button or input[name='email']) to identify the exact element it needs to interact with.
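In code, a target might therefore be expressed as either a coordinate pair or a selector string. The union type below is purely illustrative; the two targeting modes themselves are the ones described above.

```typescript
// Two ways to tell an action command what to act on (shape assumed for illustration).
type Target =
  | { x: number; y: number } // coordinates for the virtual mouse pointer
  | string;                  // a CSS selector derived from the DOM snapshot

// A selector target, generated after inspecting the DOM snapshot:
const loginButton: Target = "#login-button";

// A coordinate target, useful for canvas content with no addressable DOM element:
const chartPoint: Target = { x: 420, y: 310 };
```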
Simulating Mouse Interactions
The most common way users interact with the web is by pointing and clicking. The Automation API provides a set of tools to replicate this behavior seamlessly.
- mouse_move(target): This is the primary command for positioning the agent's focus. It moves the virtual mouse pointer to the element or coordinates specified by the target. This is often the first step in a multi-part action, setting the stage for a subsequent click or scroll.
- left_click([target]): This is the workhorse of web interaction. It performs a standard left-click on the specified target. If no target is provided, it clicks at the current location of the virtual mouse pointer. This command is used for everything from activating buttons and following links to focusing input fields before typing.
- middle_click([target]) and right_click([target]): While less common, these tools provide the full range of mouse capabilities, allowing the agent to open links in new tabs (middle-click) or access context menus (right-click) when a specific workflow requires it.
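A typical pointer interaction chains these commands together: position the virtual cursor, then click. The sketch below assumes a hypothetical automation wrapper; the command names and their target semantics follow the list above.

```typescript
// A typical pointer sequence: position the virtual cursor, then click.
// The `automation` wrapper and call shapes are assumptions.
declare const automation: {
  mouse_move(target: string | { x: number; y: number }): Promise<void>;
  left_click(target?: string): Promise<void>;
  right_click(target?: string): Promise<void>;
};

// Move to the "Add to cart" button, then click at the cursor's position.
await automation.mouse_move("button.add-to-cart");
await automation.left_click();

// Or click a selector directly, skipping the explicit move.
await automation.left_click("a#view-cart");
```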
Simulating Keyboard Input
For any task involving data entry, the agent needs the ability to type.
- type(text, [target]): This command allows the agent to enter text into a form field. It simulates natural human typing, sending a sequence of key presses rather than just programmatically setting the element's value. This ensures that any JavaScript event listeners on the website (e.g., for input validation or auto-completion) are triggered correctly.
- key_press(key, [target], options): For more granular control, this tool can be used to simulate a single key press, such as hitting 'Enter' to submit a form or 'Tab' to move to the next field. It also supports modifier keys like Shift or Ctrl.
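For example, filling a login form might combine both commands: type into each field, then press Enter to submit. The wrapper object and the shape of the modifier options are assumptions; the command names and parameters are those listed above.

```typescript
// Filling a login form with simulated typing, then submitting with Enter.
// Wrapper object and the modifier-option shape are assumptions.
declare const automation: {
  type(text: string, target?: string): Promise<void>;
  key_press(
    key: string,
    target?: string,
    options?: { shift?: boolean; ctrl?: boolean } // assumed modifier format
  ): Promise<void>;
};

await automation.type("jane.doe@example.com", "input[name='email']");
await automation.type("correct horse battery staple", "input[name='password']");

// Submit the form the way a person would: by pressing Enter in the last field.
await automation.key_press("Enter", "input[name='password']");
```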
Controlling the Viewport
Webpages are often longer than the visible screen. The agent needs to be able to find content that is "below the fold."
- scroll(direction, amount, [target]): This command scrolls the page or a specific scrollable element within the page. The agent can use this to navigate down a long article, browse through a product list, or ensure that a specific form field is visible before attempting to interact with it.
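A common pattern is to scroll a container and then re-snapshot it to see what has come into view, as in the sketch below (call shapes assumed; the scroll amount is treated as pixels for illustration).

```typescript
// Scrolling a long results panel, then re-checking what is now visible.
// The `automation` wrapper and the unit of `amount` are assumptions.
declare const automation: {
  scroll(direction: "up" | "down" | "left" | "right", amount: number, target?: string): Promise<void>;
  take_dom_snapshot(options?: { rootSelector?: string }): Promise<string>;
};

// Scroll the results panel down, then take a fresh, focused snapshot.
await automation.scroll("down", 600, "#results-panel");
const visible = await automation.take_dom_snapshot({ rootSelector: "#results-panel" });
```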
Together, these perception and action tools form a complete feedback loop. The agent can see the state of the page, decide on the next logical step, and act on that decision. This cycle of perceive-decide-act is the fundamental process that transforms a conversational AI into a capable, agentic partner, ready to assist users with any task on any website.
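Put together, that loop can be expressed in a few lines. The sketch below is a minimal illustration, assuming a hypothetical automation wrapper and an LLM helper (decideNextStep) that returns one tool call per iteration; it is not the Webfuse implementation itself.

```typescript
// A minimal perceive-decide-act loop. The `automation` wrapper and the
// `decideNextStep` LLM helper are hypothetical; command names follow the article.
type ToolCall =
  | { tool: "left_click"; target: string }
  | { tool: "type"; text: string; target: string }
  | { tool: "scroll"; direction: "up" | "down"; amount: number }
  | { tool: "done" };

declare const automation: {
  take_dom_snapshot(options?: { downsample?: boolean }): Promise<string>;
  left_click(target?: string): Promise<void>;
  type(text: string, target?: string): Promise<void>;
  scroll(direction: string, amount: number): Promise<void>;
};
declare function decideNextStep(goal: string, snapshot: string): Promise<ToolCall>;

async function runTask(goal: string) {
  for (let step = 0; step < 20; step++) {                                      // cap steps to avoid endless loops
    const snapshot = await automation.take_dom_snapshot({ downsample: true }); // perceive
    const next = await decideNextStep(goal, snapshot);                         // decide
    if (next.tool === "done") return;                                          // act
    if (next.tool === "left_click") await automation.left_click(next.target);
    if (next.tool === "type") await automation.type(next.text, next.target);
    if (next.tool === "scroll") await automation.scroll(next.direction, next.amount);
  }
}
```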
Conclusion: From a Set of Tools to a Capable Agent
The perception and action tools detailed in this guide—from take_dom_snapshot to left_click and type—are more than just a list of functions. Together, they form a complete and logical system that enables a voice agent to intelligently interact with any web environment. They provide the essential building blocks for the core operational cycle of any effective agent: perceive, decide, and act.
- Perceive: Using take_dom_snapshot, the agent gains a clear, structured understanding of the user's current view, allowing it to see the available options and identify the relevant elements for a given task.
- Decide: With this contextual information, the agent's underlying large language model can make an informed decision about the next logical step required to fulfill the user's request.
- Act: The agent then executes that decision using the appropriate action tool, whether it's clicking a button, typing into a form, or scrolling the page.
This continuous loop transforms the agent from a passive conversationalist into an active participant. It is no longer following a rigid, pre-programmed script that is bound to fail when a website's layout changes. Instead, it can dynamically adapt to the live state of the page, making it a far more resilient and reliable assistant.
For the end-user, the technical complexity of this cycle is invisible. What they experience is an assistant that understands not just their words, but the context of their digital environment. When they ask for help, the agent can see what they see and provide immediate, on-screen assistance. This creates an interaction that feels direct, responsive, and genuinely helpful.
Ultimately, this toolkit provides the fundamental grammar for teaching an AI the language of the web. By giving a voice agent the digital eyes to see and hands to act, developers can move beyond simple conversational interfaces and begin to build the next generation of sophisticated, autonomous assistants that can navigate the web's complexities on our behalf.
Ready to put this knowledge into practice? In Part 4: Building a Website-Controlling Voice Agent with ElevenLabs and Webfuse, we'll walk through a complete hands-on implementation, showing you how to configure your agent, build the Session Extension, and deploy a working prototype.