Voice agents are becoming a major part of how users interact with websites and applications. These AI-driven systems can understand and respond to spoken language, allowing for natural, hands-free conversation. Instead of clicking through menus or typing in search bars, users can simply state what they need, and the agent can guide them, answer questions, or perform tasks.
However, the true potential of a voice agent is realized when it can move beyond conversation and take direct action within the application. This is made possible through a capability known as tool calling, or function calling, which allows the agent to interact with and control its environment, bridging the gap between spoken commands and tangible results.
For developers, this means equipping the agent with a set of tools—defined functions or APIs—it can use to fulfill user requests. When a user asks for something that requires an action, like filling out a form or navigating to a new page, the agent intelligently selects and executes the appropriate tool.
To make a voice agent truly effective on the web, it needs a reliable way to perform these actions. There are three primary methods for enabling an agent to interact with a website. In this article, we'll explore the trade-offs of each approach, and then in Part 2, we'll dive deep into implementing the most practical method using Webfuse.
- Server-Side Automation: This approach uses tools like Puppeteer or Playwright to control a web browser running on a server. The agent sends commands to this remote browser to navigate pages and manipulate content.
- Direct API Calls: If a website exposes a public API, an agent can be programmed to interact with it directly. This is a stable way to perform actions like fetching data or submitting information.
- Client-Side Tool Calling: This method embeds the agent's tools directly within the user's web session. The agent becomes a participant in the user's live experience, seeing what the user sees and acting on the same page. This is the approach Webfuse specializes in. By using a Webfuse Extension, an ElevenLabs voice agent can be injected into any website.
Before examining the methods for connecting an agent to a website, it helps to first understand the core mechanics: How does a voice agent translate a spoken request into a concrete, executable task?
Understanding Tool Calling: How Voice Agents Take Action
Voice agents are powered by a large language model (LLM) that excels at understanding and generating human language. By itself, however, the model is confined to conversation; it can talk, but it can't act. Tool calling is the mechanism that gives the agent its hands and eyes, allowing it to interact with the digital world beyond simple text-based responses. It transforms the agent from a passive conversationalist into an active participant capable of executing tasks.
Think of a smart home assistant. When you ask it for the weather, it uses a weather tool to fetch data from an external service. When you ask it to turn on the lights, it uses a smart home tool to send a command to your light switch. In each case, the agent follows a distinct process:
- Intent Recognition: The user makes a request, such as "Help me fill out this application form" or "Navigate to the contact page." The LLM analyzes this request to understand the user's underlying goal, or intent.
- Tool Selection: Based on the recognized intent, the agent consults its predefined set of available tools. It determines which tool is best suited for the job. For a navigation request, it would select a "relocate" or "redirectToURL" tool. To fill a form, it would need "type" and "leftClick" tools.
- Parameter Formulation: Once a tool is selected, the agent identifies the necessary information, or parameters, to execute it. For the "type" tool, the parameters would be the text to enter and a CSS selector to identify the correct input field on the page.
- Execution: The agent calls the selected tool with the formulated parameters. The tool then performs the action in its designated environment—be it making an API call, running a browser command, or manipulating the current webpage.
- Response and Confirmation: After the tool has run, it may return a status or result. The agent uses this information to provide feedback to the user, confirming that the action was completed or reporting any issues that arose.
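To make the selection and parameter steps concrete, here is what a definition for the "type" tool might look like, expressed in the JSON-schema style most LLM providers use for function calling. The field names and structure are illustrative; each provider has its own exact format.

```typescript
// Hypothetical definition of the "type" tool from the steps above, in the
// JSON-schema style commonly accepted for LLM function calling.
// Field names are illustrative, not a specific vendor's format.
const typeTool = {
  name: "type",
  description: "Type the given text into an input field on the current page.",
  parameters: {
    type: "object",
    properties: {
      selector: {
        type: "string",
        description: "CSS selector identifying the target input field",
      },
      text: {
        type: "string",
        description: "The text to enter into the field",
      },
    },
    required: ["selector", "text"],
  },
};
```

At runtime, the model responds with a structured call such as `type({ selector: "#email", text: "user@example.com" })`, which the host application then executes in whichever environment the tool targets.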
This capability is what makes modern AI agents so useful. By equipping them with a well-defined set of tools, developers can grant them the power to perform complex, multi-step tasks autonomously. The agent can see the state of a webpage, reason about the next logical action, and execute it, creating a fluid, interactive experience for the user. The effectiveness of this entire process, however, depends greatly on how these tools are implemented, which brings us to the different methods of connecting an agent to a website.
Method 1: Server-Side Automation with Puppeteer and Playwright
One of the most established methods for enabling an agent to interact with a website is through server-side automation. This approach uses libraries like Google's Puppeteer or Microsoft's Playwright to programmatically control a web browser instance running on a server. The voice agent's "tools" are functions that send commands to this remote browser, instructing it to perform actions just as a human user would.
The process works as follows: a user speaks a command to the voice agent. The agent, running on a backend server, translates this into a series of instructions. These instructions are then executed by a tool like Puppeteer, which drives a headless browser (a browser without a graphical user interface) to navigate to the target website, click elements, and enter text. For the agent to understand the page, it can command the server-side browser to send back the website's HTML content or take a screenshot, which the agent can then analyze.
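As a rough sketch, a server-side "type" tool built on Puppeteer's documented API might look like the following. The function name, options, and return value are illustrative assumptions, not a production implementation.

```typescript
import puppeteer from "puppeteer";

// Sketch of a server-side "type" tool: drive a headless browser to enter
// text into a field, then return the page HTML so the agent can analyze
// the result. URL and selector handling are illustrative assumptions.
async function serverSideType(
  url: string,
  selector: string,
  text: string
): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    await page.type(selector, text); // simulates keystrokes in the remote browser
    return await page.content();     // HTML snapshot for the agent to reason over
  } finally {
    await browser.close();           // always release the server-side browser
  }
}
```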
Advantages of Server-Side Automation
This method offers a high degree of control and capability. Since it operates a full browser, it can handle complex JavaScript-heavy websites and perform almost any action a user can, from filling out multi-page forms to navigating through intricate menus. The entire process is managed centrally on the backend, which can simplify deployment and scaling, as no modifications are needed on the website itself or the user's machine.
Major Limitations
The primary drawback of server-side automation is the disconnect between the agent's environment and the user's. The agent is interacting with a completely separate browser session on a remote server. The user cannot see the agent's actions in real-time on their own screen. If the agent fills out a form, the user only sees the result after the fact, perhaps when the page reloads with new information.
This creates several issues for interactive assistance:
- Lack of Collaboration: It's not a collaborative experience. The user and the agent are in two different worlds, making it impossible for the agent to guide the user visually on their own screen.
- Session Synchronization: Keeping the user's and the agent's sessions in sync is a major challenge. If a user is logged into an account, the agent must also log in separately in its own browser, which can be difficult to manage securely and may not even be possible with some authentication systems.
- Latency and Feedback: There is a noticeable delay between the user's request, the action on the server, and the feedback. This makes the interaction feel slow and disjointed rather than immediate and conversational.
Because of these limitations, server-side automation is most suitable for non-interactive, backend tasks like data scraping, automated testing, or report generation. It is less effective for use cases requiring a voice agent to provide real-time, on-screen assistance to a user.
Method 2: Direct API Calls
Another approach for enabling a voice agent is to have it interact directly with a website's Application Programming Interface (API), if one is available. An API serves as a structured gateway to a website's backend services, designed for machine-to-machine communication. In this model, the agent's tools are not browser commands but functions that make specific, authenticated HTTP requests to the website's servers to fetch data or trigger actions.
When a user issues a command like "What's the status of my recent order?", the voice agent can translate this into a direct API call. It would formulate a request to an endpoint such as /api/orders/latest, receive a structured response (typically in JSON format), and then articulate that information back to the user. This method bypasses the user interface entirely, interacting directly with the data layer.
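A tool wrapping that endpoint could be as simple as the sketch below. The endpoint path comes from the example above; the base URL, authentication scheme, and response shape are assumptions for illustration.

```typescript
// Sketch of an API-backed tool for the order-status example. The base URL,
// auth scheme, and response shape are illustrative assumptions.
interface OrderStatus {
  orderId: string;
  status: string;
}

async function getLatestOrderStatus(apiToken: string): Promise<OrderStatus> {
  const response = await fetch("https://shop.example.com/api/orders/latest", {
    headers: { Authorization: `Bearer ${apiToken}` },
  });
  if (!response.ok) {
    throw new Error(`Order API returned ${response.status}`);
  }
  return (await response.json()) as OrderStatus; // structured JSON, no UI involved
}
```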
Advantages of Direct API Calls
The major benefit of this method is its reliability. Unlike a website's visual layout, which can be redesigned frequently, APIs are built to be stable contracts. Changes are typically versioned, making integrations less likely to break unexpectedly. Actions performed via an API are also highly efficient, as they do not require the overhead of loading and rendering a full webpage. This results in faster response times for the user. For services that provide a well-documented API, this can be a very dependable way to automate specific tasks.
Major Limitations
The biggest limitation of the API-based approach is its lack of universal applicability. The reality is that the vast majority of websites do not offer public APIs that expose the full range of functionality available through their user interface. More often, available APIs are limited to a small set of data retrieval tasks and do not support complex actions like completing a multi-step form or managing account settings.
This creates several important constraints:
- Requires Custom Integration: This method is not a general solution. It demands a bespoke integration for each target website. An agent equipped with tools for one company's API cannot use them on a competitor's site, making the approach difficult to scale.
- Inability to Interact with the UI: Because the agent is interacting with the backend, it is completely blind to what the user is seeing on their screen. It cannot assist with website navigation, identify which button to click, or provide contextual guidance based on the visual layout.
- Dependence on Vendor Provision: The agent's capabilities are entirely dependent on what the website owner decides to expose through their API. If a function isn't available via the API, the agent simply cannot perform it.
While suitable for specific, pre-defined integrations with services that have robust APIs, this method is not a practical solution for creating a voice agent that can assist users across a wide variety of websites.
Method 3: Client-Side Tool Calling
The third method, client-side tool calling, addresses limitations of both server-side automation and direct API calls by operating directly within the user's live browser session. Instead of controlling a remote browser or communicating with a backend server, the agent's tools are executed as JavaScript commands on the webpage the user is currently viewing. This creates a direct, real-time connection between the voice agent and the web application.
Webfuse enables this through a specialized Session Extension. The extension injects a voice agent widget, such as the ElevenLabs widget, into any website loaded within a Webfuse space, effectively making the agent a participant in the user's session. It also equips the agent with a powerful set of client-side tools, allowing it to perceive and interact with the webpage just as a user would.
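Conceptually, the injection step is simple. The sketch below shows a generic content-script approach for illustration only; it is not Webfuse's actual extension mechanism, and the embed URL is a placeholder.

```typescript
// Generic sketch of how an extension's content script might inject a voice
// widget into the current page. The embed URL is a placeholder, not the
// actual Webfuse or ElevenLabs embed code.
function injectVoiceWidget(embedUrl: string): void {
  const script = document.createElement("script");
  script.src = embedUrl; // e.g. the vendor's hosted widget bundle
  script.async = true;
  document.body.appendChild(script);
}
```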
The process is highly interactive:
- Perceiving the Environment: To understand what the user is seeing, the agent can call a tool like `take_dom_snapshot()`. This function captures the current HTML structure of the page, providing the agent with a detailed map of all the visible elements, from text content to input fields and buttons.
- Taking Action: To act on the user's behalf, the agent uses tools like `left_click()` and `type()`. When a user asks the agent to fill in a form field, the agent identifies the correct CSS selector from the DOM snapshot and executes the `type()` command with the required text. The user sees the text appear in the input field instantly (see the sketch after this list).
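In spirit, these tools map onto plain DOM APIs. The sketch below is illustrative only and not Webfuse's actual implementation; a production version would need richer event simulation and error handling.

```typescript
// Minimal sketches of the client-side tools described above, using plain
// DOM APIs. Tool names follow the examples in this article; the bodies
// are illustrative assumptions.

function take_dom_snapshot(): string {
  // Serialize the live page so the agent can reason about its structure.
  return document.documentElement.outerHTML;
}

function left_click(selector: string): void {
  const el = document.querySelector<HTMLElement>(selector);
  if (!el) throw new Error(`No element matches ${selector}`);
  el.click(); // the click happens in the user's own session, instantly visible
}

function type(selector: string, text: string): void {
  const input = document.querySelector<HTMLInputElement>(selector);
  if (!input) throw new Error(`No input matches ${selector}`);
  input.value = text;
  // Dispatch an input event so frameworks like React or Vue register the change.
  input.dispatchEvent(new Event("input", { bubbles: true }));
}
```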
Advantages of Client-Side Tool Calling
This approach offers several important benefits that make it highly suitable for interactive voice assistance:
- Real-Time Interaction: The agent and the user share the exact same view and session. When the agent performs an action, it is immediately visible on the user's screen. This creates a true co-browsing experience where the agent can guide the user visually and collaboratively.
- Universal Application: This method works on virtually any website because it interacts with the rendered HTML, the same content a user sees in their browser. It does not require any backend integration, vendor cooperation, or pre-existing APIs.
- Contextual Awareness: The agent is always aware of the user's current context because it is operating in the same environment. It knows which page is loaded and what elements are visible, allowing it to provide accurate, relevant assistance.
- Simplified Session Management: Since the agent acts within the user's existing session, there are no complex synchronization problems. If the user is logged in, the agent has the same level of access without needing to handle separate credentials.
By bringing the agent's actions directly into the user's browser, client-side tool calling creates a direct and highly responsive experience. It transforms the voice agent from an external tool into an integrated assistant, capable of providing seamless, on-screen guidance on any website.
Comparing the Methods: A Side-by-Side Look
To better understand the strengths and weaknesses of each approach, the following table provides a direct comparison across several key factors.
| Feature | Server-Side Automation | Direct API Calls | Client-Side Tool Calling |
|---|---|---|---|
| Execution Environment | Remote server-controlled browser. | Website's backend services. | User's live browser session. |
| Real-Time User Feedback | Low. Actions are hidden on a remote server. | None. Bypasses the UI entirely. | High. Actions are immediately visible on the user's screen. |
| Universality/Setup | High. Works on most websites; requires backend setup. | Low. Requires custom API from the website vendor. | Very High. Works on any website without vendor cooperation. |
| Latency | Noticeable delay due to server round-trip. | Very low, as it's a direct data exchange. | Very low. Executed locally in the user's browser. |
| Session & Login | Complex; requires separate agent login management. | Moderate; requires secure API key management. | Simple; operates within the user's existing session. |
| Best For | Non-interactive tasks like data scraping and testing. | Specific, stable data retrieval and submission tasks. | Real-time, collaborative, on-screen user assistance. |
Conclusion: Choosing the Right Path for Agent Interaction
The evolution of voice agents from conversational partners to active digital assistants marks a major shift in user experience. The ability to understand a user's intent and translate it into direct action on a website is no longer a futuristic concept but a practical reality. As we have seen, however, the effectiveness of this capability is not just about the intelligence of the agent, but about the method used to connect it to the digital world.
Each of the three primary methods—server-side automation, direct API calls, and client-side tool calling—offers a distinct set of advantages and limitations.
- Server-side automation provides a high degree of control for backend processes but falls short in creating an interactive, real-time experience for the end-user.
- Direct API calls offer a stable and efficient solution for specific, pre-defined tasks but are limited by the scarcity of comprehensive public APIs, making them unsuitable as a universal solution.
- Client-side tool calling, by operating directly within the user's live browser session, emerges as the most suitable approach for building truly assistive voice agents. It resolves the core challenges of latency and session synchronization, enabling an agent to see what the user sees and act on their behalf instantly. This method creates a genuinely collaborative environment where the agent can provide on-screen guidance and execute tasks on any website, without requiring backend access or custom integrations.
Ultimately, the choice of implementation depends on the goal. For automated, non-interactive workflows, server-side tools remain a solid choice. For integrating with specific, API-driven services, direct calls are ideal. But for the future of user assistance—where AI agents act as real-time guides to help users navigate the complexities of the web—the client-side approach provides the most direct and effective path forward. By bringing the agent's intelligence directly into the user's world, we can create web experiences that are not only conversational but genuinely helpful.
In Part 2: How to Build a Voice Agent, we'll explore the technical architecture that makes this client-side approach possible, including Virtual Web Sessions and Session Extensions. Then in Part 3, we'll examine the specific perception and action tools your agent needs to interact with web content effectively.