Architecture of a Web-Controlling Voice Agent

October 12, 2025 · 10 min read

In the previous article, we established that for a voice agent to provide genuine, interactive assistance, it must operate directly within the user's live browser session. This client-side approach avoids the latency of server-side automation and the limitations of direct API calls, creating a truly collaborative experience. However, this conclusion presents a major architectural challenge: how can an AI agent be safely injected into a third-party website it doesn't control, without requiring the end-user to install any software?

Solving this requires a system that can create an interactive overlay on any web application, effectively virtualizing the browsing session. This is the core principle behind the Webfuse platform. Instead of attempting to modify the target website’s source code or relying on vendor cooperation, Webfuse establishes a proxy-based virtualization layer that intercepts and modifies web content in real-time. This creates a secure, sandboxed environment where new functionality, such as an ElevenLabs voice agent, can be introduced.

This article provides a technical deep dive into the Webfuse architecture. We will move from the "what" to the "how," explaining the mechanics of Virtual Web Sessions and Session Extensions. We will break down how this platform bridges the gap between a cloud-based AI agent and the live Document Object Model (DOM) of any website, enabling the powerful client-side tool calling necessary for seamless, on-screen automation.

The Architectural Foundation: Virtual Web Sessions

To execute code within a third-party website, a platform must first establish a controlled environment around it. This is the primary function of Webfuse's architecture, which is built on the concept of Virtual Web Sessions. A Virtual Web Session is a proxied and augmented instance of a web application, created in real-time. From the end-user's perspective, they are interacting with the target website, but every piece of data—every request and response—is routed through Webfuse's core component: the Augmented Web Proxy.

This proxy acts as an intelligent intermediary between the user's browser and the target application's servers. When a user accesses a Webfuse link (e.g., webfuse.com/+myspace?url=https://targetwebsite.com), the proxy initiates a session. It fetches the original website's content on the user's behalf and then dynamically modifies it before delivering it to the user's browser. This on-the-fly transformation is what creates the virtual session, and it comprises two major operations (sketched in code after the list):

  1. Domain and URL Rewriting: The first and most important step is to ensure that all future interactions remain within the Webfuse environment. The proxy achieves this by rewriting the domain of the target application. For example, a request to targetwebsite.com is served from a Webfuse-controlled domain. More importantly, the proxy scans the entire content of the HTML, CSS, and JavaScript files for any URLs. It systematically finds and alters every resource path—image sources (<img>), link destinations (<a>), and script locations—to route them back through the proxy. This prevents the user's browser from making direct requests to the original server, which would break the virtual session.
  2. Header and Security Policy Modification: Modern web applications use security headers like Content Security Policy (CSP) or X-Frame-Options to prevent them from being embedded or modified by external sources. These are important security measures for a standard website, but they would normally block the kind of augmentation Webfuse is designed to perform. The Augmented Web Proxy intelligently modifies these headers in the response before it reaches the browser. It adjusts the policies to permit the injection of new scripts and UI elements, allowing the voice agent and other extensions to function without being blocked by the browser's security model.
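
To make these two operations concrete, the sketch below shows a deliberately naive version of the transformation in JavaScript. The helper names and the regex-based rewriting are illustrative assumptions only; the actual Augmented Web Proxy also rewrites URLs inside CSS and JavaScript and handles requests constructed at runtime.

    // Illustrative sketch only: a naive version of the two transformations
    // performed by the Augmented Web Proxy. PROXY_ORIGIN and the helper
    // names are hypothetical.
    const PROXY_ORIGIN = "https://webfuse.com/+myspace";

    function rewriteHtml(html) {
      // 1. Domain and URL rewriting: route every absolute resource URL
      //    (href/src attributes) back through the proxy.
      return html.replace(
        /(href|src)=["'](https?:\/\/[^"']+)["']/g,
        (_, attr, url) => `${attr}="${PROXY_ORIGIN}?url=${encodeURIComponent(url)}"`
      );
    }

    function rewriteHeaders(headers) {
      // 2. Header and security policy modification: drop headers that would
      //    block embedding or script injection in the virtual session.
      const h = new Headers(headers);
      h.delete("content-security-policy");
      h.delete("x-frame-options");
      return h;
    }

    async function proxyFetch(targetUrl) {
      const upstream = await fetch(targetUrl);
      const html = rewriteHtml(await upstream.text());
      return new Response(html, { headers: rewriteHeaders(upstream.headers) });
    }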

Through this process, Webfuse creates a completely self-contained and isolated browsing context. The original website operates as intended, but it is now wrapped in a virtualization layer. This layer is the foundation upon which additional capabilities can be built. It provides a secure and reliable execution environment where custom code can be introduced, all without altering the source code of the target application or requiring any installation on the user's device. This sets the stage for the next logical component: the mechanism for injecting and running the agent's code.

The Role of Session Extensions

With a stable Virtual Web Session established, the next architectural step is to run the voice agent and its associated logic. This is accomplished through Webfuse Session Extensions. A Session Extension is a package of custom JavaScript, HTML, and CSS that is loaded directly into the virtualized environment. Functionally, these extensions are similar to standard browser extensions (like those for Chrome or Firefox), but with one major difference: they run entirely within the Webfuse platform, not in the end-user's browser. This distinction is what upholds the zero-installation promise for the user.

When a user initiates a Virtual Web Session, the Augmented Web Proxy injects the extension's files into the sandboxed browsing context. For our use case, the extension performs two primary functions:

  1. Establishing the Agent Overlay: Instead of directly injecting the agent into the target website's main content, the Session Extension is configured to launch the agent in a separate, non-intrusive UI layer. The extension uses the browser's built-in pop-up mechanism (popup.html) to host the <elevenlabs-convai> widget. A dedicated script (popup.style.js) detaches this pop-up and positions it as a persistent, floating overlay on the screen (a simplified sketch of this positioning follows the list). This creates a dedicated, always-accessible voice interface that does not interfere with the target website's original layout, styling, or DOM structure.
  2. Providing the Toolset via the Extension API: The more important function of the extension is to serve as the bridge between the agent's brain and the webpage's body. The extension's JavaScript code (popup.agent.js) runs within the secure, sandboxed environment provided by Webfuse (in the context of the pop-up). This sandbox gives the extension access to the powerful Webfuse Extension API, a client-side JavaScript API designed for interacting with the virtual session.
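
As a hedged illustration of the first function, a script along the lines of popup.style.js might pin the pop-up as a floating overlay. How Webfuse actually detaches and manages the pop-up is internal to the platform; the element ID and styling below are assumptions.

    // Hypothetical sketch of popup.style.js: pin the pop-up hosting the
    // <elevenlabs-convai> widget as a persistent floating overlay.
    function floatAgentOverlay(popupRoot) {
      Object.assign(popupRoot.style, {
        position: "fixed",    // stays visible while the page scrolls
        bottom: "24px",
        right: "24px",
        width: "360px",
        zIndex: "2147483647", // sit above the target site's own UI
      });
    }

    // "agent-popup" is an assumed ID for the detached pop-up's root element.
    floatAgentOverlay(document.getElementById("agent-popup"));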

This is where tool calling becomes a practical reality. The extension's JavaScript defines a set of functions that correspond to the tools the ElevenLabs agent needs, such as take_dom_snapshot(), left_click(), and type(). These client-side tool functions are then passed to the ElevenLabs agent widget during its initialization within the overlay.
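
In code, the wiring might look roughly like this. The Automation API calls match those described in this article; the elevenlabs-convai:call event and the clientTools property reflect the widget's client-tools registration mechanism, but verify the exact names against the current ElevenLabs documentation.

    // Sketch of popup.agent.js: client-side tools that wrap the Webfuse
    // Automation API, handed to the ElevenLabs widget as client tools.
    const tools = {
      // Perceive: return an LLM-friendly snapshot of the current page.
      take_dom_snapshot: async () =>
        browser.webfuseSession.automation.take_dom_snapshot(),

      // Act: click the element matching the given CSS selector.
      left_click: async ({ selector }) =>
        browser.webfuseSession.automation.left_click(selector),

      // Act: type text into the element matching the given CSS selector.
      type: async ({ text, selector }) =>
        browser.webfuseSession.automation.type(text, selector),
    };

    // Register the tools when the widget initializes (event and property
    // names are assumptions to verify against the ElevenLabs docs).
    document.addEventListener("elevenlabs-convai:call", (event) => {
      event.detail.config.clientTools = tools;
    });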

The operational flow is as follows:

  1. The user speaks a command to the ElevenLabs voice agent in the floating overlay.
  2. The agent's LLM processes the command and determines that an action is needed, such as clicking a button.
  3. The agent selects its left_click tool and calls it with the necessary parameter (the CSS selector for the button).
  4. This call is received by the corresponding JavaScript function within our Session Extension, running in the pop-up context.
  5. The function then uses the Webfuse Extension API to execute the command, for example, by calling browser.webfuseSession.automation.left_click(selector).
  6. Webfuse executes this command within the virtual session, and the user sees the button being clicked on their screen in real time.

In this model, the Session Extension acts as the vital link. It translates the abstract intentions of the AI agent into concrete actions on the webpage, using the secure and capable API provided by the Webfuse platform. This architecture allows a cloud-based AI to gain precise, real-time control over any website, effectively giving it the "hands and eyes" needed to perform tasks directly within the user's live experience.

The Automation API: A Closer Look at the Agent's Toolkit

The bridge between the Session Extension and the live webpage is the Webfuse Automation API. This is a specialized, high-level component of the broader Extension API, designed specifically for agentic AI interaction. It provides a set of commands that are more akin to human actions than traditional, low-level DOM manipulation. Instead of writing complex JavaScript to find an element and then dispatching a synthetic click event, the agent can issue a simple, intent-driven command. This abstraction is what makes the integration both powerful and reliable.
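
The difference is easiest to see side by side. The first snippet below is the kind of hand-rolled DOM scripting an integrator would otherwise maintain; the second is the equivalent intent-driven call (the #submit-order selector is hypothetical):

    // Low-level approach: find the element yourself and synthesize an event.
    const button = document.querySelector("#submit-order");
    button.scrollIntoView();
    button.dispatchEvent(new MouseEvent("click", { bubbles: true }));

    // Intent-driven approach: one command expressing what the agent wants
    // (called from an async context).
    await browser.webfuseSession.automation.left_click("#submit-order");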

The Automation API is structured around the two primary functions of any agent: perceiving its environment and acting within it.

1. Perceiving the Environment: The Eyes of the Agent

Before an agent can act, it must first understand the context of the user's screen. A raw HTML dump is often too verbose and noisy for a large language model to process effectively. The Automation API addresses this with the take_dom_snapshot() command. This tool does more than simply grab the page source; it processes the DOM into a structured and succinct format that is optimized for an LLM. It can be configured to downsample the content, reducing a complex webpage into a clean representation that preserves the essential layout, interactive elements (like buttons and forms), and textual content.

When the ElevenLabs agent needs to understand the current page, it calls this tool. The Session Extension executes browser.webfuseSession.automation.take_dom_snapshot(), and the resulting structured text is fed back to the LLM. This gives the agent a clear, machine-readable "view" of the webpage, allowing it to identify the necessary CSS selectors for subsequent actions.
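
A minimal sketch of this perception step, assuming take_dom_snapshot accepts an options object for downsampling (the option names here are illustrative, not confirmed API):

    // Perception sketch: capture a downsampled, LLM-friendly view of the page.
    // The option names (maxDepth, interactiveOnly) are illustrative
    // assumptions; consult the Webfuse Extension API reference.
    async function describePageForAgent() {
      const snapshot = await browser.webfuseSession.automation.take_dom_snapshot({
        maxDepth: 10,           // trim deeply nested wrapper elements
        interactiveOnly: false, // keep headings and text for context
      });
      return snapshot; // structured text the LLM scans for CSS selectors
    }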

2. Taking Action: The Hands of the Agent

Once the agent has perceived the page and identified its target, it uses a series of action-oriented tools. These commands are designed to simulate direct user interaction:

  • left_click(selector): This command performs a click on the element specified by the CSS selector.
  • type(text, selector): This command focuses the targeted input field and types the provided text, character by character.
  • scroll(direction, amount): This allows the agent to scroll the page up or down to bring different elements into view.

These high-level commands are more resilient than custom scripts. They are designed to interact with the webpage's UI layer, much like a human user would, making them less likely to break due to minor changes in the target site's underlying code. The Session Extension simply translates the agent's tool call into the corresponding Automation API command. For example, a request from the agent to fill in a username field becomes a call to browser.webfuseSession.automation.type("test_user", "#username"). The user then sees the text appear on their screen instantly.
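
Putting the three tools together, a typical form-fill sequence issued from an async tool handler might look like this (the selector is hypothetical, and the scroll amount is assumed to be in pixels):

    // Action sketch: bring the form into view, focus the username field,
    // and type the value, just as a human user would.
    await browser.webfuseSession.automation.scroll("down", 400);
    await browser.webfuseSession.automation.left_click("#username");
    await browser.webfuseSession.automation.type("test_user", "#username");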

By providing this specialized toolkit, the Webfuse architecture abstracts away the complexity of browser automation. The entire process occurs within the secure sandbox of the Virtual Web Session, ensuring that the agent's powerful capabilities are exercised in a controlled and safe manner.

Conclusion: A Unified Architecture for Agentic AI

The Webfuse architecture, combining Virtual Web Sessions, Session Extensions, and a high-level Automation API, provides a complete and cohesive solution for the challenges of web-based AI interaction. It creates a purpose-built environment where a cloud-based AI, like an ElevenLabs voice agent, can safely and effectively control any website directly within a user's live session.

Let's trace the entire process from start to finish with a simple user request: "Fill in my username."

  1. Initiation: The user accesses a Webfuse link. The Augmented Web Proxy intercepts the request and establishes a Virtual Web Session, rewriting all content to flow through its controlled environment.
  2. Injection: As the target page is loaded, the proxy injects the Session Extension. This extension adds the ElevenLabs agent widget to the page and prepares its toolkit of functions.
  3. Perception: When the user speaks, the agent first needs to understand the page. It calls the take_dom_snapshot tool. The Session Extension executes this command via the Automation API, generating a clean, LLM-optimized representation of the page's content, which is sent back to the agent.
  4. Action: The agent's LLM processes the user's command and the DOM snapshot, identifies the correct CSS selector for the username field, and selects its type tool. It invokes the tool with the username text and the identified selector.
  5. Execution: The Session Extension receives the call and translates it into a Webfuse Automation API command: browser.webfuseSession.automation.type(...). This command is executed within the sandboxed virtual session.
  6. Feedback: The user sees the username field fill in on their screen in real-time, providing immediate visual confirmation that their command was understood and executed.

This tightly integrated, client-side approach directly solves the core problems:

  • Universal Applicability: By interacting with the rendered DOM, the architecture is not dependent on vendor-specific APIs. It can be applied to virtually any website.
  • Real-Time Collaboration: The agent and the user share the same session. Actions are instant and visible, eliminating the latency and disconnect inherent in server-side automation.
  • Simplified Session Management: The agent operates within the user's existing login state, avoiding the complex and often insecure need to manage separate credentials.
  • A Secure, Sandboxed Environment: All actions are mediated through the Webfuse platform's APIs, providing a controlled and secure execution context that protects both the user and the target application.

Ultimately, the architecture of voice agents deployed via a general-purpose web augmentation platform demonstrates that the most effective path to building assistive AI is not to operate from a distance, but to embed the agent's intelligence directly into the user's live session, creating a web experience that is truly interactive, helpful, and responsive.

Now that you understand the architecture, Part 3: Perception and Action Tools will dive into the specific capabilities these tools provide. Then in Part 4, we'll walk through a hands-on implementation using ElevenLabs and Webfuse.
