Building a Voice Agent with Vapi and Webfuse

Voice agents that can control a live website represent a new frontier in human-computer interaction. Instead of clicking through menus and forms, a user simply speaks - and the agent navigates, clicks, types, and reads on their behalf. In a previous tutorial, we explored how to achieve this with ElevenLabs. In this guide, we take the same concept and build it with Vapi, a developer-focused voice AI platform that supports client-side tool calling out of the box.

By the end of this tutorial, you will have a working Webfuse Extension that:

Renders a floating voice orb on any proxied website
Connects to a Vapi-powered voice assistant on click
Gives the assistant the ability to read the page, click elements, type into fields, press keys, and navigate - all through the Webfuse Automation API

What You Will Need:

A Vapi Account: Free tier available at vapi.ai. You will need your Public Key.
A Webfuse Account: Required to create a Space and deploy the extension. Sign up at webfuse.com.
Node.js 18+ and pnpm (or npm/yarn) installed locally.

The full source code is available on GitHub.

Why Vapi?

Vapi is a voice AI platform built for developers. Unlike platforms that abstract away the model layer, Vapi gives you full control over the LLM provider, system prompt, voice, transcriber, and - critically - client-side tools. This means the voice agent's tool calls can be handled entirely within the browser, with no server-side relay required.

For our use case, this is ideal. The Webfuse Automation API is a client-side API available within Session Extensions. When the voice agent decides to click a button, it issues a tool call that executes directly in the browser session - no round-trip to a server. No assistant needs to be pre-configured in the Vapi Dashboard - the model, prompt, voice, and tools are all defined inline in code.

Step 1: Clone, Install, and Build

git clone https://github.com/webfuse-com/extension-vapi-voice-agent.git
cd extension-vapi-voice-agent
pnpm install
pnpm build

That's it. The build produces a dist/ folder with everything Webfuse needs:

dist/
  background.js    # Auto-opens the popup on session start
  content.js       # Automation API relay
  popup.html       # Orb widget UI
  popup.js         # Vapi SDK + tools (bundled)
  manifest.json    # Extension manifest

Step 2: Deploy to Webfuse

Go to Webfuse Studio and create a Space (Solo is perfect for this use case)
In the Space, open Settings (gear icon) > Extensions > Install extension
Click Load unpacked in Default Storage and select the dist/ folder

Step 3: Configure Your API Key

You can set your VAPI_PUBLIC_KEY either in the manifest.json before building, or directly in Webfuse Studio after uploading:

In the Extensions panel, click on Vapi Voice Widget
Click Configure next to Environment Variables
Set VAPI_PUBLIC_KEY to your Vapi public key (found in the Vapi Dashboard)

Open a Session in your Space. The orb appears automatically. Click it, grant microphone access, and start talking.

How It Works

With the extension running, let's look under the hood at how the pieces fit together.

Diagram showing how the Vapi voice extension fits together with Webfuse

Architecture

The extension is structured around three components, each running in a distinct context within Webfuse:

Component	Context	Role
Popup (`popup.ts`)	Extension page	Runs the Vapi SDK, handles voice + audio, renders the orb UI, processes tool calls
Content (`content.ts`)	Tab page	Thin automation relay - receives tool call messages and executes them via the Webfuse Automation API
Background (`background.ts`)	Service worker	Auto-opens the popup when the session starts

For a detailed investigation into why this architecture is necessary, see the REPORT.md file in the GitHub repository.

The Voice Connection

The popup initializes the Vapi SDK and starts a call with a fully inline assistant configuration:

vapi.start({
  model: {
    provider: "openai",
    model: "gpt-4o",
    messages: [{ role: "system", content: systemPrompt }],
    tools: vapiTools,
  },
  transcriber: { provider: "deepgram", model: "nova-2", language: "en" },
  voice: { provider: "vapi", voiceId: "Elliot" },
  name: "Webfuse Assistant",
  firstMessage: "Hey! I can help you interact with this page. What would you like to do?",
  clientMessages: ["tool-calls", "transcript"],
});

The critical field is clientMessages: ["tool-calls", "transcript"]. This tells Vapi to deliver tool call events to the client SDK rather than routing them server-side. When the model decides to call click_element, the event arrives as a message in the popup, where our handler executes it locally.
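As a rough sketch, the popup could pick tool calls out of incoming messages like this. The message shape (a `toolCallList` array whose entries carry `function.name` and `function.arguments`) is an assumption modeled on the OpenAI-style tool call format, so check it against the payload your SDK version actually delivers. `parseToolCalls` and `ParsedToolCall` are hypothetical names, not part of the extension.

```typescript
// Assumed shape of a client-side "tool-calls" message; verify against
// the payload your Vapi SDK version actually emits.
interface ToolCallsMessage {
  type: string;
  toolCallList?: Array<{
    id: string;
    function: { name: string; arguments: string | Record<string, unknown> };
  }>;
}

// Hypothetical normalized form handed to the tool handlers.
interface ParsedToolCall {
  id: string;
  name: string;
  params: Record<string, unknown>;
}

// Returns the tool calls in a message, or an empty list for other message types.
function parseToolCalls(message: ToolCallsMessage): ParsedToolCall[] {
  if (message.type !== "tool-calls") return [];
  return (message.toolCallList ?? []).map((call) => {
    const raw = call.function.arguments;
    // Arguments may arrive as a JSON string or as an already-parsed object.
    const params: Record<string, unknown> =
      typeof raw === "string" ? (raw.trim() ? JSON.parse(raw) : {}) : raw;
    return { id: call.id, name: call.function.name, params };
  });
}
```

In the popup this would typically sit inside a `vapi.on("message", ...)` listener, with each parsed call passed on to the matching tool handler.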

When a tool call completes, the result is injected back into the conversation as a system message using vapi.send():

vapi?.send({
  type: "add-message",
  message: {
    role: "system",
    content: `[Tool "${name}" result]: ${resultStr}`,
  },
});

This ensures the model can read the output of its own tool calls - for example, after taking a DOM snapshot, the model receives the HTML and can describe what it sees or decide which element to target next.
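The snippet above leaves open how `resultStr` is produced. One possible approach, sketched here under assumptions, is to serialize non-string results as JSON and cap very large outputs such as DOM snapshots. The helper name `buildToolResultMessage` and the `MAX_RESULT_CHARS` limit are hypothetical, and the original extension may not truncate at all.

```typescript
// Hypothetical cap on result size; the original extension may not truncate.
const MAX_RESULT_CHARS = 20000;

interface AddMessagePayload {
  type: "add-message";
  message: { role: "system"; content: string };
}

// Builds the payload that would be passed to vapi.send() after a tool finishes.
function buildToolResultMessage(name: string, result: unknown): AddMessagePayload {
  // Strings pass through unchanged; everything else is serialized as JSON.
  // JSON.stringify returns undefined for values like `undefined`, hence the fallback.
  let resultStr =
    typeof result === "string" ? result : JSON.stringify(result) ?? String(result);
  if (resultStr.length > MAX_RESULT_CHARS) {
    resultStr = resultStr.slice(0, MAX_RESULT_CHARS) + " [truncated]";
  }
  return {
    type: "add-message",
    message: { role: "system", content: `[Tool "${name}" result]: ${resultStr}` },
  };
}
```

The returned object can be handed directly to `vapi?.send(...)`. Capping the length is a design choice to keep a large snapshot from crowding out the rest of the conversation; tune or drop it as needed.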

Automation Tools

The file src/tools.ts defines the bridge between Vapi's tool calling system and the Webfuse Automation API. Each tool has a handler (the function that executes) and a definition (the schema Vapi sends to the LLM).

Here is the tool handler for clicking an element:

case "click_element": {
  await delegateAutomation("act", "click", params.target, {
    moveMouse: true,
    scrollIntoView: true,
  });
  return `Clicked "${params.target}"`;
}
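The handler is one half of a tool; the other half is the definition sent to the LLM. A sketch of what the definition for click_element might look like follows, assuming an OpenAI-style function schema. The description strings, the `FunctionToolDefinition` interface, and the `missingParams` helper are illustrative and not taken from the repository; only the tool name and the `target` parameter come from the handler above.

```typescript
// Assumed OpenAI-style function tool schema; confirm field names against
// the tool format your Vapi SDK version expects.
interface FunctionToolDefinition {
  type: "function";
  function: {
    name: string;
    description: string;
    parameters: {
      type: "object";
      properties: Record<string, { type: string; description: string }>;
      required: string[];
    };
  };
}

const clickElementDefinition: FunctionToolDefinition = {
  type: "function",
  function: {
    name: "click_element",
    description: "Click an element on the current page.",
    parameters: {
      type: "object",
      properties: {
        target: {
          type: "string",
          description: "Webfuse ID or CSS selector of the element to click.",
        },
      },
      required: ["target"],
    },
  },
};

// Lists required parameters that are absent from a tool call's arguments,
// so a handler can report a clear error instead of acting on bad input.
function missingParams(
  definition: FunctionToolDefinition,
  params: Record<string, unknown>,
): string[] {
  return definition.function.parameters.required.filter((key) => !(key in params));
}
```

Checking `missingParams` before dispatching gives the model a readable error to recover from when it omits an argument.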

The delegateAutomation helper sends a message from the popup to the content script, which calls the corresponding method on browser.webfuseSession.automation:

function delegateAutomation(
  automationScope: string,
  automationMethod: string,
  ...automationArgs: any[]
): Promise<any> {
  return browser.tabs.sendMessage(0, {
    automationScope,
    automationMethod,
    automationArgs,
  });
}

This delegation is necessary because the Automation API is only available in content scripts - it operates on the live page DOM.
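On the receiving side, the content script only needs to look up the requested method and call it. The following is a minimal sketch of that relay logic, not the repository's actual content.ts. The names `runAutomation` and `AutomationRoot` are hypothetical, and the automation object is passed in as a parameter so the function can be exercised outside a Webfuse session.

```typescript
// Message format sent by delegateAutomation in the popup.
interface AutomationMessage {
  automationScope: string;
  automationMethod: string;
  automationArgs: unknown[];
}

// Hypothetical structural type for an object shaped like
// browser.webfuseSession.automation (scopes such as "act" or "see").
type AutomationRoot = Record<
  string,
  Record<string, (...args: unknown[]) => unknown>
>;

// Resolves root[scope][method] and calls it with the forwarded arguments.
// Returns whatever the method returns, which may be a promise.
function runAutomation(root: AutomationRoot, msg: AutomationMessage): unknown {
  const scope = root[msg.automationScope];
  const method = scope?.[msg.automationMethod];
  if (typeof method !== "function") {
    throw new Error(
      `Unknown automation method: ${msg.automationScope}.${msg.automationMethod}`,
    );
  }
  // Call through the scope object so `this` is preserved inside the method.
  return method.apply(scope, msg.automationArgs);
}
```

In the content script, a message listener would call `runAutomation(browser.webfuseSession.automation, message)` and return the result to the popup. The exact listener wiring is an assumption here; see the repository for the real implementation.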

The full set of tools:

Tool	Automation Method	Purpose
`take_dom_snapshot`	`see.domSnapshot()`	Read the page structure with Webfuse IDs
`click_element`	`act.click()`	Click buttons, links, any element
`type_text`	`act.type()`	Type into input fields
`press_key`	`act.keyPress()`	Press Enter, Escape, Tab, arrow keys
`navigate_to`	`automation.navigate()`	Go to a different URL

The Orb UI

The floating orb widget is defined in popup.html. It uses CSS ping-ring animations to communicate the current call state visually. The states are driven by CSS classes toggled from the popup script in response to Vapi events:

vapi.on("speech-start", () => setState("ai-speaking"));
vapi.on("speech-end", () => setState("connected"));
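To illustrate how such a `setState` function might work, here is a small sketch that maps call events to states and swaps the matching CSS class. Only "connected" and "ai-speaking" appear in the original code. The other state names, the `EVENT_TO_STATE` table, and `applyOrbState` are assumptions for illustration, and the `ClassListLike` interface stands in for an element's `classList` so the logic runs outside a browser.

```typescript
// "connected" and "ai-speaking" come from the article; the rest are hypothetical.
type OrbState = "idle" | "connected" | "ai-speaking" | "error";

const ALL_STATES: OrbState[] = ["idle", "connected", "ai-speaking", "error"];

// Assumed mapping from Vapi SDK event names to orb states.
const EVENT_TO_STATE: Record<string, OrbState> = {
  "call-start": "connected",
  "speech-start": "ai-speaking",
  "speech-end": "connected",
  "call-end": "idle",
  "error": "error",
};

// Minimal subset of DOMTokenList used below.
interface ClassListLike {
  add(...tokens: string[]): void;
  remove(...tokens: string[]): void;
}

// Removes every state class, then adds the new one, so exactly one
// state class is active and unrelated classes are left untouched.
function applyOrbState(classList: ClassListLike, state: OrbState): void {
  classList.remove(...ALL_STATES);
  classList.add(state);
}
```

In the popup, `setState` could then be a thin wrapper that calls `applyOrbState` with the orb element's `classList`, leaving the animations themselves to the stylesheet.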

The orb also includes error handling. If a user clicks the orb without configuring their VAPI_PUBLIC_KEY, a toast notification slides in with a clear message guiding them to the extension settings.

The Manifest

The manifest.json includes host_permissions for all domains the Vapi SDK communicates with:

"host_permissions": [
    "https://cdn.jsdelivr.net/*",
    "https://api.vapi.ai/*",
    "https://*.daily.co/*",
    "https://c.daily.co/*",
    "wss://*.daily.co/*",
    "https://*.ingest.sentry.io/*"
]

Note that wss://*.daily.co/* is listed separately - WebSocket Secure URLs require their own permission. The env array defines VAPI_PUBLIC_KEY, which can be configured per-Space in Webfuse Studio without rebuilding.

What Happens Next

This extension is a starting point. From here, you might consider:

Expanding the toolset. The Webfuse Automation API offers additional methods like act.select() for dropdowns, act.textSelect() for highlighting text, and see.guiSnapshot() for sending a screenshot to the model. Each can be added as a new tool in src/tools.ts.
Adding the Session MCP Server. For more advanced orchestration, you can connect the Webfuse Session MCP Server to route automation through the Model Context Protocol, enabling multi-agent workflows and external tool registries.
Refining the system prompt. The default prompt in src/tools.ts is intentionally minimal. A production agent would benefit from detailed instructions about how to interpret DOM snapshots, when to use Webfuse IDs versus CSS selectors, and how to handle error recovery.

The full source code is available at github.com/webfuse-com/extension-vapi-voice-agent.
