Alibaba's Page Agent flips the usual browser-automation model on its head: instead of driving a browser from the outside with Selenium or Puppeteer, it runs an AI agent inside the page that reads the live DOM, reasons about it with an LLM, and clicks and types as the real user. This guide explains how that works, then walks through building a working support chatbot for a fictional e-commerce store that handles a lost-package case end-to-end.
The bigger lesson is about where web agents should live. Page Agent is a strong pattern when you control the application and can add code to the page. Many production workflows are messier: they happen in third-party apps, legacy tools, customer sessions, or locked-down environments where you cannot change the source code or ask every user to install an extension. That is the gap Webfuse is built around: making the live web session itself programmable, shareable, and agent-ready without depending on the original application's release cycle.
What Alibaba Page Agent Actually Does
Page Agent is an open-source JavaScript/TypeScript library developed by Alibaba that embeds a GUI AI agent directly into a webpage. It enables natural-language control of web interfaces through direct DOM interaction and needs little or no extra infrastructure. Unlike traditional browser automation tools such as Selenium or Puppeteer, which drive the browser from an external process, Page Agent runs entirely client-side. Because it lives inside the live page, it inherits the user's session, cookies, and authentication without headless browsers or Python backends.

The library follows an "inside-out" paradigm where the agent resides within the webpage itself rather than controlling it from an external process. It works by parsing and "dehydrating" the live HTML into semantic text, which it feeds to a large language model (LLM). The LLM reasons about the necessary actions, and the agent dispatches native browser events like clicks and typing to execute them.
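The dehydration step described above can be illustrated with a small sketch. This is not Page Agent's actual implementation: `SimpleNode`, `INTERACTIVE_TAGS`, and `dehydrate` are hypothetical names, and a plain object tree stands in for the real DOM so the example runs outside a browser. It only shows the general idea of indexing interactive elements and flattening the page into text an LLM can read.

```typescript
// Hypothetical, simplified stand-in for a DOM node (not Page Agent's data model).
interface SimpleNode {
  tag: string;
  text?: string;
  attrs?: Record<string, string>;
  children?: SimpleNode[];
}

// Tags treated as interactive in this sketch (an assumption, not the library's list).
const INTERACTIVE_TAGS = new Set(['a', 'button', 'input', 'select', 'textarea']);

interface Dehydrated {
  lines: string[];                 // semantic text that would be sent to the LLM
  index: Map<number, SimpleNode>;  // maps an index like [1] back to its element
}

function dehydrate(root: SimpleNode): Dehydrated {
  const lines: string[] = [];
  const index = new Map<number, SimpleNode>();

  const visit = (node: SimpleNode, depth: number): void => {
    const indent = '  '.repeat(depth);
    if (INTERACTIVE_TAGS.has(node.tag)) {
      // Interactive elements get a numeric handle the LLM can refer to.
      const id = index.size;
      index.set(id, node);
      const label = node.text ?? node.attrs?.placeholder ?? node.attrs?.name ?? '';
      lines.push(`${indent}[${id}] <${node.tag}> ${label}`.trimEnd());
    } else if (node.text) {
      // Non-interactive text is kept as plain context.
      lines.push(`${indent}${node.text}`);
    }
    for (const child of node.children ?? []) visit(child, depth + 1);
  };

  visit(root, 0);
  return { lines, index };
}

// Example page fragment, loosely modeled on an order details view.
const page: SimpleNode = {
  tag: 'main',
  children: [
    { tag: 'h1', text: 'Order ACM-10318' },
    { tag: 'p', text: 'Status: Shipped' },
    { tag: 'button', text: 'Request refund' },
    { tag: 'input', attrs: { placeholder: 'Reason' } },
  ],
};

const snapshot = dehydrate(page);
console.log(snapshot.lines.join('\n'));
```

When the model answers with something like "click [0]", the `index` map is what lets the agent resolve that number back to a concrete element before dispatching the event.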
Core Capabilities That Make Page Agent Different
Page Agent is built with a focus on direct integration and user collaboration. It includes several features suitable for developers looking to add AI copilot functionality to their applications.
- Text-Based DOM Manipulation: It parses the live HTML, indexes interactive elements, and translates this structure into semantic text for the LLM.
- Human-in-the-Loop: The agent has a built-in UI panel that shows planned actions and seeks user confirmation before execution, preventing fully blind automation.
- Privacy-First: All DOM processing happens in the browser, and you control the LLM calls. Page content leaves the browser only through the LLM endpoint you configure.
- Bring Your Own Model (BYOM): It works with any OpenAI-compatible API, which covers models from OpenAI, Anthropic, DeepSeek, and Google, as well as local models served through Ollama or LM Studio.
- SPA Compatibility: It functions correctly with single-page applications built with frameworks like React or Vue because it interacts with the live DOM.
The project is a monorepo using npm workspaces and TypeScript. Its main packages include @page-agent/core for the core logic, page-agent for the full agent with a UI panel, and @page-agent/page-controller for DOM extraction and element indexing.
How to Add Page Agent to Your App
You can integrate Page Agent into a project using a simple script tag for quick tests or through npm for production use. The npm approach is recommended for most development workflows.
1. NPM Installation
First, add the package to your project.
npm install page-agent
Next, you can import and initialize it in your TypeScript or JavaScript code.
import { PageAgent } from 'page-agent';

const agent = new PageAgent({
  model: 'gpt-5.1', // or another model like 'qwen3.5-plus'
  baseURL: 'https://api.openai.com/v1', // Your provider's endpoint
  apiKey: 'YOUR_API_KEY',
  language: 'en-US',
});

await agent.execute('Fill out the expense report for last week\'s trip');
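Because the agent only needs an OpenAI-compatible endpoint, switching providers mostly means switching `baseURL` and `model`. The sketch below is one way to keep that choice in a single place. `AgentLlmConfig`, `BASE_URLS`, and `buildConfig` are hypothetical helpers, not part of the library; the config shape mirrors only the four options used above, and the local URLs are the default ports for Ollama and LM Studio, which may differ on your machine.

```typescript
// Mirrors only the four options shown in the example above (an assumption
// about what you need to pass, not the library's full option type).
interface AgentLlmConfig {
  model: string;
  baseURL: string;
  apiKey: string;
  language: string;
}

type Provider = 'openai' | 'ollama' | 'lmstudio';

// OpenAI-compatible base URLs. The localhost entries use each tool's
// default port; adjust them if your local server is configured differently.
const BASE_URLS: Record<Provider, string> = {
  openai: 'https://api.openai.com/v1',
  ollama: 'http://localhost:11434/v1',
  lmstudio: 'http://localhost:1234/v1',
};

// Local servers generally do not check the key, but OpenAI-style clients
// often expect a non-empty string, so a placeholder is used by default.
function buildConfig(
  provider: Provider,
  model: string,
  apiKey: string = 'not-needed',
): AgentLlmConfig {
  return { model, baseURL: BASE_URLS[provider], apiKey, language: 'en-US' };
}

// 'YOUR_LOCAL_MODEL' is a placeholder for whatever model your server exposes.
const localConfig = buildConfig('ollama', 'YOUR_LOCAL_MODEL');
console.log(localConfig.baseURL);
```

The resulting object can then be passed to `new PageAgent(...)` in place of the inline literal.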
2. Simple Script Tag
For demos or quick tests without a build step, you can use the IIFE build from a CDN. This method allows you to get started with zero configuration.
<script src="https://cdn.jsdelivr.net/npm/page-agent@latest/dist/iife/page-agent.demo.js" crossorigin="true"></script>
3. Chrome Extension
An optional Chrome extension is available for more advanced use cases. It adds capabilities for multi-page and cross-tab navigation, which are not part of the core single-page library.
Where Page Agent Fits Best
Page Agent is designed for enhancing existing web applications with AI copilot features rather than for large-scale, server-side automation or scraping. It is well-suited for a variety of applications.
- SaaS AI Copilots: You can embed it into ERP, CRM, or admin panels to translate complex multi-step workflows into simple natural language commands.
- Accessibility Improvements: It can serve as the basis for a voice or natural language interface for any web application.
- Internal Tools: It can simplify processes on internal dashboards or legacy systems that require many clicks to perform a single task.
It is important to be aware of its limitations. The agent's reasoning capabilities are dependent on the quality of the connected LLM. Its core functionality is focused on single-page interactions, though an extension is available to manage multi-page tasks.
Build a Lost-Package Support Agent
This walkthrough demonstrates how to use Page Agent to build a support chatbot for a typical e-commerce store. The agent will handle a common customer issue: a lost package.
The full source code is available on GitHub.
Demo Setup: ACME Outfitters
The demonstration takes place on a fictional e-commerce site called ACME Outfitters. This site has standard features like a product catalog, a shopping cart, an order history page, and an account management section. The support chatbot we will build appears as a floating action button in the bottom-right corner of the screen, expanding into a chat window when clicked.

Initialize Page Agent and Hide the Default Panel
The core logic for our agent is encapsulated in an agent.ts file. Here, we instantiate Page Agent and configure it to work with our custom chat interface. We connect it to Google Gemini via an OpenAI-compatible endpoint. A small but important configuration detail is to ensure the baseURL includes a trailing slash, as some API providers require it.
A major step is disabling the default user interface that comes with the library. Page Agent includes a built-in panel for showing its plan and asking for user confirmation. Since we are creating our own chat UI, we call agent.panel.dispose() immediately after initialization to prevent the default panel from appearing.
// src/agent.ts
import { PageAgent } from 'page-agent';

// Module-level singleton so the whole app shares one agent instance.
let agent: PageAgent | null = null;

export function initializeAgent() {
  if (agent) {
    return agent;
  }
  agent = new PageAgent({
    model: 'gemini-1.5-pro-latest',
    // Gemini's OpenAI-compatible endpoint. The trailing slash is important for certain endpoints.
    baseURL: 'https://generativelanguage.googleapis.com/v1beta/openai/',
    // Placeholder only: a key embedded in client-side code is visible to every visitor.
    apiKey: 'YOUR_GEMINI_API_KEY',
    language: 'en-US',
  });
  // Hide the default Page Agent UI panel.
  agent.panel.dispose();
  return agent;
}

export function getAgent() {
  if (!agent) {
    throw new Error('Agent not initialized.');
  }
  return agent;
}
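Since a missing trailing slash is an easy mistake to make when the endpoint comes from an environment variable or a config file, it can help to normalize the value before handing it to the agent. The helper below is a minimal sketch; `withTrailingSlash` is a hypothetical name, not a Page Agent API, and the URL in the usage line is a placeholder.

```typescript
// Returns the URL unchanged if it already ends in "/", otherwise appends one.
// Surrounding whitespace is trimmed first, since config values often carry it.
function withTrailingSlash(baseURL: string): string {
  const trimmed = baseURL.trim();
  return trimmed.endsWith('/') ? trimmed : `${trimmed}/`;
}

// Placeholder endpoint for illustration only.
const normalized = withTrailingSlash('https://example.com/v1');
console.log(normalized);
```

The function is idempotent, so it is safe to apply even when the configured value is already correct.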
Build the Chat Interface Around Stateless Agent Calls
The chat interface is a React component named SupportChat.tsx. It manages the visual elements of the conversation: a floating button to open the chat, a message area with bubbles for the user and the bot, a typing indicator to show when the agent is working, and an input field with a send button.
One of the main considerations when building a conversational agent is memory. Each call to agent.execute(task) is stateless: the agent has no memory of previous interactions. To create a continuous conversation, we have to supply the history ourselves. Before sending a new user message to the agent, we prepend the last six messages of the conversation transcript. This gives the LLM enough context to understand the user's intent and respond appropriately.
// A simplified excerpt from src/components/SupportChat.tsx
// ... (imports and state management for messages, isLoading, etc.)
const handleSend = async () => {
  setIsLoading(true);
  try {
    const agent = getAgent();
    // Build a short transcript from the last six messages so the stateless agent has context.
    const history = messages.slice(-6).map(m => `${m.sender}: ${m.text}`).join('\n');
    const task = `${history}\n\nUser: ${currentUserInput}`;
    // Set the system prompt and execute the task.
    agent.setSystemInstruction(SYSTEM_PROMPT); // SYSTEM_PROMPT is defined elsewhere
    const response = await agent.execute(task);
    // Add agent's final message to the chat.
    setMessages(prev => [...prev, { sender: 'bot', text: response.result }]);
  } finally {
    // Clear the typing indicator even if the agent call throws.
    setIsLoading(false);
  }
};
// ... (JSX for the chat window)
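The history-building step is easier to test when it lives in a pure function rather than inside the React handler. The following is a sketch under that assumption; `ChatMessage` and `buildTask` are hypothetical names, and the "User"/"Agent" labels are one possible convention rather than anything the library requires.

```typescript
interface ChatMessage {
  sender: 'user' | 'bot';
  text: string;
}

// Builds the task string for a stateless agent call: the most recent
// messages as a transcript, followed by the new user input.
function buildTask(messages: ChatMessage[], input: string, maxMessages: number = 6): string {
  // slice(-0) would return the whole array, so guard the zero case explicitly.
  const recent = maxMessages > 0 ? messages.slice(-maxMessages) : [];
  const history = recent
    .map(m => `${m.sender === 'user' ? 'User' : 'Agent'}: ${m.text}`)
    .join('\n');
  const latest = `User: ${input.trim()}`;
  // Avoid leading blank lines when there is no history yet.
  return history ? `${history}\n\n${latest}` : latest;
}

const transcript: ChatMessage[] = [
  { sender: 'user', text: 'My package never arrived. Order ACM-10318.' },
  { sender: 'bot', text: 'I can offer store credit or a refund.' },
];

console.log(buildTask(transcript, 'Store credit please.'));
```

Capping the transcript keeps token use bounded, at the price of the agent forgetting anything older than the window.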
Write the Guardrails That Control the Agent
The agent's personality, rules, and workflow are defined in the system prompt. This text is sent to the LLM with every request and guides its behavior. A reliable and helpful support agent depends on a well-constructed prompt.
Below is the full system prompt used for the ACME Outfitters chatbot.
You are a friendly and helpful customer support agent for ACME Outfitters.
Your goal is to resolve customer issues by taking actions on their behalf directly on the website.
**Your Rules:**
1. **Communicate Naturally**: Talk like a human. Do not narrate your actions.
- BAD: "Okay, I will now navigate to the /orders page to check the status of order ACM-10318."
- GOOD: "Let me check on that order for you. One moment."
2. **Follow a Two-Step Compensation Flow**: When a user's package is lost (shipped but not delivered), you must first diagnose the problem and then offer compensation options. Do not take action until the user chooses an option.
- Step 1: Check the order status. Confirm it was shipped but not delivered.
- Step 2: Offer the user two choices: (a) instant store credit or (b) a refund to their original payment method that will take 5-10 business days.
- Step 3: Wait for the user to reply with their choice.
- Step 4: Execute the chosen action.
3. **Diagnose Before Offering**: Always explain *why* you are offering a solution. For a lost package, you should state that you see it was shipped but not delivered according to the tracking information.
4. **Confirm Sensitive Actions**: Never submit a payment, apply a discount, or issue a refund without first presenting the options to the user and getting their explicit confirmation. For refunds, you will use the built-in refund modal on the order details page.
5. **Be Concise**: Keep your answers short and to the point.
This prompt establishes several important guardrails.
- The first rule ensures the user experience feels like a conversation, not a technical log of an automation script.
- The two-step flow is the core of the business logic. It forces the agent to assess a situation, present choices, and wait for input before acting. This human-in-the-loop pattern is a good practice for actions with financial consequences.
- The diagnosis rule builds user trust by showing the agent has correctly understood the problem.
- The rule on sensitive actions is a safety measure that prevents the agent from making unapproved changes to a user's account.
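The wait-for-a-choice rule can also be mirrored in application code, so the chat UI has its own record of whether the customer actually picked an option. The sketch below is an illustration, not part of the demo's source: `Choice`, `parseChoice`, and `CompensationGate` are hypothetical names, and the keyword matching is deliberately conservative.

```typescript
type Choice = 'store_credit' | 'refund';

// Maps a free-text reply to one of the two offered options.
// Returns null when the reply names neither option or both.
function parseChoice(reply: string): Choice | null {
  const text = reply.toLowerCase();
  const wantsCredit = /\bstore\s+credit\b/.test(text);
  const wantsRefund = /\brefund\b/.test(text);
  if (wantsCredit === wantsRefund) return null; // ambiguous or off-topic
  return wantsCredit ? 'store_credit' : 'refund';
}

// Tracks the offer-then-choose sequence for one conversation.
class CompensationGate {
  private offered = false;
  private choice: Choice | null = null;

  // Call once the bot has presented both options to the customer.
  markOffered(): void {
    this.offered = true;
  }

  // Replies only count as a choice after the options were offered.
  recordReply(reply: string): Choice | null {
    if (!this.offered) return null;
    this.choice = parseChoice(reply);
    return this.choice;
  }

  // True only for the action the customer explicitly selected.
  canExecute(action: Choice): boolean {
    return this.offered && this.choice === action;
  }
}

const gate = new CompensationGate();
gate.markOffered();
gate.recordReply('Store credit please.');
console.log(gate.canExecute('store_credit'), gate.canExecute('refund'));
```

A gate like this runs in the browser, so it complements the prompt but is still no substitute for checks on the server.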
Run the Full Lost-Package Workflow
We can now walk through a complete lost-package scenario to see how the UI, the agent, and the prompt work together.
1. Customer Initiates Chat. The user, having not received their order, opens the chat widget and types their complaint.
Customer: "My package never arrived. Order ACM-10318."
2. Bot Diagnoses and Responds. The agent receives the message, prepended with the (empty) chat history. Guided by the system prompt, it first navigates to the /orders page, finds order ACM-10318, and inspects its status. It sees the order was shipped five days ago but the tracking information does not confirm delivery. Following the two-step flow, it presents the compensation options.
Bot: "I'm sorry to hear about that. I see that order ACM-10318 shipped 5 days ago, but the tracking shows it hasn't been delivered. I can help you with that. I can offer you two options:
- An instant store credit for the full amount of $189.00.
- A refund to your original payment method, which should appear in 5–10 business days."

3. Customer Makes a Choice. The user prefers the immediate resolution.
Customer: "Store credit please."
4. Bot Takes Action. This message triggers the agent to execute the second part of its instructed flow. It navigates back to the order details page for ACM-10318 and opens the refund modal. Inside the modal, it selects the "Store Credit" radio button, types "Package lost in transit" into the reason field, and submits the form. Page Agent can overlay a mask on the screen to show what elements it is interacting with, giving the user visibility into its actions.

5. Bot Confirms Completion. Once the action is complete, the agent provides a final confirmation message to the user.
Bot: "Done - $189.00 in store credit has been added to your account."
6. Verification. The user can now verify the action. Navigating to their account page shows a new balance of $189.00 in store credit. The order history for ACM-10318 also shows a note indicating that a credit was issued. The task is complete.
Production Trade-Offs and Limitations
While Page Agent is a capable tool, its client-side, text-first architecture involves specific trade-offs. Recognizing these limitations is important for planning a production deployment and setting correct expectations for its performance and scope.
Here are six points for developers to consider.
1. Constrained by the Document Object Model (DOM). The agent operates exclusively through the user interface, meaning it can only perform actions that a human user can. It interacts with the rendered DOM and cannot directly call backend APIs or access functions not exposed through UI elements. For example, before instructing the agent to process a refund, a corresponding modal or form must already exist and be accessible in the UI.
2. Sensitivity to UI Implementation. The agent's effectiveness is tied to the structure of the webpage's HTML. It performs best with clear, semantic code, and its ability to interact with elements can be diminished in certain scenarios.
- Complex or Non-Standard DOM: UIs with heavily nested shadow DOM can obscure elements from the agent's context.
- Virtualized Lists: Long lists that only render a subset of items to the DOM at any time will prevent the agent from interacting with off-screen elements.
- Canvas-Based Interfaces: Interfaces built entirely within a <canvas> element are not suitable, as the agent relies on parsing HTML elements, not pixels.
3. Performance and Cost Considerations. Executing a task involves multiple round-trips to an LLM for reasoning and action planning, which introduces two main considerations for developers. The first is latency, as a multi-step workflow can take several seconds to complete. This makes the agent unsuitable for tasks requiring immediate, real-time interaction. The second is token consumption. Each step sends a sizeable context to the LLM, including the system prompt, the dehydrated DOM, and conversational history. A single complex task can use thousands of tokens, making it important to model API costs before a large-scale deployment.
4. Lack of Built-in State Management. Each execute() call is treated as a new, independent task. The core library does not include built-in memory for multi-turn conversations or sequential operations. As a developer, you are responsible for managing the application state and providing the necessary conversational history with each request. This adds a layer of complexity to your integration code.
5. Reliability of Prompt-Based Guardrails. The agent's safety mechanisms, such as the instruction never to issue a refund without explicit confirmation, are defined within its system prompt. These text-based instructions are guidance, not enforced rules: the LLM can misinterpret or ignore them. For this reason, prompt-based safety measures should not replace server-side validation for sensitive or destructive actions.
6. Single-Page Focus in the Core Library. The main page-agent package is designed primarily for single-page applications and interactions. While it excels at complex workflows within a single view, it cannot navigate across different tabs or browser windows on its own. Multi-page automation requires the optional Page Agent Chrome Extension, which needs separate installation and user permissions.
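To make the token-consumption point from item 3 concrete, here is a back-of-the-envelope estimator. Every number in it is an illustrative assumption, not a measurement of Page Agent or a real provider's price list, and `TaskProfile`, `Pricing`, and `estimateTask` are hypothetical names. It assumes each step resends the system prompt, a DOM snapshot, and the conversation history.

```typescript
interface TaskProfile {
  steps: number;               // LLM round-trips for one task
  systemPromptTokens: number;
  domTokensPerStep: number;    // size of the dehydrated DOM sent each step
  historyTokens: number;
  outputTokensPerStep: number;
}

interface Pricing {
  inputPerMillion: number;     // currency units per 1M input tokens
  outputPerMillion: number;    // currency units per 1M output tokens
}

interface Estimate {
  inputTokens: number;
  outputTokens: number;
  cost: number;
}

function estimateTask(p: TaskProfile, price: Pricing): Estimate {
  // Input context is resent in full on every step in this simple model.
  const inputTokens = p.steps * (p.systemPromptTokens + p.domTokensPerStep + p.historyTokens);
  const outputTokens = p.steps * p.outputTokensPerStep;
  const cost =
    (inputTokens / 1_000_000) * price.inputPerMillion +
    (outputTokens / 1_000_000) * price.outputPerMillion;
  return { inputTokens, outputTokens, cost };
}

// Illustrative values only: a six-step workflow with placeholder prices.
const estimate = estimateTask(
  { steps: 6, systemPromptTokens: 400, domTokensPerStep: 3000, historyTokens: 200, outputTokensPerStep: 150 },
  { inputPerMillion: 1.0, outputPerMillion: 4.0 },
);
console.log(estimate.inputTokens, estimate.outputTokens); // prints 21600 900
```

Multiplying the per-task cost by expected daily conversations gives a first-order budget. The DOM term usually dominates, which is why snapshot size matters more than prompt length.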
Conclusion: Agents Need a Deployment Layer
Page Agent is valuable because it proves a simple point: a web agent is much more useful when it can work with the real interface, the real DOM, and the user's actual session. That is a better mental model than treating every web workflow as a remote browser recording, a brittle script, or a backend-only API integration.
But Page Agent also exposes the production problem. If the agent must be inserted into the application, then you either need source-code access, vendor cooperation, or a browser extension. Agent-browser products run into a related limitation: they can automate inside their own controlled environment, but they are often separated from the user's live session, support context, audit trail, masking rules, and human handoff.
Webfuse's angle is that agents need a deployment layer, not just a smarter browser. A Webfuse session can wrap an existing web application, inject extension logic, add collaboration tools, connect automation, and expose the live session through a shareable link. The original app does not need to be rewritten, and the end user does not need to install a browser extension.
So the practical split is clear: use Page Agent when you own the app and want an embedded client-side copilot. Use Webfuse when the harder problem is bringing agents into applications you do not own, inside real user sessions, with centralized control over the experience. For production AI agents, that deployment surface is often the difference between an impressive demo and something a team can actually roll out.
