Commit 588191d

update docs
1 parent c8f87d6 commit 588191d

3 files changed: +254 -8 lines changed

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@ This is a browser automation agent built on the Kernel platform that uses browse
 - `browser-use`: Web automation library providing Agent and BrowserSession
 - `kernel`: Platform for running the browser agent service
 - `zenbase-llml`: LLM templating used in task construction
-- Environment: Python 3.13, uses `uv` for dependency management, `just` for task running
+- Environment: Python 3.11, uses `uv` for dependency management, `just` for task running
 
 ### Environment Variables
-Requires `AI_GATEWAY_URL` and `AI_GATEWAY_TOKEN` for LLM provider routing through AI gateway.
+Requires `AI_GATEWAY_URL` and `AI_GATEWAY_TOKEN` for LLM provider routing through AI gateway.

README.md

Lines changed: 243 additions & 2 deletions
@@ -1,3 +1,244 @@
-# Notes
+# Browser Agent
 
-Entrypoint (main.py) must be in the root directory
+An AI-powered browser automation microservice built on the Kernel platform that uses browser-use for intelligent web browsing tasks.
+
+## Overview
+
+The browser-agent microservice provides AI-powered browser automation capabilities, allowing you to control browsers using natural language instructions. It supports multiple LLM providers (Anthropic Claude, OpenAI GPT, Google Gemini) and can handle complex multi-step web tasks including data extraction, form filling, file downloads, and CAPTCHA solving.
+
+## Features
+
+- **AI-powered browser automation**: Uses LLMs to intelligently control browsers and perform complex web tasks
+- **Multi-step task execution**: Decomposes complex requests into sub-tasks and executes them sequentially
+- **Multi-provider LLM support**: Works with Anthropic Claude, OpenAI GPT, and Google Gemini
+- **File handling**: Automatically downloads PDFs and other files, uploads them to cloud storage
+- **CAPTCHA solving**: Built-in capability to handle CAPTCHAs and similar challenges
+- **Session management**: Creates isolated browser sessions with proper cleanup
+- **Trajectory tracking**: Records and stores complete execution history for analysis
+- **Cloudflare AI Gateway integration**: Unified LLM provider routing
+
+## Quick Start
+
+### Prerequisites
+
+- Python 3.11+
+- `uv` package manager
+- `just` task runner
+- Node.js with `bun` (for deployment tools)
+
+### Installation
+
+```bash
+# Install dependencies
+uv sync
+
+# Install development dependencies
+uv sync --group dev
+```
+
+### Environment Setup
+
+Create a `.env` file with the required environment variables:
+
+```bash
+# AI Gateway Configuration (required)
+AI_GATEWAY_URL="https://gateway.ai.cloudflare.com/v1/{account_id}/ai-gateway"
+AI_GATEWAY_TOKEN="your-gateway-token"
+
+# Kernel Platform (required)
+KERNEL_API_KEY="sk_xxxxx"
+
+# Cloudflare R2 Storage for file uploads (required)
+R2_S3_BUCKET="browser-agent"
+R2_S3_ACCESS_KEY_ID="your-access-key"
+R2_S3_ENDPOINT_URL="https://{account_id}.r2.cloudflarestorage.com"
+R2_S3_SECRET_ACCESS_KEY="your-secret-key"
+
+# Optional
+BROWSER_USE_LOGGING_LEVEL="debug"
+ANONYMIZED_TELEMETRY="false"
+```
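
A missing variable from the `.env` listing above usually surfaces only when the first request fails, so a startup check can save some debugging. The following is a minimal sketch, not code from this commit: `REQUIRED_VARS` mirrors the variables marked "required" in the listing, and `missing_env_vars` is a hypothetical helper name.

```python
import os

# Names taken from the "required" entries in the .env example.
REQUIRED_VARS = (
    "AI_GATEWAY_URL",
    "AI_GATEWAY_TOKEN",
    "KERNEL_API_KEY",
    "R2_S3_BUCKET",
    "R2_S3_ACCESS_KEY_ID",
    "R2_S3_ENDPOINT_URL",
    "R2_S3_SECRET_ACCESS_KEY",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to exercise in isolation.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` at startup and aborting when the list is non-empty would report every missing name at once rather than one per failed request.
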
+
+### Local Development
+
+```bash
+# Run local development server
+just dev
+
+# Format and lint code
+just fmt
+```
+
+### Production
+
+```bash
+just deploy
+
+just logs
+```
+
+## API Reference
+
+### Endpoint
+
+`POST /apps/browser-agent/actions/perform`
+
+### Request Format
+
+```json
+{
+  "input": "Task description for the browser agent",
+  "provider": "anthropic|gemini|openai",
+  "model": "claude-4-sonnet|gpt-4.1|gemini-2.5-pro",
+  "api_key": "your-llm-api-key",
+  "instructions": "Optional additional instructions",
+  "stealth": true,
+  "headless": false,
+  "browser_timeout": 60,
+  "max_steps": 100,
+  "reasoning": true,
+  "flash": false
+}
+```
+
+### Request Parameters
+
+- `input` (required): Natural language description of the task to perform
+- `provider` (required): LLM provider (`"anthropic"`, `"gemini"`, or `"openai"`)
+- `model` (required): Specific model to use (e.g., `"claude-3-sonnet-20240229"`)
+- `api_key` (required): API key for the LLM provider
+- `instructions` (optional): Additional context or constraints for the task
+- `stealth` (optional): Enable stealth mode to avoid detection (default: `true`)
+- `headless` (optional): Run browser in headless mode (default: `false`)
+- `browser_timeout` (optional): Browser session shutdown timeout in seconds (default: 60)
+- `max_steps` (optional): Maximum number of automation steps (default: 100)
+- `reasoning` (optional): Enable step-by-step reasoning (default: `true`)
+- `flash` (optional): Use faster execution mode (default: `false`)
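
The parameter list above maps naturally onto a typed request object on the client side. The sketch below is an illustration only: the service itself uses a pydantic `BrowserAgentRequest` model whose code is not shown in this commit, so `PerformRequest` and `to_payload` are hypothetical names built on the standard library, with defaults copied from the list.

```python
from dataclasses import asdict, dataclass
from typing import Optional

PROVIDERS = {"anthropic", "gemini", "openai"}


@dataclass
class PerformRequest:
    """Hypothetical client-side mirror of the documented request body."""

    input: str
    provider: str
    model: str
    api_key: str
    instructions: Optional[str] = None
    stealth: bool = True
    headless: bool = False
    browser_timeout: int = 60
    max_steps: int = 100
    reasoning: bool = True
    flash: bool = False

    def __post_init__(self):
        # Reject providers the documentation does not list.
        if self.provider not in PROVIDERS:
            raise ValueError(f"unsupported provider: {self.provider!r}")

    def to_payload(self) -> dict:
        """Return a JSON-ready dict, dropping optional fields left unset."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Only the four required fields need to be supplied; everything else falls back to the documented defaults.
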
+
+### Response Format
+
+```json
+{
+  "session": "browser-session-id",
+  "success": true,
+  "duration": 45.2,
+  "result": "Task completion summary",
+  "downloads": {
+    "filename.pdf": "https://presigned-url",
+    "data.csv": "https://presigned-url"
+  }
+}
+```
+
+### Response Fields
+
+- `session`: Unique browser session identifier
+- `success`: Whether the task completed successfully
+- `duration`: Execution time in seconds
+- `result`: Summary of what was accomplished
+- `downloads`: Dictionary of downloaded files with presigned URLs
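
A caller will typically want these fields as a typed value rather than a raw dictionary. The sketch below assumes the response arrives as a JSON string shaped like the example above; `PerformResponse` and `parse_response` are hypothetical names, and treating a missing or null `downloads` as an empty mapping is an assumption, not documented behavior.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PerformResponse:
    """Hypothetical client-side view of the documented response fields."""

    session: str
    success: bool
    duration: float
    result: str
    downloads: dict = field(default_factory=dict)


def parse_response(body: str) -> PerformResponse:
    """Decode a JSON response body into a PerformResponse."""
    data = json.loads(body)
    return PerformResponse(
        session=data["session"],
        success=bool(data["success"]),
        duration=float(data["duration"]),
        result=data["result"],
        # Assumption: tasks without downloads may omit the key or send null.
        downloads=dict(data.get("downloads") or {}),
    )
```

Each value in `downloads` is a presigned URL, so it can be fetched directly for as long as the link remains valid.
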
+
+## Examples
+
+### Basic Web Scraping
+
+```json
+{
+  "input": "Go to example.com and extract all the text content from the main article",
+  "provider": "anthropic",
+  "model": "claude-4-sonnet",
+  "api_key": "sk-ant-xxxxx",
+  "headless": true,
+  "max_steps": 50
+}
+```
+
+### Complex Task with File Download
+
+```json
+{
+  "input": "Search for Python tutorials on Google and download the first PDF result",
+  "instructions": "Make sure to verify the PDF is relevant before downloading",
+  "provider": "openai",
+  "model": "gpt-4.1",
+  "api_key": "sk-xxxxx",
+  "headless": false,
+  "reasoning": true
+}
+```
+
+### Form Filling
+
+```json
+{
+  "input": "Fill out the contact form on example.com with name 'John Doe', email '[email protected]', and message 'Hello world'",
+  "provider": "gemini",
+  "model": "gemini-2.5-pro",
+  "api_key": "your-gemini-key",
+  "stealth": true
+}
+```
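
Any of the JSON bodies above can be sent to the documented `POST /apps/browser-agent/actions/perform` endpoint. The sketch below only constructs the request with the standard library and does not send it. The base URL is a placeholder, and whatever authentication the Kernel platform requires is not described in this commit, so none is included here.

```python
import json
import urllib.request

# Path as given in the API Reference section.
ACTION_PATH = "/apps/browser-agent/actions/perform"


def build_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the perform action.

    `base_url` is a placeholder for wherever the service is reachable.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + ACTION_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call. Browser tasks can run for a while, so a generous `timeout` argument is probably warranted.
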
+
+## Deployment
+
+### Development Commands
+
+```bash
+just fmt   # Format and lint code with ruff
+just dev   # Run local development server
+just logs  # View browser-agent logs
+```
+
+### Production Deployment
+
+```bash
+just deploy  # Deploy to Kernel platform
+```
+
+The deployment process:
+1. Runs formatting and linting checks
+2. Deploys `src/app.py` to the Kernel platform
+3. Service becomes available at the configured Kernel endpoint
+
+## Architecture
+
+### Core Components
+
+- **`src/app.py`**: Main Kernel app with `browser-agent` action. Creates browsers via kernel, instantiates Agent with custom session, runs tasks and returns trajectory results.
+- **`src/lib/browser/session.py`**: CustomBrowserSession that extends browser-use's BrowserSession, fixing viewport handling for CDP connections and setting fixed 1024x786 resolution.
+- **`src/lib/browser/models.py`**: BrowserAgentRequest model handling LLM provider abstraction (anthropic, gemini, openai) with AI gateway integration.
+- **`src/lib/gateway.py`**: AI gateway configuration from environment variables.
+
+### Key Dependencies
+
+- `browser-use>=0.7.2` - Web automation library providing Agent and BrowserSession
+- `kernel>=0.11.0` - Platform for running the browser agent service
+- `zenbase-llml>=0.4.0` - LLM templating used in task construction
+- `pydantic>=2.10.6` - Data validation and serialization
+- `boto3>=1.40.25` - AWS S3/R2 integration for file storage
+
+### Architecture Flow
+
+1. Request received via Kernel platform
+2. LLM client created based on provider/model through AI Gateway
+3. Remote browser session established with custom configuration
+4. browser-use Agent instantiated with reasoning capabilities
+5. Task executed with intelligent planning and step-by-step execution
+6. Files automatically uploaded to Cloudflare R2 storage
+7. Trajectory and results returned with download links
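
The seven steps above suggest a control flow in which the browser session is released whether or not the agent succeeds. The sketch below shows only that shape and does not reproduce `src/app.py`: each stage is a stand-in that records its name, `run_task` is a hypothetical function, and the `simulate_failure` key exists purely to exercise the error path.

```python
import asyncio
import time


async def run_task(request: dict, steps: list) -> dict:
    """Walk the documented stages with stand-ins, recording order in `steps`."""
    started = time.monotonic()
    steps.append("create_llm")        # step 2: LLM client via the AI gateway
    steps.append("open_browser")      # step 3: remote browser session
    try:
        steps.append("create_agent")  # step 4: agent bound to LLM and session
        await asyncio.sleep(0)        # stand-in for the agent's actual work
        if request.get("simulate_failure"):
            raise RuntimeError("agent failed")
        steps.append("run_agent")     # step 5: execute the task
        steps.append("upload_files")  # step 6: push downloads to storage
        success, result = True, "done"
    except RuntimeError as exc:
        success, result = False, str(exc)
    finally:
        steps.append("close_browser")  # always release the session
    # step 7: response shaped like the documented response format
    return {
        "success": success,
        "result": result,
        "duration": time.monotonic() - started,
        "downloads": {},
    }
```

Putting the cleanup in `finally` means a failed run still closes the browser, which matches the "isolated browser sessions with proper cleanup" feature listed earlier.
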
+
+## Troubleshooting
+
+### Common Issues
+
+- **Environment variables**: Ensure all required environment variables are set
+- **Browser timeout**: Increase `browser_timeout` for complex tasks
+- **File downloads**: Check R2 bucket permissions and configuration
+- **LLM provider errors**: Verify API keys and model availability
+- **Deployment issues**: Ensure that the main entrypoint is in the root of the directory
+
+## Contributing
+
+1. Format code: `just fmt`
+2. Test changes locally: `just dev`
+3. Deploy to staging: `just deploy`

lib/patch.py

Lines changed: 9 additions & 4 deletions
@@ -24,15 +24,20 @@ class PatchedDownloadsWatchdog(DownloadsWatchdog):
     LISTENS_TO = DownloadsWatchdog.LISTENS_TO + [BrowserStateRequestEvent]
 
     async def on_BrowserStateRequestEvent(self, event: BrowserStateRequestEvent):
+        cdp_session = self.browser_session.agent_focus
+        if not cdp_session:
+            return
+
         page_url = await self.browser_session.get_current_page_url()
-        target_id = self.browser_session.agent_focus.target_id
+        if not page_url:
+            return
 
         # Mock NavigationCompleteEvent
-        await self.event_bus.dispatch(
+        self.event_bus.dispatch(
             NavigationCompleteEvent(
+                event_parent_id=event.event_id,
                 event_type="NavigationCompleteEvent",
-                target_id=target_id,
+                target_id=cdp_session.target_id,
                 url=page_url,
-                event_parent_id=event.event_id,
             )
         )
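
The hunk above adds two early returns before the mock event is built. Before the change, `self.browser_session.agent_focus.target_id` would raise `AttributeError` whenever `agent_focus` was `None`. The sketch below reproduces only that guard pattern with stand-in classes: `FakeFocus`, `FakeSession`, and `on_state_request` are hypothetical and are not part of browser-use or this repository.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeFocus:
    target_id: str


class FakeSession:
    """Stand-in for a browser session; not the browser-use class."""

    def __init__(self, focus: Optional[FakeFocus], url: Optional[str]):
        self.agent_focus = focus
        self._url = url

    async def get_current_page_url(self) -> Optional[str]:
        return self._url


async def on_state_request(session: FakeSession, dispatched: list) -> None:
    """Record a mock navigation event only when a target and a URL exist."""
    focus = session.agent_focus
    if not focus:   # no focused target yet: nothing to report
        return
    url = await session.get_current_page_url()
    if not url:     # page has no URL yet: skip the mock event
        return
    dispatched.append({"target_id": focus.target_id, "url": url})
```

With both guards in place the handler becomes a no-op during the window before a page is focused, instead of failing the state request.
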
