Commit d3c00aa

update docs
1 parent c8f87d6 commit d3c00aa

3 files changed: +252 -8 lines changed

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@ This is a browser automation agent built on the Kernel platform that uses browse
 - `browser-use`: Web automation library providing Agent and BrowserSession
 - `kernel`: Platform for running the browser agent service
 - `zenbase-llml`: LLM templating used in task construction
-- Environment: Python 3.13, uses `uv` for dependency management, `just` for task running
+- Environment: Python 3.11, uses `uv` for dependency management, `just` for task running
 
 ### Environment Variables
-Requires `AI_GATEWAY_URL` and `AI_GATEWAY_TOKEN` for LLM provider routing through AI gateway.
+Requires `AI_GATEWAY_URL` and `AI_GATEWAY_TOKEN` for LLM provider routing through AI gateway.

README.md

Lines changed: 241 additions & 2 deletions
@@ -1,3 +1,242 @@
-# Notes
+# Browser Agent
 
-Entrypoint (main.py) must be in the root directory
+An AI-powered browser automation microservice built on the Kernel platform that uses browser-use for intelligent web browsing tasks.
+
+## Overview
+
+The browser-agent microservice provides AI-powered browser automation capabilities, allowing you to control browsers using natural language instructions. It supports multiple LLM providers (Anthropic Claude, OpenAI GPT, Google Gemini) and can handle complex multi-step web tasks including data extraction, form filling, file downloads, and CAPTCHA solving.
+
+## Features
+
+- **AI-powered browser automation**: Uses LLMs to intelligently control browsers and perform complex web tasks
+- **Multi-step task execution**: Decomposes complex requests into sub-tasks and executes them sequentially
+- **Multi-provider LLM support**: Works with Anthropic Claude, OpenAI GPT, and Google Gemini
+- **File handling**: Automatically downloads PDFs and other files, uploads them to cloud storage
+- **CAPTCHA solving**: Built-in capability to handle CAPTCHAs and similar challenges
+- **Session management**: Creates isolated browser sessions with proper cleanup
+- **Trajectory tracking**: Records and stores complete execution history for analysis
+- **Cloudflare AI Gateway integration**: Unified LLM provider routing
+
+## Quick Start
+
+### Prerequisites
+
+- Python 3.11+
+- `uv` package manager
+- `just` task runner
+- Node.js with `bun` (for deployment tools)
+
+### Installation
+
+```bash
+# Install dependencies
+uv sync
+
+# Install development dependencies
+uv sync --group dev
+```
+
+### Environment Setup
+
+Create a `.env` file with the required environment variables:
+
+```bash
+# AI Gateway Configuration (required)
+AI_GATEWAY_URL="https://gateway.ai.cloudflare.com/v1/{account_id}/ai-gateway"
+AI_GATEWAY_TOKEN="your-gateway-token"
+
+# Kernel Platform (required)
+KERNEL_API_KEY="sk_xxxxx"
+
+# Cloudflare R2 Storage for file uploads (required)
+R2_S3_BUCKET="browser-agent"
+R2_S3_ACCESS_KEY_ID="your-access-key"
+R2_S3_ENDPOINT_URL="https://{account_id}.r2.cloudflarestorage.com"
+R2_S3_SECRET_ACCESS_KEY="your-secret-key"
+
+# Optional
+BROWSER_USE_LOGGING_LEVEL="debug"
+ANONYMIZED_TELEMETRY="false"
+```
+
+### Local Development
+
+```bash
+# Run local development server
+just dev
+
+# Format and lint code
+just fmt
+
+# View logs
+just logs
+```
+
+## API Reference
+
+### Endpoint
+
+`POST /apps/browser-agent/actions/perform`
+
+### Request Format
+
+```json
+{
+  "input": "Task description for the browser agent",
+  "provider": "anthropic|gemini|openai",
+  "model": "claude-4-sonnet|gpt-4.1|gemini-2.5-pro",
+  "api_key": "your-llm-api-key",
+  "instructions": "Optional additional instructions",
+  "stealth": true,
+  "headless": false,
+  "browser_timeout": 60,
+  "max_steps": 100,
+  "reasoning": true,
+  "flash": false
+}
+```
+
+### Request Parameters
+
+- `input` (required): Natural language description of the task to perform
+- `provider` (required): LLM provider (`"anthropic"`, `"gemini"`, or `"openai"`)
+- `model` (required): Specific model to use (e.g., `"claude-3-sonnet-20240229"`)
+- `api_key` (required): API key for the LLM provider
+- `instructions` (optional): Additional context or constraints for the task
+- `stealth` (optional): Enable stealth mode to avoid detection (default: `true`)
+- `headless` (optional): Run browser in headless mode (default: `false`)
+- `browser_timeout` (optional): Browser session shutdown timeout in seconds (default: 60)
+- `max_steps` (optional): Maximum number of automation steps (default: 100)
+- `reasoning` (optional): Enable step-by-step reasoning (default: `true`)
+- `flash` (optional): Use faster execution mode (default: `false`)
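
Note on the parameter list above: a client could assemble the request body as in the minimal sketch below. The field names, allowed providers, and defaults are taken from the documented list; the function name `build_perform_payload` is hypothetical. The service host is not stated in this diff, so the sketch stops at building the payload and does not send it.

```python
import json

# Providers and optional-field defaults as documented in the parameter list.
ALLOWED_PROVIDERS = {"anthropic", "gemini", "openai"}
DEFAULTS = {
    "stealth": True,
    "headless": False,
    "browser_timeout": 60,
    "max_steps": 100,
    "reasoning": True,
    "flash": False,
}


def build_perform_payload(task, provider, model, api_key, instructions=None, **overrides):
    """Build the JSON-serializable body for the perform action (hypothetical helper)."""
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    payload = {"input": task, "provider": provider, "model": model, "api_key": api_key}
    if instructions is not None:
        payload["instructions"] = instructions
    # Documented defaults first, then any caller-supplied overrides.
    payload.update({**DEFAULTS, **overrides})
    return payload


body = json.dumps(
    build_perform_payload(
        "Go to example.com and extract all the text content from the main article",
        provider="anthropic",
        model="claude-4-sonnet",
        api_key="sk-ant-xxxxx",
        headless=True,
        max_steps=50,
    )
)
```

The resulting `body` string is what would be sent in a `POST` to `/apps/browser-agent/actions/perform`.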
+
+### Response Format
+
+```json
+{
+  "session": "browser-session-id",
+  "success": true,
+  "duration": 45.2,
+  "result": "Task completion summary",
+  "downloads": {
+    "filename.pdf": "https://presigned-url",
+    "data.csv": "https://presigned-url"
+  }
+}
+```
+
+### Response Fields
+
+- `session`: Unique browser session identifier
+- `success`: Whether the task completed successfully
+- `duration`: Execution time in seconds
+- `result`: Summary of what was accomplished
+- `downloads`: Dictionary of downloaded files with presigned URLs
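
Note on the response fields above: a caller might map the JSON response onto a typed object as in the minimal sketch below. The field names follow the documented response; the `PerformResult` class and `parse_perform_response` function are hypothetical and are not the service's own models.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PerformResult:
    """Client-side view of the documented response (hypothetical type)."""

    session: str
    success: bool
    duration: float  # seconds
    result: str
    downloads: dict[str, str] = field(default_factory=dict)  # filename -> presigned URL


def parse_perform_response(body: str) -> PerformResult:
    """Parse a response body; treats a missing or null `downloads` as empty."""
    data = json.loads(body)
    return PerformResult(
        session=data["session"],
        success=bool(data["success"]),
        duration=float(data["duration"]),
        result=data["result"],
        downloads=dict(data.get("downloads") or {}),
    )
```

Treating an absent `downloads` field as an empty dictionary is an assumption; the diff does not say whether the field is always present.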
+
+## Examples
+
+### Basic Web Scraping
+
+```json
+{
+  "input": "Go to example.com and extract all the text content from the main article",
+  "provider": "anthropic",
+  "model": "claude-4-sonnet",
+  "api_key": "sk-ant-xxxxx",
+  "headless": true,
+  "max_steps": 50
+}
+```
+
+### Complex Task with File Download
+
+```json
+{
+  "input": "Search for Python tutorials on Google and download the first PDF result",
+  "instructions": "Make sure to verify the PDF is relevant before downloading",
+  "provider": "openai",
+  "model": "gpt-4.1",
+  "api_key": "sk-xxxxx",
+  "headless": false,
+  "reasoning": true
+}
+```
+
+### Form Filling
+
+```json
+{
+  "input": "Fill out the contact form on example.com with name 'John Doe', email '[email protected]', and message 'Hello world'",
+  "provider": "gemini",
+  "model": "gemini-2.5-pro",
+  "api_key": "your-gemini-key",
+  "stealth": true
+}
+```
+
+## Deployment
+
+### Development Commands
+
+```bash
+just fmt   # Format and lint code with ruff
+just dev   # Run local development server
+just logs  # View browser-agent logs
+```
+
+### Production Deployment
+
+```bash
+just deploy  # Deploy to Kernel platform
+```
+
+The deployment process:
+1. Runs formatting and linting checks
+2. Deploys `src/app.py` to the Kernel platform
+3. Service becomes available at the configured Kernel endpoint
+
+## Architecture
+
+### Core Components
+
+- **`src/app.py`**: Main Kernel app with `browser-agent` action. Creates browsers via kernel, instantiates Agent with custom session, runs tasks and returns trajectory results.
+- **`src/lib/browser/session.py`**: CustomBrowserSession that extends browser-use's BrowserSession, fixing viewport handling for CDP connections and setting fixed 1024x786 resolution.
+- **`src/lib/browser/models.py`**: BrowserAgentRequest model handling LLM provider abstraction (anthropic, gemini, openai) with AI gateway integration.
+- **`src/lib/gateway.py`**: AI gateway configuration from environment variables.
+
+### Key Dependencies
+
+- `browser-use>=0.7.2` - Web automation library providing Agent and BrowserSession
+- `kernel>=0.11.0` - Platform for running the browser agent service
+- `zenbase-llml>=0.4.0` - LLM templating used in task construction
+- `pydantic>=2.10.6` - Data validation and serialization
+- `boto3>=1.40.25` - AWS S3/R2 integration for file storage
+
+### Architecture Flow
+
+1. Request received via Kernel platform
+2. LLM client created based on provider/model through AI Gateway
+3. Remote browser session established with custom configuration
+4. browser-use Agent instantiated with reasoning capabilities
+5. Task executed with intelligent planning and step-by-step execution
+6. Files automatically uploaded to Cloudflare R2 storage
+7. Trajectory and results returned with download links
+
+## Troubleshooting
+
+### Common Issues
+
+- **Environment variables**: Ensure all required environment variables are set
+- **Browser timeout**: Increase `browser_timeout` for complex tasks
+- **File downloads**: Check R2 bucket permissions and configuration
+- **LLM provider errors**: Verify API keys and model availability
+
+### Logs
+
+Use `just logs` to view real-time service logs for debugging.
+
+## Contributing
+
+1. Format code: `just fmt`
+2. Test changes locally: `just dev`
+3. Deploy to staging: `just deploy`

lib/patch.py

Lines changed: 9 additions & 4 deletions
@@ -24,15 +24,20 @@ class PatchedDownloadsWatchdog(DownloadsWatchdog):
     LISTENS_TO = DownloadsWatchdog.LISTENS_TO + [BrowserStateRequestEvent]
 
     async def on_BrowserStateRequestEvent(self, event: BrowserStateRequestEvent):
+        cdp_session = self.browser_session.agent_focus
+        if not cdp_session:
+            return
+
         page_url = await self.browser_session.get_current_page_url()
-        target_id = self.browser_session.agent_focus.target_id
+        if not page_url:
+            return
 
         # Mock NavigationCompleteEvent
-        await self.event_bus.dispatch(
+        self.event_bus.dispatch(
             NavigationCompleteEvent(
+                event_parent_id=event.event_id,
                 event_type="NavigationCompleteEvent",
-                target_id=target_id,
+                target_id=cdp_session.target_id,
                 url=page_url,
-                event_parent_id=event.event_id,
             )
         )
