docs: add VIEWPORT_SIZE config and expand LLM provider documentation
- Add VIEWPORT_SIZE environment variable configuration in .env.example
- Document support for Azure OpenAI, Groq, and Ollama providers
- Update README with comprehensive examples for all LLM providers
- Clarify default viewport size (1440x900) and configuration options
- Improve .env.example structure with clearer option groupings
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
README.md (+65 −9 lines)
@@ -4,13 +4,13 @@ An AI-powered browser automation microservice built on the Kernel platform that
## Overview
The browser-agent microservice provides AI-powered browser automation capabilities, allowing you to control browsers using natural language instructions. It supports multiple LLM providers (Anthropic Claude, OpenAI GPT, Google Gemini, Azure OpenAI, Groq, and Ollama) and can handle complex multi-step web tasks including data extraction, form filling, file downloads, and CAPTCHA solving.
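In practice, a request to the `browser-agent` action pairs a natural-language instruction with a provider selection. A minimal illustrative payload (field names taken from the provider examples later in this README; the specific model value is only an example):

```json
{
  "input": "Go to example.com and extract the page title",
  "provider": "openai",
  "model": "gpt-4o",
  "api_key": "your-openai-key"
}
```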
## Features
- **AI-powered browser automation**: Uses LLMs to intelligently control browsers and perform complex web tasks
- **Multi-step task execution**: Decomposes complex requests into sub-tasks and executes them sequentially
- **Multi-provider LLM support**: Works with Anthropic Claude, OpenAI GPT, Google Gemini, Azure OpenAI, Groq, and Ollama
- **File handling**: Automatically downloads PDFs and other files, uploads them to cloud storage
- **CAPTCHA solving**: Built-in capability to handle CAPTCHAs and similar challenges
- **Session management**: Creates isolated browser sessions with proper cleanup
@@ -38,8 +38,8 @@ Edit your `.env` file with the required values:
```bash
# LLM Provider Configuration
# Option 1: Direct API access (no gateway) - providers use default endpoints
# Nothing required here - providers will use their default API endpoints!
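
# Optional: override the browser viewport via VIEWPORT_SIZE
# (this commit documents a 1440x900 default; the WIDTHxHEIGHT
# value format shown below is an assumption, not confirmed by this diff)
VIEWPORT_SIZE=1440x900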
"input": "Fill out the contact form on example.com with name 'John Doe', email '[email protected]', and message 'Hello world'",
166
186
"provider": "gemini",
167
-
"model": "gemini-2.5-pro",
187
+
"model": "gemini-2.0-flash-exp",
168
188
"api_key": "your-gemini-key",
169
189
"stealth": true
170
190
}
171
191
```
172
192

### Using Azure OpenAI

```json
{
  "input": "Navigate to news.ycombinator.com and summarize the top 5 stories",
  "provider": "azure_openai",
  "model": "gpt-4o",
  "api_key": "your-azure-openai-key",
  "headless": true
}
```

### Using Groq

```json
{
  "input": "Search for 'climate change' on Wikipedia and extract the first paragraph",
  "provider": "groq",
  "model": "llama-3.3-70b-versatile",
  "api_key": "your-groq-key",
  "reasoning": true
}
```
### Using Ollama (Local)

```json
{
  "input": "Go to example.com and take a screenshot of the homepage",
  "provider": "ollama",
  "model": "llama3.2",
  "api_key": "not-required-for-ollama",
  "headless": false
}
```
## Available Commands
This project uses [just](https://just.systems) as a task runner. All commands are defined in the `justfile`.
@@ -215,7 +271,7 @@ The deployment process:
- **`src/app.py`**: Main Kernel app with the `browser-agent` action. Creates browsers via Kernel, instantiates the Agent with a custom session, runs tasks, and returns trajectory results.
- **`src/lib/browser/session.py`**: CustomBrowserSession that extends browser-use's BrowserSession, fixing viewport handling for CDP connections and setting a fixed 1024x786 resolution.
- **`src/lib/browser/models.py`**: BrowserAgentRequest model handling LLM provider abstraction (anthropic, gemini, openai, azure_openai, groq, ollama) with AI gateway integration.
- **`src/lib/gateway.py`**: AI gateway configuration from environment variables.