Commit b453d63

Add web browser example to doc (#439)
1 parent cedf63c commit b453d63

2 files changed: +220 -0 lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,8 @@
     title: Master you knowledge base with agentic RAG
   - local: examples/multiagents
     title: Orchestrate a multi-agent system
+  - local: examples/web_browser
+    title: Build a web browser agent using vision models
 - title: Reference
   sections:
   - local: reference/agents
Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
# Web Browser Automation with Agents 🤖🌐

[[open-in-colab]]

In this notebook, we'll create an **agent-powered web browser automation system**! This system can navigate websites, interact with elements, and extract information automatically.

The agent will be able to:

- ✅ Navigate to web pages
- ✅ Click on elements
- ✅ Search within pages
- ✅ Handle popups and modals
- ✅ Take screenshots
- ✅ Extract information

Let's set up this system step by step.

First, run this command to install the required dependencies:

```bash
pip install smolagents selenium helium pillow python-dotenv -q
```

Let's import the required libraries and load environment variables (`load_dotenv()` reads them from a local `.env` file, if one is present):

```python
from io import BytesIO
from time import sleep

import helium
from dotenv import load_dotenv
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# HfApiModel is needed later to create the model
from smolagents import CodeAgent, HfApiModel, tool
from smolagents.agents import ActionStep

# Load environment variables
load_dotenv()
```

Now let's create the core browser interaction tools that allow our agent to navigate and interact with web pages. They rely on the global `driver` that is created when the browser is started:

```python
@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    # `text` is interpolated into the XPath as-is, so it must not contain a single quote
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result < 1 or nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'. "
    elem = elements[nth_result - 1]  # nth_result is 1-based
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result


@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()


@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows!
    This does not work on cookie consent banners.
    """
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    return "Sent the Escape key to dismiss any visible pop-up."
```

77+
Let's set up our browser with Chrome and configure screenshot capabilities:
78+
79+
```python
80+
# Configure Chrome options
81+
chrome_options = webdriver.ChromeOptions()
82+
chrome_options.add_argument("--force-device-scale-factor=1")
83+
chrome_options.add_argument("--window-size=1000,1350")
84+
chrome_options.add_argument("--disable-pdf-viewer")
85+
chrome_options.add_argument("--window-position=0,0")
86+
87+
# Initialize the browser
88+
driver = helium.start_chrome(headless=False, options=chrome_options)
89+
90+
# Set up screenshot callback
91+
def save_screenshot(memory_step: ActionStep, agent: CodeAgent) -> None:
92+
sleep(1.0) # Let JavaScript animations happen before taking the screenshot
93+
driver = helium.get_driver()
94+
current_step = memory_step.step_number
95+
if driver is not None:
96+
for previous_memory_step in agent.memory.steps: # Remove previous screenshots for lean processing
97+
if isinstance(previous_memory_step, ActionStep) and previous_memory_step.step_number <= current_step - 2:
98+
previous_memory_step.observations_images = None
99+
png_bytes = driver.get_screenshot_as_png()
100+
image = Image.open(BytesIO(png_bytes))
101+
print(f"Captured a browser screenshot: {image.size} pixels")
102+
memory_step.observations_images = [image.copy()] # Create a copy to ensure it persists
103+
104+
# Update observations with current URL
105+
url_info = f"Current url: {driver.current_url}"
106+
memory_step.observations = (
107+
url_info if memory_step.observations is None else memory_step.observations + "\n" + url_info
108+
)
109+
```
110+
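
To make the pruning rule in the callback concrete, here is a minimal sketch that applies the same `step_number <= current_step - 2` test to plain Python objects. `FakeStep` and `prune_old_screenshots` are hypothetical stand-ins introduced only for illustration; the real callback works on smolagents `ActionStep` objects and PIL images.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeStep:
    """Hypothetical stand-in for an ActionStep, with only the fields used here."""
    step_number: int
    observations_images: Optional[List[str]] = None


def prune_old_screenshots(steps: List[FakeStep], current_step: int) -> None:
    """Drop screenshots from every step at or before `current_step - 2`."""
    for step in steps:
        if step.step_number <= current_step - 2:
            step.observations_images = None


# Steps 1-3 already hold a screenshot; step 4 is the one being processed.
steps = [FakeStep(n, [f"screenshot_{n}.png"]) for n in range(1, 4)]
steps.append(FakeStep(4))
prune_old_screenshots(steps, current_step=4)
print([(s.step_number, s.observations_images) for s in steps])
```

After pruning, only step 3 still carries its screenshot. The callback then attaches a new one to step 4, so the model sees at most the previous and the current screenshot.
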
Now let's create our web automation agent:

```python
# Initialize the model
# You can change this to your preferred model; one that accepts image inputs is needed to make use of the screenshots
model_id = "meta-llama/Llama-3.3-70B-Instruct"
model = HfApiModel(model_id)

# Create the agent
agent = CodeAgent(
    tools=[go_back, close_popups, search_item_ctrl_f],
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],
    max_steps=20,
    verbosity_level=2,
)

# Import helium for the agent
agent.python_executor("from helium import *", agent.state)
```

The agent needs instructions on how to use Helium for web automation. Here are the instructions we'll provide:

```python
helium_instructions = """
You can use helium to access websites. Don't worry about the helium driver, it's already managed.
We've already run "from helium import *".
Then you can go to pages!
Code:
```py
go_to('github.com/trending')
```<end_code>

You can directly click clickable elements by inputting the text that appears on them.
Code:
```py
click("Top products")
```<end_code>

If it's a link:
Code:
```py
click(Link("Top products"))
```<end_code>

If you try to interact with an element and it's not found, you'll get a LookupError.
In general, stop your action after each button click to see what happens on your screenshot.
Never try to log in to a page.

To scroll up or down, use scroll_down or scroll_up, passing the number of pixels to scroll as the argument.
Code:
```py
scroll_down(num_pixels=1200) # This will scroll one viewport down
```<end_code>

When you have pop-ups with a cross icon to close, don't try to click the close icon by finding its element or targeting an 'X' element (this most often fails).
Just use your built-in tool `close_popups` to close them:
Code:
```py
close_popups()
```<end_code>

You can use .exists() to check for the existence of an element. For example:
Code:
```py
if Text('Accept cookies?').exists():
    click('I accept')
```<end_code>
"""
```

Now we can run our agent with a task! Let's try finding information on Wikipedia:

```python
search_request = """
Please navigate to https://en.wikipedia.org/wiki/Chicago and give me a sentence containing the word "1992" that mentions a construction accident.
"""

agent_output = agent.run(search_request + helium_instructions)
print("Final output:")
print(agent_output)
```

You can run different tasks by modifying the request. For example, here's one that tells me whether I should work harder:

```python
github_request = """
I'm trying to find how hard I have to work to get a repo on github.com/trending.
Can you navigate to the profile for the top author of the top trending repo, and give me their total number of commits over the last year?
"""

agent_output = agent.run(github_request + helium_instructions)
print("Final output:")
print(agent_output)
```

The system is well suited to tasks like:
- Data extraction from websites
- Web research automation
- UI testing and verification
- Content monitoring

Best Practices:
1. Always provide clear, specific instructions
2. Use the screenshot callback for debugging
3. Handle errors gracefully
4. Clean up old screenshots to manage memory
5. Set reasonable step limits for your tasks
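
As one possible reading of "handle errors gracefully", the sketch below wraps a task in `try`/`except`/`finally` so that a failure is reported and cleanup still runs. `run_with_cleanup` is a hypothetical helper, not part of smolagents or Helium, and the task and cleanup used here are stand-ins so the example runs without a browser.

```python
from typing import Callable, Optional


def run_with_cleanup(
    run_task: Callable[[], str],
    cleanup: Callable[[], None],
) -> Optional[str]:
    """Run a task, report failures instead of crashing, and always clean up."""
    try:
        return run_task()
    except Exception as error:  # Broad on purpose: any failure should be reported
        print(f"Task failed: {type(error).__name__}: {error}")
        return None
    finally:
        cleanup()  # Runs whether the task succeeded or failed


# Stand-ins for a real agent run and browser shutdown.
events = []


def failing_task() -> str:
    raise LookupError("element not found")


result = run_with_cleanup(failing_task, lambda: events.append("browser closed"))
print(result, events)
```

In the notebook, the task could be something like `lambda: agent.run(search_request + helium_instructions)` and the cleanup could be Helium's `kill_browser`, assuming the browser should be closed once the task ends.
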
