# AgentArena

AI agent evaluation benchmarks for UI grounding, web navigation, and task automation.
AgentArena is a collection of reproducible benchmarks designed for evaluating AI agents on real-world UI tasks. Each benchmark provides static, deterministic environments that enable consistent evaluation across different models and approaches.
## UIArena v1: UI Element Grounding

The first benchmark in AgentArena focuses on UI element grounding: the ability of an AI agent to locate and identify interactive elements in web interfaces.
UIArena v1 uses an orthogonal design across three dimensions to enable precise diagnosis of agent capabilities:
| Dimension | Values | Purpose |
|---|---|---|
| Visual | styled / mincss | Tests reliance on CSS styling |
| DOM | clean / noisy | Tests robustness to DOM complexity |
| Accessibility | full / none | Tests reliance on ARIA attributes |
These dimensions combine into five variants:

| Code | Name | Visual | DOM | A11y | Difficulty |
|---|---|---|---|---|---|
| c0 | Baseline | styled | clean | full | Easy |
| c1 | No-Visual | mincss | clean | full | Medium |
| c2 | No-A11y | styled | clean | none | Medium |
| c3 | Noisy-DOM | styled | noisy | full | Medium |
| c4 | Hard | mincss | noisy | none | Hard |
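Because c1–c3 each flip exactly one dimension relative to the baseline, per-variant score deltas localize an agent's weaknesses. A minimal sketch of this diagnosis in TypeScript, using hypothetical accuracy numbers:

```ts
// Per-variant accuracy for one hypothetical agent (numbers are illustrative).
type Variant = "c0" | "c1" | "c2" | "c3" | "c4";

const accuracy: Record<Variant, number> = {
  c0: 0.92, c1: 0.71, c2: 0.85, c3: 0.66, c4: 0.41,
};

// Each delta isolates one dimension, since c1-c3 flip exactly one
// factor relative to the c0 baseline.
const diagnosis = {
  cssReliance: accuracy.c0 - accuracy.c1,  // styled -> mincss
  ariaReliance: accuracy.c0 - accuracy.c2, // full -> none
  domFragility: accuracy.c0 - accuracy.c3, // clean -> noisy
};

console.log(diagnosis);
```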
UIArena v1 includes 15 environments across 3 page types × 5 variants:
| Page | Variants | Description |
|---|---|---|
| Search | c0, c1, c2, c3, c4 | Search engine interface (Google-like) |
| E-commerce | c0, c1, c2, c3, c4 | Product listing with filters and cart |
| Dashboard | c0, c1, c2, c3, c4 | Email inbox with sidebar navigation |
Each environment is served at a stable path:

```
/v1/arena/{page}/{variant}
```

Examples:

```
/v1/arena/search/c0      # Search page, baseline variant
/v1/arena/ecommerce/c1   # E-commerce, no-visual variant
/v1/arena/dashboard/c4   # Dashboard, hard variant
```
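Since the page and variant sets are fixed, the full environment list can be generated programmatically. A small TypeScript helper (the base URL is an assumption; point it at wherever the static export is hosted):

```ts
// Build the full list of the 15 UIArena v1 environment URLs.
const BASE = "http://localhost:3000";
const pages = ["search", "ecommerce", "dashboard"] as const;
const variants = ["c0", "c1", "c2", "c3", "c4"] as const;

const environmentUrls = pages.flatMap((page) =>
  variants.map((variant) => `${BASE}/v1/arena/${page}/${variant}`)
);

console.log(environmentUrls.length); // 15
```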
## Use Cases

- Vision-Language Model Evaluation: Test VLMs on UI element detection and localization
- Web Agent Benchmarking: Evaluate agents that navigate and interact with web pages (see the sketch after this list)
- Grounding Research: Study how models ground natural language to UI elements
- Reproducible Comparisons: Static snapshots ensure consistent evaluation across runs
- Agent Development: Iterate on agent capabilities using orthogonal test dimensions
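As a concrete starting point for grounding evaluation, here is a minimal sketch using Playwright (an assumption; any browser driver works). The `queryAgent` function is a hypothetical stand-in for your model call, and the pass criterion is illustrative, not part of the benchmark:

```ts
import { chromium } from "playwright";

// Load one environment, ask the agent for a CSS selector for the task,
// and check that the selector resolves to exactly one visible element.
// `queryAgent` is a hypothetical stand-in for your model call.
async function evaluateGrounding(
  url: string,
  task: string,
  queryAgent: (task: string, html: string) => Promise<string>,
): Promise<boolean> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    await page.goto(url);
    const selector = await queryAgent(task, await page.content());
    const matches = page.locator(selector);
    return (await matches.count()) === 1 && (await matches.first().isVisible());
  } finally {
    await browser.close();
  }
}
```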
## Getting Started

Prerequisites:

- Node.js 18+
- npm or yarn
Install and run the development server:

```bash
git clone https://github.com/your-org/AgentArena.git
cd AgentArena
npm install
npm run dev
```

Visit http://localhost:3000/v1 to explore the benchmark.

To build the static export:

```bash
npm run build
```

The static site will be exported to the `/out` directory, ready for deployment on any static hosting service.
## Project Structure

```
AgentArena/
├── public/v1/arena/        # Static HTML environments
│   ├── search/
│   │   ├── c0/             # index.html, style.css, metadata.json
│   │   ├── c1/
│   │   ├── c2/
│   │   ├── c3/
│   │   └── c4/
│   ├── ecommerce/
│   │   └── c0..c4/
│   └── dashboard/
│       └── c0..c4/
├── src/app/
│   ├── v1/                 # Next.js wrapper pages
│   │   ├── page.tsx        # Benchmark homepage
│   │   └── arena/          # Environment wrapper pages
│   └── layout.tsx          # Root layout with metadata
└── next.config.ts          # Static export configuration
```
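For reference, static export in Next.js is typically enabled as shown below. This is a sketch; the repository's actual next.config.ts may set additional options:

```ts
// next.config.ts -- sketch of a static-export configuration.
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  output: "export", // emit a fully static site to /out on `next build`
};

export default nextConfig;
```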
## Environment Metadata

Each environment includes a `metadata.json` file:

```json
{
"name": "Search Page",
"variant": "c0",
"variantName": "baseline",
"difficulty": "easy",
"dimensions": {
"visual": "styled",
"dom": "clean",
"accessibility": "full"
},
"description": "Baseline search interface with full styling, clean semantic HTML, and complete ARIA accessibility.",
"tasks": [
"Locate the main search input field",
"Locate the primary search button",
"Locate the settings icon button"
]
}
```
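For scripting against the benchmark, a minimal metadata loader might look like the following (a sketch: the interface mirrors the fields above, and the path assumes the repository layout shown earlier):

```ts
import { readFile } from "node:fs/promises";
import path from "node:path";

// Shape of metadata.json, mirroring the fields shown above.
interface EnvMetadata {
  name: string;
  variant: string;
  variantName: string;
  difficulty: string;
  dimensions: { visual: string; dom: string; accessibility: string };
  description: string;
  tasks: string[];
}

// Path assumes the repository layout from the project structure section.
async function loadMetadata(page: string, variant: string): Promise<EnvMetadata> {
  const file = path.join("public", "v1", "arena", page, variant, "metadata.json");
  return JSON.parse(await readFile(file, "utf8")) as EnvMetadata;
}

// Example: print the grounding tasks for the baseline search environment.
loadMetadata("search", "c0").then((meta) => {
  for (const task of meta.tasks) console.log(`${meta.name} [${meta.variant}]: ${task}`);
});
```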
## Contributing

Contributions are welcome! Whether you want to:

- Add new environments or page types
- Improve existing variants
- Add evaluation scripts or metrics
- Report issues or suggest enhancements
Please open an issue or submit a pull request.
**Topics:** ai-agents · benchmark · ui-grounding · web-automation · llm-evaluation · computer-use · vision-language-models · web-agents · accessibility-testing · multimodal-ai