Commit a0a4774

Add example configs
1 parent 283800c commit a0a4774

10 files changed: +266 -1 lines changed

README.md

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,8 +18,46 @@
1818

1919
<hr />
2020

21+
## 👋 Overview
2122

23+
CodeClash is a benchmark for evaluating AI systems on goal-oriented software engineering.
2224

25+
Today's AI coding evaluations are largely *task*-oriented (e.g., HumanEval, SWE-bench).
26+
Models are given explicit instructions and evaluated on their ability to write correct implementations.
27+
28+
But software is fundamentally driven by goals ("improve user retention", "reduce costs", "increase revenue").
29+
To enable *goal*-oriented SWE evaluation of Language Models (LMs) as SWE-agents, we introduce CodeClash!
30+
31+
<p align="center">
32+
<img src="docs/assets/flowchart.jpg" style="width: 70%" />
33+
</p>
34+
35+
In CodeClash, 2+ LM agents compete in a code arena.
36+
Across a multi-round tournament, agents iteratively improve a codebase to achieve a high-level objective (e.g., accumulate resources or survive the longest).
37+
Each round consists of two phases:
38+
39+
* Edit phase: LM agents make whatever changes they want to their codebase.
40+
* Competition phase: The modified codebases are pitted against each other in the arena.
41+
42+
Critically, *LMs don't play the game directly*.
43+
Their code serves as their competitive proxy.
44+
The winner is the LM agent who wins the most rounds.
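
The round structure described above can be sketched in a few lines of Python. This is a minimal illustration, not CodeClash's actual implementation: `run_tournament`, `toy_edit`, and `toy_compete` are hypothetical names, and the toy arena (longest codebase wins) merely stands in for a real game.

```python
from collections import Counter


def run_tournament(players, rounds, edit, compete):
    """Run `rounds` rounds, each with an edit phase and a competition phase."""
    codebases = {name: "" for name in players}
    round_wins = Counter()
    for round_number in range(1, rounds + 1):
        # Edit phase: every agent changes its own codebase.
        for name in players:
            codebases[name] = edit(name, codebases[name], round_number)
        # Competition phase: the codebases, not the LMs, compete in the arena.
        round_wins[compete(codebases)] += 1
    # The tournament winner is the agent that won the most rounds.
    winner, _ = round_wins.most_common(1)[0]
    return winner, dict(round_wins)


def toy_edit(name, codebase, round_number):
    # Hypothetical stand-in for an LM agent: append one comment line per round.
    return codebase + f"# {name} round {round_number}\n"


def toy_compete(codebases):
    # Hypothetical stand-in for an arena: the longest codebase wins the round.
    return max(codebases, key=lambda name: len(codebases[name]))


if __name__ == "__main__":
    print(run_tournament(["alice", "bob"], 5, toy_edit, toy_compete))
```

Passing `edit` and `compete` in as callables keeps the loop independent of any particular agent or arena, which mirrors how the same tournament structure applies to every game.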
45+
46+
## 🏎️ Quick Start
47+
48+
To start, follow these steps to set up CodeClash and run a test battle:
49+
```bash
50+
$ git clone git@github.com:CodeClash-ai/CodeClash.git
51+
$ cd CodeClash
52+
$ pip install -e '.[dev]'
53+
$ python main.py configs/test/battlesnake.yaml
54+
```
55+
56+
Once this works, you should be set up to run a real tournament!
57+
To pit Claude Sonnet 4.5 against o3 in the BattleSnake arena, run:
58+
```bash
59+
$ python main.py configs/examples/
60+
```
2361

2462
## 💫 Contributions
2563
We're actively working on several follow-ups!
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
1+
tournament:
2+
rounds: 5
3+
game:
4+
name: BattleSnake
5+
sims_per_round: 1000
6+
args:
7+
width: 11
8+
height: 11
9+
browser: false
10+
players:
11+
- agent: mini
12+
name: claude-sonnet-4-5-20250929
13+
config:
14+
agent: !include mini/default.yaml
15+
model:
16+
model_name: '@anthropic/claude-sonnet-4-5-20250929'
17+
model_kwargs:
18+
temperature: 0.2
19+
max_tokens: 4096
20+
- agent: mini
21+
name: o3
22+
config:
23+
agent: !include mini/default.yaml
24+
model:
25+
model_name: '@openai/o3'
26+
prompts:
27+
game_description: |-
28+
You are a software developer ({{player_id}}) competing in a coding game called BattleSnake.
29+
Your bot (`main.py`) controls a snake on a grid-based board.
30+
Snakes collect food, avoid collisions, and try to outlast their opponents.
31+
32+
The game is played in 15 rounds. For every round, you (and your competitors) edit program code that controls your bot. This is round {{round}}.
33+
After you and your competitor finish editing your codebases, the game is run automatically.
34+
35+
Your task: improve the bot in `main.py`, located in {{working_dir}}.
36+
{{working_dir}} is your codebase, which contains both your bot and supporting assets.
37+
All of your commands will be executed in the {{working_dir}} directory (see notes below).
File renamed without changes.
Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
tournament:
2+
rounds: 5
3+
game:
4+
name: CoreWar
5+
sims_per_round: 1000
6+
args: {}
7+
players:
8+
- agent: mini
9+
name: claude-sonnet-4-5-20250929
10+
config:
11+
agent: !include mini/default.yaml
12+
model:
13+
model_name: '@anthropic/claude-sonnet-4-5-20250929'
14+
model_kwargs:
15+
temperature: 0.2
16+
max_tokens: 4096
17+
- agent: mini
18+
name: o3
19+
config:
20+
agent: !include mini/default.yaml
21+
model:
22+
model_name: '@openai/o3'
23+
prompts:
24+
game_description: |-
25+
You are a software developer ({{player_id}}) competing in a coding game called CoreWar.
26+
CoreWar is a programming battle where you write "warriors" in an assembly-like language called Redcode to compete within a virtual machine (MARS), aiming to eliminate your rivals by making their code self-terminate.
27+
Victory comes from crafting clever tactics—replicators, scanners, bombers—that exploit memory layout and instruction timing to control the core.
28+
29+
The game is played in 15 rounds. For every round, you (and your competitors) edit program code that controls your bot. This is round {{round}}.
30+
After you and your competitor finish editing your codebases, the game is run automatically.
31+
32+
Your task: improve the bot in `warrior.red`, located in {{working_dir}}.
33+
{{working_dir}} is your codebase, which contains both your bot and supporting assets.
34+
All of your commands will be executed in the {{working_dir}} directory (see notes below).
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
1+
tournament:
2+
rounds: 5
3+
game:
4+
name: Halite
5+
sims_per_round: 250
6+
args: {}
7+
players:
8+
- agent: mini
9+
name: claude-sonnet-4-5-20250929
10+
config:
11+
agent: !include mini/default.yaml
12+
model:
13+
model_name: '@anthropic/claude-sonnet-4-5-20250929'
14+
model_kwargs:
15+
temperature: 0.2
16+
max_tokens: 4096
17+
- agent: mini
18+
name: o3
19+
config:
20+
agent: !include mini/default.yaml
21+
model:
22+
model_name: '@openai/o3'
23+
prompts:
24+
game_description: |-
25+
You are a software developer ({{player_id}}) competing in a coding game called Halite.
26+
Halite is a multi-player turn-based strategy game where bots compete on a rectangular grid to capture territory and accumulate strength.
27+
Players control pieces that can move across the map to conquer neutral and enemy territory, with each cell providing production that increases the strength of pieces occupying it.
28+
The goal is to control the most territory by the end of the game through strategic expansion, consolidation of forces, and tactical combat decisions.
29+
30+
You have the choice of writing your Halite bot in one of four programming languages: C, C++, OCaml, or Rust.
31+
Example implementations can be found under the `airesources/` folder.
32+
Your submission should be stored in the `submission/` folder. This folder currently contains an example C bot, but feel free to use any of the supported languages.
33+
Please make sure your main file is named `main.<ext>`, where `<ext>` is the appropriate file extension for your chosen programming language.
34+
You may include additional files as needed, but please ensure:
35+
1. The `submission/` folder contains only files relevant to your bot.
36+
2. The `submission/` folder ONLY contains a single bot (no multiple bots in one submission).
37+
3. Your bot can be compiled. See `runGame.sh` under the corresponding `submission/<language>/` folder to see how we will compile and run your bot.
38+
39+
40+
The game is played in 15 rounds. For every round, you (and your competitors) edit program code that controls your bot. This is round {{round}}.
41+
After you and your competitor finish editing your codebases, the game is run automatically.
42+
43+
Your task: improve the bot in `submission`, located in {{working_dir}}.
44+
{{working_dir}} is your codebase, which contains both your bot and supporting assets.
45+
All of your commands will be executed in the {{working_dir}} directory (see notes below).
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
tournament:
2+
rounds: 5
3+
game:
4+
name: HuskyBench
5+
sims_per_round: 100
6+
args: {}
7+
players:
8+
- agent: mini
9+
name: claude-sonnet-4-5-20250929
10+
config:
11+
agent: !include mini/default.yaml
12+
model:
13+
model_name: '@anthropic/claude-sonnet-4-5-20250929'
14+
model_kwargs:
15+
temperature: 0.2
16+
max_tokens: 4096
17+
- agent: mini
18+
name: o3
19+
config:
20+
agent: !include mini/default.yaml
21+
model:
22+
model_name: '@openai/o3'
23+
prompts:
24+
game_description: |-
25+
You are a software developer ({{player_id}}) competing in a coding game called HuskyBench.
26+
In this game, you will write code to control a poker-playing bot, aiming to outsmart your opponents and win chips.
27+
Victory comes from crafting clever strategies—bluffing, reading opponents, and managing your chip stack effectively.
28+
Be mindful of your bot's efficiency - your code should complete a simulation within 10 seconds to avoid forfeiting the round.
29+
You can use run_game.sh to check if your bot runs in time.
30+
31+
The game is played in 15 rounds. For every round, you (and your competitors) edit program code that controls your bot. This is round {{round}}.
32+
After you and your competitor finish editing your codebases, the game is run automatically.
33+
34+
Your task: improve the bot in `client/player.py`, located in {{working_dir}}.
35+
{{working_dir}} is your codebase, which contains both your bot and supporting assets.
36+
All of your commands will be executed in the {{working_dir}} directory (see notes below).
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
tournament:
2+
rounds: 5
3+
game:
4+
name: RoboCode
5+
sims_per_round: 250
6+
args:
7+
nodisplay: true
8+
nosound: true
9+
record_ratio: 0.2
10+
players:
11+
- agent: mini
12+
name: claudesonnet4520250929
13+
config:
14+
agent: !include mini/default.yaml
15+
model:
16+
model_name: '@anthropic/claude-sonnet-4-5-20250929'
17+
model_kwargs:
18+
temperature: 0.2
19+
max_tokens: 4096
20+
- agent: mini
21+
name: o3
22+
config:
23+
agent: !include mini/default.yaml
24+
model:
25+
model_name: '@openai/o3'
26+
prompts:
27+
game_description: |-
28+
You are a software developer ({{player_id}}) competing in a coding game called RoboCode.
29+
Robocode (Tank Royale) is a programming game where your code is the tank: each turn your bot sends intents—speed plus body/gun/radar turn rates and firepower—based on the game state it perceives via radar.
30+
Your program decides how to move, aim, and fire in a deterministic, turn-based arena to outlast other bots.
31+
Your bot logic must be written in Java and located in the `robots/custom/` directory.
32+
Keep the main bot class named `MyTank.java`, but you can include additional Java files if you'd like.
33+
34+
The game is played in 15 rounds. For every round, you (and your competitors) edit program code that controls your bot. This is round {{round}}.
35+
After you and your competitor finish editing your codebases, the game is run automatically.
36+
37+
Your task: improve the bot in `robots/custom/`, located in {{working_dir}}.
38+
{{working_dir}} is your codebase, which contains both your bot and supporting assets.
39+
All of your commands will be executed in the {{working_dir}} directory (see notes below).
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
tournament:
2+
rounds: 5
3+
game:
4+
name: RobotRumble
5+
sims_per_round: 250
6+
args:
7+
raw: true
8+
players:
9+
- agent: mini
10+
name: claude-sonnet-4-5-20250929
11+
config:
12+
agent: !include mini/default.yaml
13+
model:
14+
model_name: '@anthropic/claude-sonnet-4-5-20250929'
15+
model_kwargs:
16+
temperature: 0.2
17+
max_tokens: 4096
18+
- agent: mini
19+
name: o3
20+
config:
21+
agent: !include mini/default.yaml
22+
model:
23+
model_name: '@openai/o3'
24+
prompts:
25+
game_description: |-
26+
You are a software developer ({{player_id}}) competing in a coding game called RobotRumble.
27+
RobotRumble is a turn-based coding battle where you program a team of robots in Python to move, attack, and outmaneuver your opponent on a grid.
28+
Every decision is driven by your code, and victory comes from crafting logic that positions robots smartly, times attacks well, and adapts over the 100-turn match.
29+
NOTE: Please ensure that your code runs efficiently (under 60 seconds). Code that exceeds this run time will automatically forfeit the round.
30+
31+
The game is played in 15 rounds. For every round, you (and your competitors) edit program code that controls your bot. This is round {{round}}.
32+
After you and your competitor finish editing your codebases, the game is run automatically.
33+
34+
Your task: improve the bot in `robot.js`, located in {{working_dir}}.
35+
{{working_dir}} is your codebase, which contains both your bot and supporting assets.
36+
All of your commands will be executed in the {{working_dir}} directory (see notes below).
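
The `game_description` prompts in the example configs above contain `{{player_id}}`, `{{round}}`, and `{{working_dir}}` placeholders. How CodeClash fills them in is not shown in this commit; it may well use a template engine. As a rough, standard-library-only sketch of the idea, the hypothetical `render_prompt` helper below substitutes each placeholder and fails loudly when a value is missing. The values passed in the usage example are made up for illustration.

```python
import re

# Matches Jinja-style placeholders such as {{round}} or {{ working_dir }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, values):
    """Replace every {{name}} in `template` with str(values[name])."""

    def substitute(match):
        key = match.group(1)
        if key not in values:
            # Fail loudly rather than send a half-rendered prompt to an agent.
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return str(values[key])

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = (
        "You are a software developer ({{player_id}}). "
        "This is round {{round}}. Your codebase is in {{working_dir}}."
    )
    # All three values below are hypothetical.
    print(render_prompt(template, {
        "player_id": "o3",
        "round": 2,
        "working_dir": "/workspace",
    }))
```

Raising on an unknown placeholder is a deliberate choice in this sketch: a typo in a prompt template then surfaces before a tournament starts rather than partway through one.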

docs/assets/flowchart.jpg

691 KB

docs/reference/tournament/single_player.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ players:
3030
## Running a SinglePlayer Tournament
3131
3232
```bash
33-
python main_single_player.py configs/single_player/config.yaml
33+
python main_single_player.py configs/examples/battlesnake_single_player.yaml
3434
```
3535

3636
## Implementation
