
Environment-owned reward #6238

Open
qgallouedec wants to merge 9 commits into main from environment-owned-reward
@qgallouedec qgallouedec commented Jul 1, 2026


Implements the RFC in #5912: lets the environment own the reward. If an environment_factory environment defines a reserved get_reward() method (no args → float), it is called once per completed rollout and added as a reward source. reward_funcs therefore becomes optional whenever the environment defines get_reward().

Important

This adds no new capability. It's an ergonomics change. The same reward was always expressible through reward_funcs; this just lets the environment own it directly, which is the natural formulation for stateful environments.

This matters for stateful / multi-turn environments, where the reward is a function of the environment's internal state (was the word guessed? did the game end in a win?). Today that state has to be leaked back out to a trainer-owned reward_func; now the environment scores itself.

Before

The reward lives in the trainer and reaches into the environment to recompute what it already knows:

class WordleEnv:
    # Reserved methods (not exposed as tools)
    def reset(self, **kwargs):
        self._target = sample(words)
        self._solved = False

    # Public methods (exposed as tools)
    def guess(self, word: str) -> str:   # exposed as a tool
        self._solved = word == self._target
        ...

def solved_reward(environments, **kwargs):
    return [1.0 if env._solved else 0.0 for env in environments]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=solved_reward,          # trainer owns the reward
    train_dataset=dataset,
    environment_factory=WordleEnv,
)

After

The environment scores the episode it just ran:

class WordleEnv:
    # Reserved methods (not exposed as tools)
    def reset(self, **kwargs):  # required
        self._target = sample(words)
        self._solved = False

    def get_reward(self) -> float:  # optional
        return 1.0 if self._solved else 0.0

    # Public methods (exposed as tools)
    def guess(self, word: str) -> str:   # exposed as a tool
        self._solved = word == self._target
        ...

trainer = GRPOTrainer(
    model=model,
    train_dataset=dataset,
    environment_factory=WordleEnv,       # owns the reward; reward_funcs becomes optional
)

Details

  • get_reward is reserved (like reset): it is not exposed to the model as a tool.
  • Logged under the environment's class name (rewards/WordleEnv/mean), always weight 1: the environment owns its scale.
  • Combines with reward_funcs: all sources are summed; reward_weights applies to reward_funcs only.
  • Raises at init if no reward source is provided (no reward_funcs and no get_reward).
  • Implemented for GRPOTrainer, DPPOTrainer, GRPOWithReplayBufferTrainer, and AsyncGRPOTrainer. Docs + tests included.
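
The combination rule in the bullets above can be sketched in isolation. This is a minimal illustration, not the trainer's actual code: `combine_rewards` and its argument layout are hypothetical names chosen for the example, and it assumes one list of per-rollout rewards per reward function.

```python
from typing import Optional, Sequence


def combine_rewards(
    func_rewards: Sequence[Sequence[float]],
    env_rewards: Optional[Sequence[float]] = None,
    reward_weights: Optional[Sequence[float]] = None,
) -> list[float]:
    """Hypothetical helper: sum every reward source per rollout.

    func_rewards[i][j] is the reward from the i-th reward function for rollout j.
    env_rewards[j] is the value returned by get_reward() for rollout j.
    """
    if not func_rewards and env_rewards is None:
        raise ValueError("No reward source: pass reward_funcs or define get_reward().")
    if reward_weights is None:
        reward_weights = [1.0] * len(func_rewards)
    if len(reward_weights) != len(func_rewards):
        raise ValueError("reward_weights must have one entry per reward function.")

    num_rollouts = len(env_rewards) if env_rewards is not None else len(func_rewards[0])
    totals = [0.0] * num_rollouts

    # reward_weights applies to the trainer-owned reward functions only.
    for weight, rewards in zip(reward_weights, func_rewards):
        for j, reward in enumerate(rewards):
            totals[j] += weight * reward

    # The environment reward is always added with weight 1.
    if env_rewards is not None:
        for j, reward in enumerate(env_rewards):
            totals[j] += reward
    return totals
```

With one reward function weighted 2.0 and an environment reward of 0.5 per rollout, a rollout scored 1.0 by the function totals 2.5 and a rollout scored 0.0 totals 0.5.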

AI writing disclosure

  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.

Note

Medium Risk
Changes core RL reward plumbing and makes reward_funcs optional: misconfigured trainers now error at init, so existing scripts without any reward source could break; the metric key rename (reward/... to rewards/...) may affect dashboards.

Overview
Adds environment-owned rewards for agent training: when an environment_factory class defines optional get_reward() -> float, the trainer calls it once per completed rollout and treats it as an extra reward source (weight 1, logged as rewards/{EnvClassName}/mean|std). reward_funcs is now optional if the environment supplies get_reward; trainer-owned and env-owned rewards sum together, with reward_weights applying only to reward_funcs.

get_reward is reserved like reset—not exposed to the model as a tool—and environment method discovery excludes both reset and get_reward when building tool lists.
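
A rough sketch of what such discovery could look like, assuming public bound methods are the tool candidates. `RESERVED_METHODS` and `discover_tool_names` are hypothetical names for illustration; the trainer's real discovery logic may differ.

```python
import inspect

# Assumption: the reserved names are the two mentioned in this PR.
RESERVED_METHODS = {"reset", "get_reward"}


def discover_tool_names(environment) -> list[str]:
    """Hypothetical helper: list public methods that would be exposed as tools."""
    names = []
    for name, _member in inspect.getmembers(environment, predicate=inspect.ismethod):
        # Skip private helpers and the reserved lifecycle/reward hooks.
        if name.startswith("_") or name in RESERVED_METHODS:
            continue
        names.append(name)
    return names


class WordleEnv:
    def reset(self, **kwargs):
        self._target = "crane"  # fixed word, for illustration only
        self._solved = False

    def get_reward(self) -> float:
        return 1.0 if self._solved else 0.0

    def guess(self, word: str) -> str:
        self._solved = word == self._target
        return "correct" if self._solved else "try again"


tool_names = discover_tool_names(WordleEnv())
```

Here only `guess` survives the filter, so it is the single method the model would see as a tool.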

GRPOTrainer registers an internal reward wrapper at init and fails fast with ValueError when neither reward_funcs nor env get_reward is provided. AsyncGRPOTrainer / AsyncRolloutWorker capture per-rollout env_rewards at generation time and append them during scoring. Docstrings for DPPO (and related trainers) align with the same contract; GRPO/RLOO docs rename per-function metrics from reward/... to rewards/... and expand the agent-training guide (tools vs environments, combined rewards).
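
The fail-fast check can be pictured roughly as follows. This is a sketch under assumptions, not the PR's implementation: `resolve_reward_sources` is a hypothetical name, and it assumes `environment_factory` is a class whose `get_reward` attribute can be inspected without instantiating it.

```python
def resolve_reward_sources(reward_funcs, environment_factory):
    """Hypothetical init-time check: normalise reward_funcs and detect get_reward."""
    if reward_funcs is None:
        funcs = []
    elif callable(reward_funcs):
        funcs = [reward_funcs]  # a single function is wrapped in a list
    else:
        funcs = list(reward_funcs)

    # Assumption: the factory is a class, so get_reward is visible on it directly.
    env_owns_reward = environment_factory is not None and callable(
        getattr(environment_factory, "get_reward", None)
    )

    if not funcs and not env_owns_reward:
        raise ValueError(
            "No reward source: pass reward_funcs or define get_reward() on the environment."
        )
    return funcs, env_owns_reward


class EnvWithReward:
    def reset(self, **kwargs):
        pass

    def get_reward(self) -> float:
        return 0.0


class EnvWithoutReward:
    def reset(self, **kwargs):
        pass
```

`resolve_reward_sources(None, EnvWithReward)` passes with an empty function list, while `resolve_reward_sources(None, EnvWithoutReward)` raises `ValueError`.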

Tests cover env-only reward, coexistence with reward_funcs, and the no-reward-source error.

Reviewed by Cursor Bugbot for commit 218ff6f. Bugbot is set up for automated code reviews on this repo.

bot-ci-comment Bot commented Jul 1, 2026

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

tools = tools or []
self._standalone_tools = tools # tools that are not bound to the environment
self.environment_factory = environment_factory
self._env_owns_reward = False

Small nit: _env_owns_reward suggests an environment that handles all of the reward computation, but we can still pass in separate reward functions?

# The environment owns the reward: score it now, while this rollout's environment still holds its
# final state (it is reset only when drawn again for the next rollout).
if self._env_owns_reward:
    group.env_rewards.append(environment.get_reward())

Looking at this, we might need to support async rewards as well (like LLM-as-judge), so we probably need an async version of get_reward, something like get_reward_async.

We are also in the main generation loop here, so if a user passes a sync reward function that takes time, it will halt the in-flight requests loop.
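
To make the blocking concern concrete, here is one possible mitigation, sketched as an assumption rather than something this PR implements: run a slow synchronous get_reward in a worker thread with `asyncio.to_thread` so the event loop keeps servicing other work. `SlowEnv`, `score_without_blocking`, and the heartbeat task are hypothetical and exist only for the demonstration.

```python
import asyncio
import time


class SlowEnv:
    def get_reward(self) -> float:
        time.sleep(0.05)  # stands in for a slow scorer, e.g. a judge-model call
        return 1.0


async def score_without_blocking(environment) -> float:
    # Offload the synchronous call to a thread so the event loop stays responsive.
    return await asyncio.to_thread(environment.get_reward)


async def main() -> tuple[float, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Stands in for the in-flight requests loop: it should keep ticking while scoring runs.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    task = asyncio.create_task(heartbeat())
    reward = await score_without_blocking(SlowEnv())
    task.cancel()
    return reward, ticks


reward, ticks = asyncio.run(main())
```

Calling `environment.get_reward()` directly inside `main` would instead freeze the heartbeat for the full duration of the sleep, which is the stall described above.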

@AmineDiro AmineDiro left a comment

That's a very nice PR that will clean up the API a lot!

@sergiopaniego sergiopaniego left a comment

This is great, thanks!
Some parts that need to be revisited:

  • OpenEnv guide
  • Example scripts/notebooks using OpenEnv

GRPO supports **agent training**: the model calls tools during generation and learns from the outcome.

- A **tool** is a plain Python function (sync or async) exposed to the model. Use `tools` for stateless calls (a calculator, a web search).
- An **environment** is the more general form: a stateful object built fresh per rollout, whose public methods are exposed as tools, plus a `reset` lifecycle hook and an optional `get_reward` that lets it own the reward. Use `environment_factory` when you need per-rollout state, a reset hook, or environment-owned reward.

Not sure whether the get_reward ideas here could collide with how rewards are retrieved from OpenReward/Harbor @adithya-s-k
