Environment-owned reward by qgallouedec · Pull Request #6238 · huggingface/trl

qgallouedec · 2026-07-01T23:50:08Z

Implements the RFC in #5912: lets the environment own the reward. If the environment built by environment_factory defines a reserved get_reward() method (no args → float), it is called once per completed rollout and added as a reward source. As a result, reward_funcs becomes optional.

Important

This adds no new capability. It's an ergonomics change. The same reward was always expressible through reward_funcs; this just lets the environment own it directly, which is the natural formulation for stateful environments.

This matters for stateful / multi-turn environments, where the reward is a function of the environment's internal state (was the word guessed? did the game end in a win?). Today that state has to be leaked back out to a trainer-owned reward_func; now the environment scores itself.

Before

The reward lives in the trainer and reaches into the environment to recompute what it already knows:

class WordleEnv:
    # Reserved methods (not exposed as tools)
    def reset(self, **kwargs):
        self._target = sample(words)
        self._solved = False

    # Public methods (exposed as tools)
    def guess(self, word: str) -> str:   # exposed as a tool
        self._solved = word == self._target
        ...

def solved_reward(environments, **kwargs):
    return [1.0 if env._solved else 0.0 for env in environments]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=solved_reward,          # trainer owns the reward
    train_dataset=dataset,
    environment_factory=WordleEnv,
)

After

The environment scores the episode it just ran:

class WordleEnv:
    # Reserved methods (not exposed as tools)
    def reset(self, **kwargs):  # required
        self._target = sample(words)
        self._solved = False

    def get_reward(self) -> float:  # optional
        return 1.0 if self._solved else 0.0

    # Public methods (exposed as tools)
    def guess(self, word: str) -> str:   # exposed as a tool
        self._solved = word == self._target
        ...

trainer = GRPOTrainer(
    model=model,
    train_dataset=dataset,
    environment_factory=WordleEnv,       # owns the reward; reward_funcs becomes optional
)
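
The lifecycle implied by "called once per completed rollout" can be sketched without the trainer. This is only an illustration under stated assumptions: `CounterEnv` and `run_rollout` are hypothetical names, and the loop of direct method calls stands in for tool calls that the model would issue during generation.

```python
class CounterEnv:
    """Toy environment following the reset / tools / get_reward layout."""

    def reset(self, **kwargs):
        # Reserved: start a fresh episode.
        self._target = kwargs.get("target", 3)
        self._count = 0

    def increment(self) -> str:
        # Public method: would be exposed to the model as a tool.
        self._count += 1
        return f"count={self._count}"

    def get_reward(self) -> float:
        # Reserved: score the episode from internal state.
        return 1.0 if self._count == self._target else 0.0


def run_rollout(environment, num_tool_calls, **reset_kwargs):
    """Hypothetical driver: reset, run the episode, then score it once."""
    environment.reset(**reset_kwargs)
    for _ in range(num_tool_calls):
        environment.increment()  # stands in for a model-issued tool call
    return environment.get_reward()  # read once, after the rollout completes


rollout_rewards = [run_rollout(CounterEnv(), n, target=3) for n in (2, 3)]
```

The reward is read only after the episode ends, so the environment never has to expose `_count` to an external reward function.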

Details

get_reward is reserved (like reset): it is not exposed to the model as a tool.
Logged under the environment's class name (rewards/WordleEnv/mean), always weight 1: the environment owns its scale.
Combines with reward_funcs: all sources are summed; reward_weights applies to reward_funcs only.
Raises at init if no reward source is provided (no reward_funcs and no get_reward).
Implemented for GRPOTrainer, DPPOTrainer, GRPOWithReplayBufferTrainer, and AsyncGRPOTrainer. Docs + tests included.
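
The combination rule described in the list above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: `combine_rewards` and its argument names are hypothetical, and it assumes the trainer-owned rewards are combined as a weighted sum while the environment reward is added with a fixed weight of 1.

```python
def combine_rewards(func_rewards, reward_weights, env_rewards=None):
    """Hypothetical helper: total reward per rollout.

    func_rewards: one list of per-rollout rewards for each reward function.
    reward_weights: one weight per reward function.
    env_rewards: per-rollout values returned by get_reward(), or None.
    """
    if len(func_rewards) != len(reward_weights):
        raise ValueError("Expected one weight per reward function.")
    if env_rewards is not None:
        num_rollouts = len(env_rewards)
    elif func_rewards:
        num_rollouts = len(func_rewards[0])
    else:
        num_rollouts = 0

    totals = [0.0] * num_rollouts
    # reward_weights applies to reward_funcs outputs only.
    for weight, rewards in zip(reward_weights, func_rewards):
        for i, reward in enumerate(rewards):
            totals[i] += weight * reward
    # The environment reward is added unscaled: the environment owns its scale.
    if env_rewards is not None:
        for i, reward in enumerate(env_rewards):
            totals[i] += reward
    return totals


# Two reward functions (weights 2.0 and 1.0) plus an environment reward, two rollouts.
totals = combine_rewards([[1.0, 0.0], [0.5, 0.5]], [2.0, 1.0], env_rewards=[1.0, 0.0])
```

For the first rollout this gives 2.0 * 1.0 + 1.0 * 0.5 + 1.0 = 3.5, and for the second 0.0 + 0.5 + 0.0 = 0.5.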

AI writing disclosure

AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.

Note

Medium Risk
Changes core RL reward plumbing and makes reward_funcs optional. Trainers configured with no reward source now raise at init, so existing scripts that pass no reward could break; the metric key rename (reward/ → rewards/) may affect dashboards.

Overview
Adds environment-owned rewards for agent training: when an environment_factory class defines optional get_reward() -> float, the trainer calls it once per completed rollout and treats it as an extra reward source (weight 1, logged as rewards/{EnvClassName}/mean|std). reward_funcs is now optional if the environment supplies get_reward; trainer-owned and env-owned rewards sum together, with reward_weights applying only to reward_funcs.

get_reward is reserved like reset—not exposed to the model as a tool—and environment method discovery excludes both reset and get_reward when building tool lists.
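
One way to picture the method discovery described above is a filter over an environment's bound methods. The sketch below is an assumption about the approach, not the PR's code: `discover_tools` and `RESERVED_METHODS` are hypothetical names, and skipping underscore-prefixed methods is an added assumption.

```python
import inspect

# Reserved lifecycle names, per the PR description.
RESERVED_METHODS = {"reset", "get_reward"}


def discover_tools(environment):
    """Hypothetical helper: return public, non-reserved bound methods."""
    tools = []
    for name, member in inspect.getmembers(environment, predicate=inspect.ismethod):
        if name.startswith("_") or name in RESERVED_METHODS:
            continue  # private or reserved: not exposed to the model
        tools.append(member)
    return tools


class ToyEnv:
    def reset(self, **kwargs):
        self._solved = False

    def get_reward(self) -> float:
        return 1.0 if self._solved else 0.0

    def guess(self, word: str) -> str:
        self._solved = word == "crane"
        return "correct" if self._solved else "try again"


tool_names = [tool.__name__ for tool in discover_tools(ToyEnv())]
```

Only `guess` survives the filter here, so it is the only method the model could call.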

GRPOTrainer registers an internal reward wrapper at init and fails fast with ValueError when neither reward_funcs nor env get_reward is provided. AsyncGRPOTrainer / AsyncRolloutWorker capture per-rollout env_rewards at generation time and append them during scoring. Docstrings for DPPO (and related trainers) align with the same contract; GRPO/RLOO docs rename per-function metrics from reward/... to rewards/... and expand the agent-training guide (tools vs environments, combined rewards).
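
The fail-fast check can be sketched as a small validation step. This is a hedged illustration only: `resolve_reward_sources` is a hypothetical function, and detecting environment-owned reward by looking for a callable `get_reward` attribute on the factory is an assumption that holds when the factory is the environment class itself.

```python
def resolve_reward_sources(reward_funcs=None, environment_factory=None):
    """Hypothetical check: normalize reward_funcs and detect get_reward."""
    if reward_funcs is None:
        funcs = []
    elif callable(reward_funcs):
        funcs = [reward_funcs]  # a single function is wrapped in a list
    else:
        funcs = list(reward_funcs)

    # Assumes environment_factory is a class; None yields False here.
    env_owns_reward = callable(getattr(environment_factory, "get_reward", None))

    if not funcs and not env_owns_reward:
        raise ValueError(
            "No reward source: pass reward_funcs or define get_reward() on the environment."
        )
    return funcs, env_owns_reward


class ScoredEnv:
    def reset(self, **kwargs):
        self._done = False

    def get_reward(self) -> float:
        return 1.0 if self._done else 0.0


sources = resolve_reward_sources(environment_factory=ScoredEnv)
```

Calling `resolve_reward_sources()` with no arguments raises `ValueError`, which mirrors the init-time error described above.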

Tests cover env-only reward, coexistence with reward_funcs, and the no-reward-source error.

Reviewed by Cursor Bugbot for commit 218ff6f. Bugbot is set up for automated code reviews on this repo.

bot-ci-comment · 2026-07-01T23:53:04Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

AmineDiro · 2026-07-02T08:06:46Z

        tools = tools or []
        self._standalone_tools = tools  # tools that are not bound to the environment
        self.environment_factory = environment_factory
+        self._env_owns_reward = False


Small nit: the name _env_owns_reward suggests an environment that handles all of the reward computation, but we can still pass in separate reward functions?

AmineDiro · 2026-07-02T08:11:28Z

+                    # The environment owns the reward: score it now, while this rollout's environment still holds its
+                    # final state (it is reset only when drawn again for the next rollout).
+                    if self._env_owns_reward:
+                        group.env_rewards.append(environment.get_reward())


Looking at this, we might also need to support async rewards (such as LLM-as-judge), so we probably need an async version of get_reward, for example get_reward_async.

We are also in the main generation loop here, so if a user provides a sync reward function that takes time, it will halt the in-flight requests loop.
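
To make the concern concrete, here is one possible way to keep a slow synchronous get_reward from stalling an asyncio loop. It is a sketch under assumptions, not something this PR implements: `score_rollout` is a hypothetical helper, `get_reward_async` is only the name proposed in this comment, and the sleep stands in for a slow scorer.

```python
import asyncio
import time


class SlowEnv:
    def get_reward(self) -> float:
        time.sleep(0.05)  # stands in for a slow synchronous scorer
        return 1.0


async def score_rollout(environment) -> float:
    """Hypothetical helper: score without blocking the event loop."""
    # Prefer an async method if the environment defines one (proposed name).
    get_reward_async = getattr(environment, "get_reward_async", None)
    if get_reward_async is not None:
        return await get_reward_async()
    # Otherwise run the sync method in a worker thread (Python 3.9+).
    return await asyncio.to_thread(environment.get_reward)


async def main():
    # Several rollouts are scored concurrently instead of one after another.
    return await asyncio.gather(*(score_rollout(SlowEnv()) for _ in range(4)))


rewards = asyncio.run(main())
```

Offloading to a thread keeps other in-flight requests progressing, at the cost of requiring get_reward to be safe to call off the main thread.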

AmineDiro

That's a very nice PR that will clean up the API a lot!

sergiopaniego

This is great, thanks!
Some parts that need to be revisited:

OpenEnv guide
Example scripts/notebooks using OpenEnv

sergiopaniego · 2026-07-02T09:11:47Z

+GRPO supports **agent training**: the model calls tools during generation and learns from the outcome.
+
+- A **tool** is a plain Python function (sync or async) exposed to the model. Use `tools` for stateless calls (a calculator, a web search).
+- An **environment** is the more general form: a stateful object built fresh per rollout, whose public methods are exposed as tools, plus a `reset` lifecycle hook and an optional `get_reward` that lets it own the reward. Use `environment_factory` when you need per-rollout state, a reset hook, or environment-owned reward.


Not sure whether get_reward as introduced here could collide with how rewards are retrieved from OpenReward/Harbor. @adithya-s-k

Environment-owned reward

43c338a

qgallouedec added 6 commits July 2, 2026 00:22

doc

ad8ffbf

remove @patch.dict(os.environ, {"TRL_EXPERIMENTAL_SILENCE": "1"})

15a19d3

remove test

fcec0b5

rm comment

2d30d28

shorted comment

d6aed17

even more tight

2605de5

qgallouedec requested review from AmineDiro, adithya-s-k and sergiopaniego July 2, 2026 02:48

AmineDiro reviewed Jul 2, 2026


sergiopaniego reviewed Jul 2, 2026


qgallouedec added 2 commits July 2, 2026 10:36

Merge branch 'main' into environment-owned-reward

21f0ffb

Merge branch 'main' into environment-owned-reward

218ff6f

qgallouedec wants to merge 9 commits into main from environment-owned-reward


3 participants
